PHP tip: How to get a web page using the fopen wrappers

Technologies: PHP 4.0.4+ (the sample code needs PHP 5 for the stream context argument to file_get_contents())

PHP’s fopen wrappers enable the standard file functions to read web pages from a web server. A few additional calls are needed to set parameters for a web server request and to get the server’s HTTP response header. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at handling web page character encodings, extracting URLs, stripping away HTML syntax, punctuation, symbol characters, and numbers, and breaking a page down into a keyword list.

Code

The fopen wrappers are a standard feature from PHP 4.0.4 onwards. The wrappers extend the functionality of the file functions, such as fopen(), file(), and file_get_contents(), enabling them to access remote files on a web or FTP server.

The following sample code uses file_get_contents() to read a web page using a URL. Options for the file request are set with stream_context_create(), and the $http_response_header variable, which PHP creates in the local scope after the request, holds the web server’s response header.

Download: get_web_page_fopen.zip

/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
 * array containing the header fields and content.
 */
function get_web_page( $url )
{
    $options = array( 'http' => array(
        'user_agent'    => 'spider',    // who am i
        'max_redirects' => 10,          // stop after 10 redirects
        'timeout'       => 120,         // timeout on response
    ) );
    $context = stream_context_create( $options );
    $page    = @file_get_contents( $url, false, $context );

    $result  = array( );
    if ( $page !== false )              // strict test: an empty page is not a failure
        $result['content'] = $page;
    else if ( !isset( $http_response_header ) )
        return null;                    // Bad url, timeout

    // Save the header
    $result['header'] = $http_response_header;

    // Get the *last* HTTP status code
    $nLines = count( $http_response_header );
    for ( $i = $nLines-1; $i >= 0; $i-- )
    {
        $line = $http_response_header[$i];
        if ( strncasecmp( "HTTP", $line, 4 ) == 0 )
        {
            $response = explode( ' ', $line );
            $result['http_code'] = $response[1];
            break;
        }
    }

    return $result;
}

This sample function takes a URL argument and returns an associative array containing the web page header and content. The fopen wrappers automatically handle DNS lookups and page redirects.

The returned array contains:

"http_code" the page status code (e.g. "200" on success)
"header" the header as an array with one entry per header line
"content" the page content (e.g. HTML text, image bytes, etc.)

On success, "http_code" is 200, and "content" contains the web page.

On an error with a bad URL, unknown host, a timeout, or a redirect loop, a null is returned.

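On an error with the web site, such as a missing page or no permissions, "http_code" has a non-200 HTTP status code (see Wikipedia’s List of HTTP status codes). By default the HTTP wrapper makes file_get_contents() fail on such responses, so "content" is not set. It holds the site’s error message page only if the "ignore_errors" HTTP context option is set to true (PHP 5.2.10 and later).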

Example

Read a web page and check for errors:

$result = get_web_page( $url );
if ( $result === null )
    die( "error: bad url, timeout, redirect loop" );

if ( $result['http_code'] != 200 )
    die( "error: no page, no permissions, no service" );

$page = $result['content'];

Explanation

The fopen wrappers are a retrofit to the standard PHP file functions to enable them to get files managed by a web or FTP server. To maintain compatibility with previous versions of PHP, the retrofit made only minimal changes to the file functions. Unfortunately, the result is a somewhat confused set of calls to set up a file transfer, execute the transfer, and get the results. There are also many variations, depending upon which file reading functions you wish to use.

The fopen wrappers are available in PHP 4.0.4 and later. They must be enabled in the php.ini file by setting allow_url_fopen to TRUE. For most installations, this is the default.
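
Before relying on the wrappers, a script can check the setting at run time. The following is a minimal sketch; url_fopen_enabled() is a hypothetical helper name, not part of PHP.

```php
<?php
/**
 * Report whether the fopen wrappers are enabled for URLs.
 * url_fopen_enabled() is a hypothetical helper name, not part of PHP.
 */
function url_fopen_enabled()
{
    // ini_get() returns the setting as a string (or false if unknown);
    // FILTER_VALIDATE_BOOLEAN maps "1", "On", "true", and "yes" to true.
    return filter_var( ini_get( 'allow_url_fopen' ), FILTER_VALIDATE_BOOLEAN );
}

if ( !url_fopen_enabled() )
    echo "allow_url_fopen is off; URL reads will fail\n";
```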

Getting a web page using the fopen wrappers always includes these steps:

  1. Set up the request using stream_context_create() and optionally stream_context_set_option().
  2. Request the page using file functions such as file_get_contents(), file(), and fopen()/fread().
  3. Get the HTTP header using stream_get_meta_data() or $http_response_header.

stream_context_create() and stream_context_set_option()

The fopen wrappers use a “stream context” that contains options for a file request. The stream_context_create() function creates a new context, initializing it with an array of options. The stream_context_set_option() function can be used to set more options in the context before opening the file.

The stream context options available depend upon the “wrapper” used. Appendix O of the PHP manual has a list of standard wrappers, such as those for HTTP and FTP protocols. For getting a web page, use the “http” wrapper and its options.

The HTTP wrapper options used depend upon the task you want to do. For getting a simple web page, these are the essentials:

"user_agent" the application’s name, such as a web browser (if left empty, some web servers will reject the page request).
"max_redirects" the maximum number of redirects to follow.
"timeout" the maximum time, in seconds, the call will wait on a server response.

To set these options, create an associative array with one entry with a key of “http” and a value that is another associative array. Initialize that second array with the above keys and appropriate values. For example:

$options = array( 'http' => array(
        'user_agent'    => 'spider',        // who am i
        'max_redirects' => 10,              // stop after 10 redirects
        'timeout'       => 120,             // timeout on response
) );
$context = stream_context_create( $options );
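
Options can also be changed after the context exists. The sketch below assumes the same three options as above, lowers the timeout with stream_context_set_option(), and reads the options back with stream_context_get_options() to confirm what a request would use. The value 30 is an arbitrary choice for illustration.

```php
<?php
// Start with the same options as above.
$options = array( 'http' => array(
        'user_agent'    => 'spider',
        'max_redirects' => 10,
        'timeout'       => 120,
) );
$context = stream_context_create( $options );

// Change one option after the fact: wrapper name, option name, value.
stream_context_set_option( $context, 'http', 'timeout', 30 );

// Read the options back to confirm what a request would use.
$current = stream_context_get_options( $context );
echo $current['http']['timeout'], "\n";
```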

file_get_contents(), file(), or fopen()/fread()

Each file function that opens a file has an optional stream context argument. For instance, use the file_get_contents() function like this:

$text = file_get_contents( $url, false, $context );

Use the file() function like this:

$lines = file( $url, 0, $context );

Use fopen(), fread(), and fclose() like this:

$handle = fopen( $url, "r", false, $context );
$text = '';
if ( $handle !== false )    // fopen() returns FALSE on failure
{
    while ( !feof( $handle ) )
        $text .= fread( $handle, 8192 );
    fclose( $handle );
}

Or use fopen(), stream_get_contents(), and fclose() like this:

$handle = fopen( $url, "r", false, $context );
$text = stream_get_contents( $handle );
fclose( $handle ); 

There are more variations, depending upon the file functions you use.
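
The variations above all return the same bytes. The following sketch checks that, under one loud assumption: a temporary local file stands in for the remote URL so that it runs without network access. The "http" options in the context are simply unused for a local file.

```php
<?php
// Assumption: a temporary local file stands in for the remote URL.
$path = tempnam( sys_get_temp_dir(), 'fw' );
file_put_contents( $path, "line one\nline two\n" );

$context = stream_context_create( array( 'http' => array( 'timeout' => 120 ) ) );

// 1. Whole file as one string.
$a = file_get_contents( $path, false, $context );

// 2. Array of lines (newlines kept), joined back together.
$b = implode( '', file( $path, 0, $context ) );

// 3. Chunked reads from a handle.
$c = '';
$handle = fopen( $path, 'r', false, $context );
if ( $handle !== false ) {
    while ( !feof( $handle ) )
        $c .= fread( $handle, 8192 );
    fclose( $handle );
}

// 4. Handle drained in one call.
$d = '';
$handle = fopen( $path, 'r', false, $context );
if ( $handle !== false ) {
    $d = stream_get_contents( $handle );
    fclose( $handle );
}

unlink( $path );
var_dump( $a === $b && $b === $c && $c === $d );
```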

stream_get_meta_data() and $http_response_header

Opening a file sends a file request to the web server using the options in the stream context. The web server responds with a “response header” and (if there is no error) the content of the file. The file functions handle the file’s content, while the header is available separately. The header is essential: it tells you the HTTP status code indicating if an error occurred, and it provides the file’s content type, including its MIME type and character set encoding.

There are two ways to get the header:

  1. Call the stream_get_meta_data() function using a stream handle from fopen().
  2. Read the $http_response_header variable, which PHP creates in the local scope after the request.

The stream_get_meta_data() function is the preferred method, but its argument is a stream handle returned by fopen(). If you instead used the file() or file_get_contents() functions to open and read a file, neither function returns a stream handle. Instead, when using these file functions you must use the $http_response_header variable to get the server’s response header.

The stream_get_meta_data() function returns an associative array of status information. The "wrapper_data" field contains an array with one entry for each line in the header. This is the same data as in the $http_response_header. Here’s a typical header array:

Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Sun, 08 Jul 2007 02:36:18 GMT
[2] => Server: Apache
[3] => Vary: Accept-Encoding
[4] => Expires: Sun, 19 Nov 1978 05:00:00 GMT
[5] => Last-Modified: Sun, 08 Jul 2007 02:36:19 GMT
[6] => Cache-Control: no-store, no-cache, must-revalidate
[7] => Cache-Control: post-check=0, pre-check=0
[8] => Connection: close
[9] => Content-Type: text/html; charset=utf-8
[10] => Content-Language: en
)
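
For code that reads with fopen(), the "wrapper_data" field can be pulled out with a small helper. This is a minimal sketch: get_header_lines() is a hypothetical name, and php://memory stands in for a URL so the example runs offline. Since that is not an HTTP stream, the helper falls back to an empty array.

```php
<?php
/**
 * Return the response header lines for an open stream, or an empty array
 * when the stream has none. get_header_lines() is a hypothetical helper.
 */
function get_header_lines( $handle )
{
    $meta = stream_get_meta_data( $handle );
    // "wrapper_data" is only present for wrappers that attach data,
    // such as the HTTP wrapper.
    if ( isset( $meta['wrapper_data'] ) && is_array( $meta['wrapper_data'] ) )
        return $meta['wrapper_data'];
    return array( );
}

// Assumption: php://memory stands in for a URL so the sketch runs offline.
$handle = fopen( 'php://memory', 'r' );
$lines = get_header_lines( $handle );
fclose( $handle );
echo count( $lines ), " header lines\n";
```

With a handle from fopen( $url, "r", false, $context ), the same call would return the array shown above.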

The number of lines in the header, and their order, may vary. The formatting of the lines, however, is standardized and described in the HTTP specification’s Header Field Definitions (RFC 2616, section 14).
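
Because the line format is standardized, individual fields can be picked out with simple string functions. The sketch below finds the last Content-Type line and splits it into a MIME type and character set. parse_content_type() is a hypothetical helper, and its simple split does not cover every form the specification allows.

```php
<?php
/**
 * Find the last Content-Type header line and split it into MIME type and
 * character set. parse_content_type() is a hypothetical helper.
 */
function parse_content_type( $headerLines )
{
    $result = array( 'mime' => null, 'charset' => null );
    foreach ( $headerLines as $line ) {
        // Header names are case-insensitive; "Content-Type:" is 13 characters.
        if ( strncasecmp( $line, 'Content-Type:', 13 ) != 0 )
            continue;
        $parts  = explode( ';', substr( $line, 13 ) );
        $result = array( 'mime' => strtolower( trim( $parts[0] ) ), 'charset' => null );
        for ( $i = 1; $i < count( $parts ); $i++ ) {
            $pair = explode( '=', trim( $parts[$i] ), 2 );
            if ( count( $pair ) == 2 && strcasecmp( trim( $pair[0] ), 'charset' ) == 0 )
                $result['charset'] = trim( $pair[1], " \t\"" );
        }
    }
    return $result;
}

$header = array(
    'HTTP/1.1 200 OK',
    'Server: Apache',
    'Content-Type: text/html; charset=utf-8',
);
print_r( parse_content_type( $header ) );
```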

If the web page you asked for was redirected to another web page (and another and another), there will be multiple groups of header lines. Each group starts with a new “HTTP” line. For a redirect, that line carries a redirect status code such as 301 (moved permanently), 302 (found), or 307 (temporary redirect). The first group will be for the original address, the next group for the redirected-to address, and so on. The last group will be for the returned page.
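
The grouping can be recovered by starting a new group at each “HTTP” line. A minimal sketch follows; split_header_groups() is a hypothetical helper name, and the header lines are made up for illustration.

```php
<?php
/**
 * Split a flat list of header lines into one group per response.
 * split_header_groups() is a hypothetical helper name.
 */
function split_header_groups( $headerLines )
{
    $groups = array( );
    foreach ( $headerLines as $line ) {
        // A status line such as "HTTP/1.1 301 Moved Permanently" opens a group.
        if ( strncasecmp( $line, 'HTTP/', 5 ) == 0 || count( $groups ) == 0 )
            $groups[] = array( );
        $groups[count( $groups ) - 1][] = $line;
    }
    return $groups;
}

// Made-up header lines: one redirect followed by the returned page.
$header = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://example.com/new',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
);
$groups = split_header_groups( $header );
echo count( $groups ), "\n";                    // number of responses in the chain
echo $groups[count( $groups ) - 1][0], "\n";    // status line of the returned page
```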

Errors

There are two kinds of errors:

  • System errors from a malformed URL, bad protocol, unknown host, timeout, or redirect loop.
  • Site errors from an unknown web page, insufficient permissions, or a web server that is down for maintenance.

On a system error, the file functions return FALSE. There will be no HTTP header or content.

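On success, the file functions return a non-FALSE value (such as the text of a web page from file_get_contents()). On a site error, the HTTP wrapper by default treats the error status as a failure, so the file functions return FALSE even though the header is available; setting the “ignore_errors” HTTP context option to true (PHP 5.2.10 and later) makes them return the site’s error page instead. In both cases, the header’s last occurrence of a line starting with “HTTP” contains the server’s HTTP status code for the returned page (earlier “HTTP” lines, if any, are for redirected addresses):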

HTTP/1.1 200 OK  

Wikipedia has a List of HTTP status codes. On success, the status is 200. The code is 404 if the page is not found, 401 if authentication is required, 503 if the site is down, etc.
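
The two kinds of errors can be told apart from the file function’s return value and the header lines. The following is a minimal sketch; classify_result() and its return strings are hypothetical names chosen for illustration.

```php
<?php
/**
 * Classify the outcome of a request from the file function's return value
 * and the header lines. classify_result() and its return strings are
 * hypothetical names chosen for this sketch.
 */
function classify_result( $page, $headerLines )
{
    // No header at all: the request never got a response.
    if ( !is_array( $headerLines ) || count( $headerLines ) == 0 )
        return 'system error';

    // Scan backwards for the last status line.
    $code = 0;
    for ( $i = count( $headerLines ) - 1; $i >= 0; $i-- ) {
        if ( strncasecmp( $headerLines[$i], 'HTTP', 4 ) == 0 ) {
            $fields = explode( ' ', $headerLines[$i] );
            $code   = isset( $fields[1] ) ? (int) $fields[1] : 0;
            break;
        }
    }

    if ( $code == 200 && $page !== false )
        return 'success';
    return 'site error ' . $code;
}

echo classify_result( false, array( 'HTTP/1.1 404 Not Found' ) ), "\n";
```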

Alternatives

The fopen wrappers are an awkward retrofit to the PHP file functions to enable them to read remote files. There are two main alternatives:

  • CURL (Client URL) functions. The CURL library is a standard PHP extension designed to handle HTTP, FTP, and other protocol requests more efficiently and flexibly than with the fopen wrappers. It is included in most PHP distributions. See the PHP tip: How to get a web page using CURL for code to use CURL.
  • HTTP functions. The HTTP extension for PHP 5 and later provides classes and functions to handle HTTP requests. The extension is built atop CURL, but provides a nicer interface. The extension is not included in standard PHP distributions, but it can be added easily.

Comments

stream context params in file_get_contents()

Great article. I have been using file_get_contents() with fopen wrappers for a long time, but was completely unaware of the stream context parameters. Thanks!

However, it should be noted that while the fopen wrappers have been around since PHP 4.0.4, the stream context parameter to file_get_contents() is not available until PHP 5.x. I was so excited to try it that I didn't even see the note in the PHP reference. I spent some time trying to figure out why my file_get_contents() calls were failing even though the resource at the requested URL seemed to be delivered fine via other means like a web browser.

Anyway, thanks for great article. I will no doubt use the ideas here when my customer's hosting environment moves up to PHP 5.

Cheers!

thanks, i use fopen

I use fopen since my local server doesn't support Curl

thanks for the wrapper
