PHP’s fopen wrappers enable the standard file functions to read web pages from a web server. A few additional calls are needed to set parameters for a web server request and to get the server’s HTTP response header. This tip shows how.
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at handling web page character encodings, extracting URLs, stripping away HTML syntax, punctuation, symbol characters, and numbers, and breaking a page down into a keyword list.
Code
The fopen wrappers are a standard feature from PHP 4.0.4 onwards. The wrappers extend the functionality of the file functions, such as fopen(), file(), and file_get_contents(), enabling them to access remote files on a web or FTP server.
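The following sample code uses file_get_contents() to read a web page using a URL. Options for the file request are set with stream_context_create(), and the $http_response_header variable holds the web server’s response header. PHP creates this variable in the local scope of the function that makes the request; it is not a global. Passing a stream context to file_get_contents() requires PHP 5 or later.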
Download: get_web_page_fopen.zip
/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the header fields and content.
 */
function get_web_page( $url )
{
    $options = array( 'http' => array(
        'user_agent'    => 'spider',    // who am i
        'max_redirects' => 10,          // stop after 10 redirects
        'timeout'       => 120,         // timeout on response
    ) );
    $context = stream_context_create( $options );
    $page    = @file_get_contents( $url, false, $context );

    $result  = array( );
    if ( $page !== false )              // strict test: an empty page is not an error
        $result['content'] = $page;
    else if ( !isset( $http_response_header ) )
        return null;                    // Bad url, timeout

    // Save the header
    $result['header'] = $http_response_header;

    // Get the *last* HTTP status code
    $nLines = count( $http_response_header );
    for ( $i = $nLines-1; $i >= 0; $i-- )
    {
        $line = $http_response_header[$i];
        if ( strncasecmp( "HTTP", $line, 4 ) == 0 )
        {
            $response = explode( ' ', $line );
            $result['http_code'] = $response[1];
            break;
        }
    }

    return $result;
}
This sample function takes a URL argument and returns an associative array containing the web page header and content. The fopen wrappers automatically handle DNS lookups and page redirects.
The returned array contains:
- "http_code": the page status code (e.g. "200" on success)
- "header": the header as an array with one entry per header line
- "content": the page content (e.g. HTML text, image bytes, etc.)
On success, "http_code" is 200, and "content" contains the web page.
On an error with a bad URL, unknown host, a timeout, or a redirect loop, a null is returned.
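On an error with the web site, such as a missing page or no permissions, "http_code" has a non-200 HTTP status code (see Wikipedia’s List of HTTP status codes). In this case "content" is not set: by default the file functions return FALSE for an error status and discard the site’s error message page. The "ignore_errors" HTTP context option, available since PHP 5.2.10, keeps the error page instead.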
Example
Read a web page and check for errors:
$result = get_web_page( $url );
if ( $result == null )
... error: bad url, timeout, redirect loop ...
if ( $result['http_code'] != 200 )
... error: no page, no permissions, no service ...
$page = $result['content'];
Explanation
The fopen wrappers are a retrofit to the standard PHP file functions to enable them to get files managed by a web or FTP server. To maintain compatibility with previous versions of PHP, the retrofit made only minimal changes to the file functions. Unfortunately, the result is a somewhat confused set of calls to set up a file transfer, execute the transfer, and get the results. There are also many variations, depending upon which file reading functions you wish to use.
The fopen wrappers are available in PHP 4.0.4 and later. They must be enabled in the php.ini file by setting allow_url_fopen to TRUE. For most installations, this is the default.
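A script can check this setting at run time before attempting a remote read. The following is a minimal sketch of one way to do that; the helper name url_fopen_enabled() is hypothetical and not part of PHP.

```php
<?php
/**
 * Report whether the fopen wrappers may open URLs in this PHP installation.
 * url_fopen_enabled() is a hypothetical helper name used for illustration.
 */
function url_fopen_enabled()
{
    // ini_get() returns the setting as a string such as "1", "0", "On", or "".
    $value = ini_get( 'allow_url_fopen' );

    // FILTER_VALIDATE_BOOLEAN maps "1", "on", "true", and "yes" to true.
    return filter_var( $value, FILTER_VALIDATE_BOOLEAN );
}

if ( !url_fopen_enabled() )
    echo "allow_url_fopen is off; remote reads will fail\n";
```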
Getting a web page using the fopen wrappers always includes these steps:
- Set up the request using stream_context_create() and optionally stream_context_set_option().
- Request the page using file functions such as file_get_contents(), file(), and fopen()/fread().
- Get the HTTP header using stream_get_meta_data() or $http_response_header.
stream_context_create() and stream_context_set_option()
The fopen wrappers use a “stream context” that contains options for a file request. The stream_context_create() function creates a new context, initializing it with an array of options. The stream_context_set_option() function can be used to set more options in the context before opening the file.
The stream context options available depend upon the “wrapper” used. Appendix O of the PHP manual has a list of standard wrappers, such as those for HTTP and FTP protocols. For getting a web page, use the “http” wrapper and its options.
The HTTP wrapper options used depend upon the task you want to do. For getting a simple web page, these are the essentials:
- "user_agent": the application’s name, such as a web browser (if left empty, some web servers will reject the page request).
- "max_redirects": the maximum number of redirects to follow.
- "timeout": the maximum time, in seconds, the call will wait on a server response.
To set these options, create an associative array with one entry with a key of “http” and a value that is another associative array. Initialize that second array with the above keys and appropriate values. For example:
$options = array( 'http' => array(
'user_agent' => 'spider', // who am i
'max_redirects' => 10, // stop after 10 redirects
'timeout' => 120, // timeout on response
) );
$context = stream_context_create( $options );
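The stream_context_set_option() function mentioned earlier can adjust a context after it has been created. Here is a small sketch of that, using the same option names as the example above; the timeout value of 30 is an arbitrary choice for illustration.

```php
<?php
// Create a context with some initial options.
$context = stream_context_create( array(
    'http' => array(
        'user_agent'    => 'spider',
        'max_redirects' => 10,
    ),
) );

// Add or change one option: context, wrapper name, option name, value.
stream_context_set_option( $context, 'http', 'timeout', 30 );

// Read the options back to confirm what a request would use.
$current = stream_context_get_options( $context );
print_r( $current['http'] );
```

Reading the options back with stream_context_get_options() is a handy check when a context is built up in several places.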
file_get_contents(), file(), or fopen()/fread()
Each file function that opens a file has an optional stream context argument. For instance, use the file_get_contents() function like this:
$text = file_get_contents( $url, false, $context );
Use the file() function like this:
$lines = file( $url, 0, $context );
Use fopen(), fread(), and fclose() like this:
$handle = fopen( $url, "r", false, $context );
$text = '';
while ( !feof( $handle ) )
$text .= fread( $handle, 8192 );
fclose( $handle );
Or use fopen(), stream_get_contents(), and fclose() like this:
$handle = fopen( $url, "r", false, $context );
$text = stream_get_contents( $handle );
fclose( $handle );
There are more variations, depending upon the file functions you use.
stream_get_meta_data() and $http_response_header
Opening a file sends a file request to the web server using the options in the stream context. The web server responds with a “response header” and (if there is no error) the content of the file. The file functions handle the file’s content, while the header is available separately. The header is essential: it tells you the HTTP status code indicating if an error occurred, and it provides the file’s content type, including its MIME type and character set encoding.
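As a rough sketch of how the content type might be pulled out of the header lines, the function below scans for a Content-Type line and splits it into a MIME type and character set. The helper name get_content_type() is hypothetical, and the parsing is deliberately simple rather than a full implementation of the HTTP grammar.

```php
<?php
/**
 * Find the last Content-Type line in a response header array and split it
 * into a MIME type and an optional character set.
 * get_content_type() is a hypothetical helper name used for illustration.
 */
function get_content_type( array $headerLines )
{
    $result = array( 'mime' => null, 'charset' => null );

    foreach ( $headerLines as $line ) {
        // Header names are case-insensitive. Later lines override earlier
        // ones, so after redirects the values describe the final page.
        if ( stripos( $line, 'Content-Type:' ) !== 0 )
            continue;

        $value = trim( substr( $line, strlen( 'Content-Type:' ) ) );
        $parts = explode( ';', $value );
        $result['mime']    = strtolower( trim( $parts[0] ) );
        $result['charset'] = null;

        // Look for a "charset=..." parameter after the MIME type.
        if ( preg_match( '/charset\s*=\s*"?([^";\s]+)/i', $value, $m ) )
            $result['charset'] = strtolower( $m[1] );
    }
    return $result;
}

$type = get_content_type( array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
) );
// $type['mime'] is "text/html" and $type['charset'] is "utf-8"
```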
There are two ways to get the header:
- Call the stream_get_meta_data() function using a stream handle from fopen().
- Read the $http_response_header variable, which PHP creates in the local scope of the code that made the request.
The stream_get_meta_data() function is the preferred method, but its argument is a stream handle returned by fopen(). If you instead used the file() or file_get_contents() functions to open and read a file, neither function returns a stream handle. Instead, when using these file functions you must use the $http_response_header variable to get the server’s response header.
The stream_get_meta_data() function returns an associative array of status information. The "wrapper_data" field contains an array with one entry for each line in the header. This is the same data as in the $http_response_header. Here’s a typical header array:
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Sun, 08 Jul 2007 02:36:18 GMT
[2] => Server: Apache
[3] => Vary: Accept-Encoding
[4] => Expires: Sun, 19 Nov 1978 05:00:00 GMT
[5] => Last-Modified: Sun, 08 Jul 2007 02:36:19 GMT
[6] => Cache-Control: no-store, no-cache, must-revalidate
[7] => Cache-Control: post-check=0, pre-check=0
[8] => Connection: close
[9] => Content-Type: text/html; charset=utf-8
[10] => Content-Language: en
)
The number of lines in the header, and their order, may vary. The formatting of the lines, however, is standardized and described in HTTP specification’s Header Field Definitions.
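The fopen() route to the header can be sketched as follows. The helper name read_with_meta_data() is hypothetical, and the sketch assumes the header lines appear under "wrapper_data" as described above; for streams that carry no such data (a local file, for instance) it falls back to an empty array.

```php
<?php
/**
 * Open a URL with fopen() and return the response header lines along with
 * the content. Returns null if the stream cannot be opened.
 * read_with_meta_data() is a hypothetical helper name used for illustration.
 */
function read_with_meta_data( $url, $context )
{
    // fopen() returns false on system errors and, by default, on HTTP
    // error statuses such as 404.
    $handle = @fopen( $url, 'r', false, $context );
    if ( $handle === false )
        return null;

    // The header is available as soon as the stream is open.
    $meta   = stream_get_meta_data( $handle );
    $header = isset( $meta['wrapper_data'] ) ? $meta['wrapper_data'] : array( );

    $content = stream_get_contents( $handle );
    fclose( $handle );

    return array( 'header' => $header, 'content' => $content );
}
```

A call might look like read_with_meta_data( $url, $context ), with the context built by stream_context_create() as shown earlier.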
If the web page you asked for was redirected to another web page (and another and another), there will be multiple groups of header lines. Each group starts with a new “HTTP” line. The groups for redirects carry a 3xx status code, such as 301 (moved permanently), 302 (found), or 307 (temporary redirect). The first group will be for the original address, the next group for the redirected-to address, and so on. The last group will be for the returned page.
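To make the grouping concrete, here is a small sketch that splits a header array into one group per response. The helper name split_header_groups() is hypothetical, and the sample header lines are invented for illustration.

```php
<?php
/**
 * Split a response header array into one group per HTTP response.
 * Each group starts at a status line such as "HTTP/1.1 301 Moved Permanently".
 * split_header_groups() is a hypothetical helper name used for illustration.
 */
function split_header_groups( array $headerLines )
{
    $groups = array( );
    foreach ( $headerLines as $line ) {
        // A status line opens a new group.
        if ( strncasecmp( $line, 'HTTP/', 5 ) == 0 )
            $groups[] = array( );

        // Ignore any stray lines that appear before the first status line.
        if ( count( $groups ) > 0 )
            $groups[count( $groups ) - 1][] = $line;
    }
    return $groups;
}

// Invented sample data: one redirect followed by the final page.
$groups = split_header_groups( array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://example.com/new',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
) );

// $groups[0] holds the redirect's lines; the last group describes the page.
$final = $groups[count( $groups ) - 1];
```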
Errors
There are two kinds of errors:
- System errors from a malformed URL, bad protocol, unknown host, timeout, or redirect loop.
- Site errors from an unknown web page, insufficient permissions, or a web server that is down for maintenance.
On a system error, the file functions return FALSE. There will be no HTTP header or content.
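On a site error, the server does respond, so the header is available even though, by default, the file functions still return FALSE and discard the error page (set the “ignore_errors” HTTP context option, available since PHP 5.2.10, to receive it). On success, the file functions return a non-FALSE value (such as the text of a web page from file_get_contents()). In both cases, the header’s last line starting with “HTTP” contains the server’s HTTP status code for the returned page (earlier “HTTP” lines, if any, are for redirected addresses):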
HTTP/1.1 200 OK
Wikipedia has a List of HTTP status codes. On success, the status is 200. The code is 404 if the page is not found, 401 if authentication is required, 503 if the site is down, etc.
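By default PHP’s HTTP wrapper treats 4xx and 5xx statuses as failures, so file_get_contents() returns FALSE and the error page is lost. The sketch below assumes the "ignore_errors" HTTP context option (PHP 5.2.10 and later) to keep the body, and reads the status from the last “HTTP” line. Both function names are hypothetical.

```php
<?php
/**
 * Extract the status code from the last status line in a header array.
 * last_status_code() is a hypothetical helper name used for illustration.
 */
function last_status_code( array $headerLines )
{
    for ( $i = count( $headerLines ) - 1; $i >= 0; $i-- ) {
        if ( preg_match( '#^HTTP/\S+\s+(\d{3})#i', $headerLines[$i], $m ) )
            return (int) $m[1];
    }
    return null;
}

/**
 * Fetch a URL and keep the body even when the server reports an error.
 * fetch_with_error_body() is also a hypothetical name.
 */
function fetch_with_error_body( $url )
{
    $context = stream_context_create( array( 'http' => array(
        'user_agent'    => 'spider',
        'timeout'       => 120,
        'ignore_errors' => true,    // return the body for 4xx and 5xx statuses
    ) ) );

    $content = @file_get_contents( $url, false, $context );
    if ( $content === false || !isset( $http_response_header ) )
        return null;                // system error: no response at all

    return array(
        'http_code' => last_status_code( $http_response_header ),
        'content'   => $content,
    );
}
```

With this variant, a 404 yields an array whose 'http_code' is 404 and whose 'content' is the site’s error page, while a system error still yields null.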
Alternatives
The fopen wrappers are an awkward retrofit to the PHP file functions to enable them to read remote files. There are two main alternatives:
- CURL (Client URL) functions. The CURL library is a standard PHP extension designed to handle HTTP, FTP, and other protocol requests more efficiently and flexibly than with the fopen wrappers. It is included in most PHP distributions. See the PHP tip: How to get a web page using CURL for code to use CURL.
- HTTP functions. The HTTP extension for PHP 5 and later provides classes and functions to handle HTTP requests. The extension is built atop CURL, but provides a nicer interface. The extension is not included in standard PHP distributions, but it can be added easily.
Downloads
- get_web_page_fopen.zip
  - Includes get_web_page.php. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
Further reading
Related articles at NadeauSoftware.com
- PHP tip: How to get a web page using CURL. Use PHP’s CURL (Client URL) functions to get a web page, handling web server redirects, compressed content, cookies, and user-agent strings.
- PHP tip: How to get a web page’s content type. Get the MIME type and character set from an HTTP header or from the web page content.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.
Web articles and specifications
- Hypertext Transfer Protocol - HTTP/1.1. The official specification for communications with web servers (RFC 2616, published by the IETF; W3C hosts a copy). The document explains (in detail) the request and response headers and HTTP status codes.
- List of HTTP status codes. Wikipedia has a summary of HTTP status codes and what they mean.

stream context params in file_get_contents()
Great article. I have been using file_get_contents() with fopen wrappers for a long time, but was completely unaware of the stream context parameters. Thanks!
However, it should be noted that while the fopen wrappers have been around since PHP 4.0.4, the stream context parameter to file_get_contents() is not available until PHP 5.x. I was so excited to try it that I didn't even see the note in the PHP reference. I spent some time trying to figure out why my file_get_contents() calls were failing even though the resource at the requested URL seemed to be delivered fine via other means like a web browser.
Anyway, thanks for great article. I will no doubt use the ideas here when my customer's hosting environment moves up to PHP 5.
Cheers!