The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at handling web page character encodings, extracting URLs, stripping away HTML syntax, punctuation, symbol characters, and numbers, and breaking a page down into a keyword list.
Code
The CURL (Client URL) library is a standard PHP extension, for PHP 4.0.2 and later, that is used to issue web page requests and handle web server responses. Use the curl_* functions to set options, execute a page request, check for errors, and return a page and server header.
Download: get_web_page_curl.zip
/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch      = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}
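This sample function takes a URL argument and returns an associative array containing the server header and content. The CURL functions automatically handle DNS lookups, redirects, and file decompression. Cookies are only stored and resent if CURL's cookie engine is enabled (for example, by setting CURLOPT_COOKIEFILE), which this function does not do.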
Header values are described in the curl_getinfo() manual. Highlights include:
- "url": the final page URL after redirects
- "content_type": the content type (e.g. "text/html; charset=utf-8")
- "http_code": the page status code (e.g. "200" on success)
- "filetime": the date stamp on the remote file
This function adds:
- "errno": the CURL error number (0 on success)
- "errmsg": the CURL error message for the error number
- "content": the page content (e.g. HTML text, image bytes, etc.)
On success, "errno" is 0, "http_code" is 200, and "content" contains the web page.
On an error with a bad URL, unknown host, timeout, or redirect loop, "errno" has a non-zero error code and "errmsg" has an error message (see the CURL error code list).
On an error with a missing web page or insufficient permissions, "errno" is 0, "http_code" has a non-200 HTTP status code, and "content" contains the site’s error message page (see the Wikipedia List of HTTP status codes).
This function can be extended to support GET and POST for web forms, file uploads, logins, SSL for encrypted web pages, and access through proxy servers.
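As one illustration of such an extension, the sketch below builds the extra options needed to submit a form with POST. The helper name post_form_options() and the sample field names are made up for this example; CURLOPT_POST, CURLOPT_POSTFIELDS, and http_build_query() are standard PHP. It only builds the options and does not send a request.

```php
<?php
/**
 * Build extra CURL options for submitting a web form with POST.
 * Hypothetical helper: its result would be combined with the $options
 * array in get_web_page() before curl_setopt_array() is called.
 */
function post_form_options( array $fields )
{
    return array(
        CURLOPT_POST       => true,                         // use HTTP POST
        CURLOPT_POSTFIELDS => http_build_query( $fields ),  // URL-encoded form body
    );
}

// Sample form fields (assumed values, for illustration only).
$extra = post_form_options( array( 'q' => 'php curl', 'page' => 2 ) );

echo $extra[CURLOPT_POSTFIELDS], "\n";   // q=php+curl&page=2
```

Because the CURL option constants are integers, combine the arrays with the + operator (for example, $options = $extra + $options) rather than array_merge(), which would renumber the keys.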
Example
Read a web page and check for errors:
$result = get_web_page( $url );

if ( $result['errno'] != 0 )
{
    // error: bad url, timeout, redirect loop
}

if ( $result['http_code'] != 200 )
{
    // error: no page, no permissions, no service
}

$page = $result['content'];
Explanation
The CURL functions are a standard extension for PHP 4.0.2 and later. They are included in most PHP installations. You can check your installation by calling phpinfo() and looking for --with-curl in the Configure Command section at the start of its output. The CURL library manual has more installation information.
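A quicker check than reading the phpinfo() output is to ask PHP directly whether the extension is loaded. This is a minimal sketch; the helper name curl_is_available() is made up for this illustration, while extension_loaded(), function_exists(), and curl_version() are standard PHP functions.

```php
<?php
/**
 * Report whether the CURL extension is usable in this PHP installation.
 * Hypothetical helper name, for illustration only.
 */
function curl_is_available()
{
    return extension_loaded( 'curl' ) && function_exists( 'curl_init' );
}

if ( curl_is_available() )
{
    // curl_version() returns an array describing the linked libcurl.
    $version = curl_version();
    echo "CURL is available, libcurl version ", $version['version'], "\n";
}
else
{
    echo "CURL is not available; enable the extension in php.ini\n";
}
```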
The CURL functions are built atop libcurl, a cross-platform library also available for Perl, Python, Ruby, C/C++, Java, and other languages. The PHP documentation is an abbreviated form of that available at the libcurl web site.
Getting a web page using CURL always includes these steps:
- Create a CURL handle using curl_init().
- Set up the request using curl_setopt() or curl_setopt_array().
- Request the page using curl_exec().
- Check if an error occurred using curl_errno().
- Get the HTTP header using curl_getinfo().
- Close the CURL handle using curl_close().
curl_init()
The curl_init() function creates a CURL handle to manage the web page request. The only function argument is a URL, which is saved until a later curl_exec() call executes the request.
curl_setopt() or curl_setopt_array()
The curl_setopt() and curl_setopt_array() functions configure the page request by setting options and their values. These two PHP functions are equivalent: curl_setopt() sets one option at a time, while curl_setopt_array() sets a list of options all at once (PHP 5.1.3+ required).
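The two styles can be compared side by side. In this sketch the URL is a placeholder and no request is sent, because curl_exec() is never called; it only shows that both functions report success or failure with a boolean.

```php
<?php
// Assumption: a placeholder URL. Nothing is fetched in this example.
$url = "http://example.com/";

// Style 1: one option at a time with curl_setopt().
$ch1 = curl_init( $url );
$ok1 = curl_setopt( $ch1, CURLOPT_RETURNTRANSFER, true )
    && curl_setopt( $ch1, CURLOPT_FOLLOWLOCATION, true )
    && curl_setopt( $ch1, CURLOPT_MAXREDIRS, 10 );

// Style 2: the same options all at once with curl_setopt_array().
$ch2 = curl_init( $url );
$ok2 = curl_setopt_array( $ch2, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 10,
) );

// Each result is false if any option could not be set.
var_dump( $ok1, $ok2 );

curl_close( $ch1 );
curl_close( $ch2 );
```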
The manual has a long list of available options for different uses of CURL. Here are the basics for getting a web page:
Essential options:
- CURLOPT_RETURNTRANSFER. When true, CURL returns the web page content from curl_exec(). When false, CURL prints the page to the screen (which is hardly ever useful).
- CURLOPT_HEADER. When true, CURL includes the web server’s HTTP response header in the page. When false, it is excluded, making the returned page easier to parse later. The header is still available by calling curl_getinfo().
Important options:
- CURLOPT_FOLLOWLOCATION. When true, CURL follows web page redirects automatically. When false, it stops on the first redirect and returns the server's redirect response (a 3xx HTTP status code) instead of the target page.
- CURLOPT_ENCODING. When set to an empty string, CURL handles uncompressed and compressed transfers. Compressed pages are automatically decompressed before they’re returned by curl_exec().
Web etiquette options:
- CURLOPT_USERAGENT. The user agent is the name of the application making the request (such as a web browser). If left empty, some web servers will reject the request.
- CURLOPT_AUTOREFERER. The referer is a URL for the web page that linked to the requested web page. When following redirects, set this to true and CURL automatically fills in the URL of the page being redirected away from.
Error handling options:
- CURLOPT_CONNECTTIMEOUT. Fail if a web server doesn’t respond to a connection within a time limit (seconds).
- CURLOPT_TIMEOUT. Fail if a web server doesn’t return the web page within a time limit (seconds).
- CURLOPT_MAXREDIRS. If a web page redirect leads to another redirect, and another, stop after a maximum number of redirects.
curl_exec()
The curl_exec() function issues the request and returns the web page (HTML, XHTML, XML, image, etc.).
curl_errno() and curl_error()
There are two kinds of errors:
- System errors result from a bad URL, bad protocol, unknown host, connect timeout, response timeout, or redirect loop. On a system error, curl_errno() returns a non-zero error code (see the CURL error code list) and curl_error() returns a short error message suitable for debugging output.
- Site errors result from an unknown web page, insufficient permissions, or a web server that is down for maintenance. On a site error or success, curl_errno() returns zero and the web server’s HTTP status code is available in the server header returned by curl_getinfo() (see below).
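The distinction between the two kinds of errors can be sketched as a small function that inspects the array returned by get_web_page(). The function name classify_result() and its three labels are made up for this illustration, and the sample arrays are hand-built stand-ins rather than real responses.

```php
<?php
/**
 * Classify a get_web_page() result as "system error", "site error",
 * or "ok". Hypothetical helper, for illustration only.
 */
function classify_result( $result )
{
    if ( $result['errno'] != 0 )
        return "system error";   // CURL could not complete the request
    if ( $result['http_code'] != 200 )
        return "site error";     // the server answered, but not with the page
    return "ok";
}

// Hand-built sample arrays standing in for get_web_page() results.
// CURL error number 28 is the "operation timed out" code.
$timeout  = array( 'errno' => 28, 'errmsg' => 'Operation timed out', 'http_code' => 0 );
$notfound = array( 'errno' => 0,  'errmsg' => '', 'http_code' => 404 );
$success  = array( 'errno' => 0,  'errmsg' => '', 'http_code' => 200 );

echo classify_result( $timeout ), "\n";    // system error
echo classify_result( $notfound ), "\n";   // site error
echo classify_result( $success ), "\n";    // ok
```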
curl_getinfo()
The curl_getinfo() function returns values from the server header, such as the web page’s content type and the HTTP status code. See the manual for a list of header fields returned. Highlights include:
- "url": the final web page URL after redirects
- "content_type": the content type (e.g. "text/html; charset=utf-8")
- "http_code": the web page status code (e.g. "200" on success)
- "filetime": the date stamp on the remote file
The "http_code" entry has the HTTP status code (see Wikipedia’s List of HTTP status codes). On success, the status is 200. The code is 404 if the web page is not found, 401 if authentication is required, 503 if the web site is down, etc.
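As a small example of using these fields, the "content_type" value can be split into its MIME type and character set. This is a rough sketch that handles only the simple "type; charset=name" form; the helper name split_content_type() is made up for this illustration and the input strings are sample values.

```php
<?php
/**
 * Split a "content_type" value such as "text/html; charset=utf-8" into
 * its MIME type and character set. Hypothetical helper; the charset is
 * null when the header does not name one.
 */
function split_content_type( $contentType )
{
    // Cast to string because the value can be missing when the server
    // sends no Content-Type header.
    $contentType = (string) $contentType;

    $parts   = explode( ';', $contentType );
    $mime    = strtolower( trim( $parts[0] ) );
    $charset = null;
    if ( preg_match( '/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m ) )
        $charset = strtolower( $m[1] );

    return array( 'mime' => $mime, 'charset' => $charset );
}

$parts = split_content_type( "text/html; charset=utf-8" );
echo $parts['mime'], " / ", $parts['charset'], "\n";   // text/html / utf-8
```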
curl_close()
The curl_close() function destroys the CURL handle, freeing memory.
Alternatives
The CURL library is one way to read pages from a web server. There are two main alternatives:
- Fopen wrappers. The fopen wrappers extend PHP's file reading functions to handle getting files via HTTP and FTP protocols. They are less powerful than CURL, but they are included in all PHP distributions. See the PHP tip: How to get a web page using the fopen wrappers for code to use these functions.
- HTTP functions. The HTTP extension for PHP 5 and later provides classes and functions to handle HTTP requests. The extension is built atop CURL, but provides a nicer interface. The extension is not included in standard PHP distributions, but it can be added easily.
Downloads
- get_web_page_curl.zip
  - Includes get_web_page.php. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
Further reading
Related articles at NadeauSoftware.com
- PHP tip: How to get a web page’s content type. Extract the MIME type and character set from an HTTP header or from the web page content.
- PHP tip: How to get a web page using the fopen wrappers. Use PHP's file reading functions to get a web page, handling web server redirects and user-agent strings.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.
Web articles and specifications
- Hypertext Transfer Protocol - HTTP/1.1. The official specification for communications with web servers (RFC 2616, an IETF standard that the W3C also hosts). The document explains (in detail) the request and response headers and HTTP status codes.
- List of HTTP status codes. Wikipedia has a summary of HTTP status codes and what they mean.
- CURL, Client URL Library Functions. The PHP manual describes the CURL functions in more detail. The same functions can handle FTP, HTTPS, GET and POST of forms, file uploads, password authentication, and more.
- CURL groks URLs. The Open Source CURL project provides the underlying code for PHP’s CURL functions. The project’s documentation includes more information about CURL, command-line tools, etc.
- PHP/CURL Examples Collection. The CURL project includes several PHP examples showing how to do GET and POST with forms, use cookies, and more.

Curl setup is very valuable
I spent a lot of time searching for the curl library file and installed it on my XP system, but it wouldn't work. During the search I found this site and read the valuable content. Thanks to your site, I now know that it is very simple to activate the curl functions: just go to the php.ini file and remove the comment on the extension=php_curl.dll line. It is working fine. Note that your PHP version must be above 4.3x.
Thanks to the site content and site organizer.
channappagoudar, Bangalore, India
An Alternative for the OOP PHP Coders
I have an easy copy/paste alternative for the PHP OOP Coders: a PHP cUrl Class ready for use.
The best route is to study this page and understand the way it all works and then use my script and enjoy the ease of Object Oriented Programming.
If you understand how it works you can make changes to my class easily, if you don't, it's very easy to just use it in your projects to achieve your goals.
After all, if a script serves its purpose, some may not even care how it works ;)
Is there a way to get all
Is there a way to get all text on a page, such as text dynamically placed on the page through DIV tags? Is there a way to just do a "select all" and copy all of the text on a page through PHP?
Getting "all" of the text
There are several ways to interpret your question...
Yes. Use the CURL code explained in this article to download the full HTML web page from a server.
Yes. The plain text on a page is basically the HTML page with the HTML tags properly stripped away. See the first part of this article on How to extract keywords from a web page for the steps needed to get the plain text on a page.
Yes. The page you download with CURL is the full web page assembled by the server. If the server uses scripts to dynamically generate pages from database content, those scripts will be run by the server and you'll get the result.
No. A web page with embedded JavaScript is actually a program. CURL gives you the program's source code (HTML and JavaScript), but doesn't run that program. To run a page's embedded JavaScript you need (1) a JavaScript interpreter, and (2) the Document Object Model (DOM) for the page. Browsers have these, but PHP does not. People are working on PHP versions of these, but developing them is a big task. If this is what you need, you might skip PHP and instead look at writing C++ code using WebKit.
Possibly, but I haven't done it. Page readers for the visually impaired read the page text shown by a web browser. They do this as a browser plug-in, so it is conceivable that you could write (or find) a plug-in to help you get the page text. My guess, however, is that this would be a very involved task.
It's possible you could use remote-management/screen-sharing/testing tools that let you send fake keyboard and mouse events to an application. Sending a "select all" and "copy" to a browser might get you the page text you want. But this too seems pretty involved and probably isn't a robust solution.
Finally, I should point out that JavaScripts can run throughout the time a page is being displayed. A "select all" and "copy" scheme only gets you a snapshot of the page at the current moment.