PHP tip: How to get a web page using CURL

Technologies: PHP 4.0.2+ (the sample code uses curl_setopt_array(), which requires PHP 5.1.3+)

The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at handling web page character encodings, extracting URLs, stripping away HTML syntax, punctuation, symbol characters, and numbers, and breaking a page down into a keyword list.

Code

The CURL (Client URL) library is a standard PHP extension, available in PHP 4.0.2 and later, that issues web page requests and handles web server responses. Use the curl_* functions to set options, execute a page request, check for errors, and return a page along with information about the server's response.

Download: get_web_page_curl.zip

/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}

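This sample function takes a URL argument and returns an associative array containing information from the server header and the page content. With the options shown, the CURL functions automatically handle DNS lookups, redirects, and file decompression. Cookies are not stored or sent back unless the cookie engine is enabled with an option such as CURLOPT_COOKIEFILE or CURLOPT_COOKIEJAR.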

Header values are described in the curl_getinfo() manual. Highlights include:

"url" the final page URL after redirects
"content_type" the content type (e.g. "text/html; charset=utf-8")
"http_code" the page status code (e.g. 200 on success)
"filetime" the date stamp on the remote file (-1 unless CURLOPT_FILETIME is set and the server reports a time)

This function adds:

"errno" the CURL error number (0 on success)
"errmsg" the CURL error message for the error number
"content" the page content (e.g. HTML text, image bytes, etc.)

On success, "errno" is 0, "http_code" is 200, and "content" contains the web page.

On an error with a bad URL, unknown host, timeout, or redirect loop, "errno" has a non-zero error code and "errmsg" has an error message (see the CURL error code list).

On an error with a missing web page or insufficient permissions, "errno" is 0, "http_code" has a non-200 HTTP status code, and "content" contains the site’s error message page (see the Wikipedia List of HTTP status codes).
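
The three outcomes above can be told apart with two comparisons. The following is a minimal sketch; classify_page_result() is a hypothetical helper name that is not part of the download, and it assumes the array layout returned by get_web_page():

```php
<?php
/**
 * Classify the array returned by get_web_page() into one of the three
 * outcomes described in the text. Hypothetical helper, for illustration.
 */
function classify_page_result( array $result )
{
    // System error: bad URL, unknown host, timeout, or redirect loop.
    if ( $result['errno'] != 0 )
        return 'system error';

    // Site error: the server answered, but not with status 200.
    if ( $result['http_code'] != 200 )
        return 'site error';

    // Success: "content" holds the requested page.
    return 'success';
}
```

The order of the checks matters: on a system error "http_code" is usually 0, so "errno" has to be tested first.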

This function can be extended to support GET and POST for web forms, file uploads, logins, SSL for encrypted web pages, and access through proxy servers.
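
As one illustration of such an extension, the sketch below builds the extra options needed to submit a web form with POST. The helper name post_form_options() and the form field names are assumptions made for this example; CURLOPT_POST and CURLOPT_POSTFIELDS are standard CURL options.

```php
<?php
/**
 * Build the extra CURL options for submitting a web form with POST.
 * Hypothetical helper; the caller supplies the form field names.
 */
function post_form_options( array $fields )
{
    return array(
        CURLOPT_POST       => true,                          // send a POST request
        CURLOPT_POSTFIELDS => http_build_query( $fields ),   // URL-encoded form body
    );
}

// Combine with an existing option list using "+" (array union).
// array_merge() would renumber the integer CURLOPT_* keys and break them.
$options = array( CURLOPT_RETURNTRANSFER => true )
         + post_form_options( array( 'q' => 'php curl', 'page' => 2 ) );
```

Passing a URL-encoded string sends the form as application/x-www-form-urlencoded, which is what most simple web forms expect.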

Example

Read a web page and check for errors:

$result = get_web_page( $url );
if ( $result['errno'] != 0 )
{
    // handle error: bad url, timeout, redirect loop
}

if ( $result['http_code'] != 200 )
{
    // handle error: no page, no permissions, no service
}

$page = $result['content'];

Explanation

The CURL functions are a standard extension for PHP 4.0.2 and later. They are included in most PHP installations. You can check your installation by calling phpinfo() and looking for a "curl" section in its output, or for --with-curl in the Configure Command section at the start of the output. The CURL library manual has more installation information.
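
The same check can be made from a script instead of reading the phpinfo() output. This is a minimal sketch; curl_is_available() is a hypothetical helper name, while extension_loaded(), function_exists(), and curl_version() are standard PHP functions.

```php
<?php
/**
 * Report whether the CURL extension is loaded in this PHP installation.
 * Hypothetical helper, for illustration.
 */
function curl_is_available()
{
    return extension_loaded( 'curl' ) && function_exists( 'curl_init' );
}

if ( curl_is_available() )
{
    // curl_version() returns an array describing the underlying libcurl.
    $info = curl_version();
    echo "CURL is available (libcurl " . $info['version'] . ")\n";
}
else
{
    echo "CURL is not available in this PHP installation\n";
}
```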

The CURL functions are built atop libcurl, a cross-platform library also available for Perl, Python, Ruby, C/C++, Java, and other languages. The PHP documentation is an abbreviated form of that available at the libcurl web site.

Getting a web page using CURL always includes these steps:

  1. Create a CURL handle using curl_init().
  2. Set up the request using curl_setopt() or curl_setopt_array().
  3. Request the page using curl_exec().
  4. Check if an error occurred using curl_errno().
  5. Get the HTTP header using curl_getinfo().
  6. Close the CURL handle using curl_close().
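
The six steps map directly onto code. The sketch below is a stripped-down variant of get_web_page() that sets only two options; the function name fetch_page_steps() is an assumption made for this example.

```php
<?php
/**
 * Fetch a URL, with one statement group per step in the list above.
 * Hypothetical, minimal variant of get_web_page().
 */
function fetch_page_steps( $url )
{
    $ch = curl_init( $url );                            // 1. create a handle

    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );   // 2. set up the request
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );

    $content = curl_exec( $ch );                        // 3. request the page

    $errno = curl_errno( $ch );                         // 4. check for an error

    $info = curl_getinfo( $ch );                        // 5. get transfer information

    curl_close( $ch );                                  // 6. close the handle

    return array(
        'errno'     => $errno,
        'http_code' => $info['http_code'],
        'content'   => $content,
    );
}
```

Steps 4 and 5 must come before step 6, because the error code and transfer information are read from the handle.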

curl_init()

The curl_init() function creates a CURL handle to manage the web page request. The only function argument is an optional URL, which is saved until a later curl_exec() call executes the request. If the URL is omitted, set it later with the CURLOPT_URL option.

curl_setopt() or curl_setopt_array()

The curl_setopt() and curl_setopt_array() functions configure the page request by setting options and their values. These two PHP functions are equivalent: curl_setopt() sets one option at a time, while curl_setopt_array() sets a list of options all at once (PHP 5.1.3+ required).
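
Because the two functions are equivalent, an option list can be applied either way. The sketch below falls back to a curl_setopt() loop when curl_setopt_array() is missing (PHP before 5.1.3); set_curl_options() is a hypothetical helper name.

```php
<?php
/**
 * Apply a list of options to a CURL handle. Uses curl_setopt_array()
 * when it exists, otherwise sets the options one at a time.
 * Returns true only if every option was accepted.
 */
function set_curl_options( $ch, array $options )
{
    if ( function_exists( 'curl_setopt_array' ) )
        return curl_setopt_array( $ch, $options );

    foreach ( $options as $option => $value )
    {
        if ( !curl_setopt( $ch, $option, $value ) )
            return false;   // stop at the first rejected option
    }
    return true;
}
```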

The manual has a long list of available options for different uses of CURL. Here are the basics for getting a web page:

Essential options:

  • CURLOPT_RETURNTRANSFER. When true, CURL returns the web page content from curl_exec(). When false, CURL prints the page to the screen (which is hardly ever useful).
  • CURLOPT_HEADER. When true, CURL includes the web server’s HTTP response header in the page. When false, it is excluded, making the returned page easier to parse later. Selected header values, such as the status code and content type, are still available by calling curl_getinfo().
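
If the raw response header is needed, one approach is to set CURLOPT_HEADER to true and split the returned string afterwards. The sketch below assumes the header length comes from curl_getinfo( $ch, CURLINFO_HEADER_SIZE ), which reports the total size of all headers received; split_header_and_body() is a hypothetical helper name.

```php
<?php
/**
 * Split a curl_exec() result fetched with CURLOPT_HEADER => true into
 * its header text and page body. $header_size is expected to come from
 * curl_getinfo( $ch, CURLINFO_HEADER_SIZE ).
 */
function split_header_and_body( $response, $header_size )
{
    return array(
        'header' => (string) substr( $response, 0, $header_size ),
        // The cast covers older PHP versions, where substr() returns
        // false instead of "" for an empty body.
        'body'   => (string) substr( $response, $header_size ),
    );
}
```

When redirects are followed, the header part contains the headers of every response in the chain, not just the last one.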

Important options:

  • CURLOPT_FOLLOWLOCATION. When true, CURL follows web page redirects automatically. When false, it stops on the first redirect and returns the redirect response itself: curl_errno() is 0 and "http_code" holds the 3xx redirect status code.
  • CURLOPT_ENCODING. When empty, CURL handles uncompressed and compressed transfers. Compressed pages are automatically decompressed before they’re returned by curl_exec().

Web etiquette options:

  • CURLOPT_USERAGENT. The user agent is the name of the application making the request (such as a web browser). If left empty, some web servers will reject the request.
  • CURLOPT_AUTOREFERER. The referer is a URL for the web page that linked to the requested web page. When following redirects, set this to true and CURL automatically fills in the URL of the page being redirected away from.

Error handling options:

  • CURLOPT_CONNECTTIMEOUT. Fail if a web server doesn’t respond to a connection within a time limit (seconds).
  • CURLOPT_TIMEOUT. Fail if the whole request, including the connection and the transfer of the web page, doesn’t complete within a time limit (seconds).
  • CURLOPT_MAXREDIRS. If a web page redirect leads to another redirect, and another, stop after a maximum number of redirects.

curl_exec()

The curl_exec() function issues the request and, when CURLOPT_RETURNTRANSFER is true, returns the web page (HTML, XHTML, XML, image, etc.) as a string. On a system error it returns false.

curl_errno() and curl_error()

There are two kinds of errors:

  • System errors result from a bad URL, bad protocol, unknown host, connect timeout, response timeout, or redirect loop. On a system error, curl_errno() returns a non-zero error code (see the CURL error code list) and curl_error() returns a short error message suitable for debugging output.
  • Site errors result from an unknown web page, insufficient permissions, or a web server that is down for maintenance. On a site error or success, curl_errno() returns zero and the web server’s HTTP status code is available in the server header returned by curl_getinfo() (see below).

curl_getinfo()

The curl_getinfo() function returns information about the transfer, including values taken from the server header such as the web page’s content type and the HTTP status code. See the manual for a list of the fields returned. Highlights include:

"url" the final web page URL after redirects
"content_type" the content type (e.g. "text/html; charset=utf-8")
"http_code" the web page status code (e.g. 200 on success)
"filetime" the date stamp on the remote file (-1 unless CURLOPT_FILETIME is set and the server reports a time)

The "http_code" entry has the HTTP status code (see Wikipedia’s List of HTTP status codes). On success, the status is 200. The code is 404 if the web page is not found, 401 if authentication is required, 503 if the web site is down, etc.
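
The status codes named above can be turned into readable messages with a small lookup. This is a minimal sketch covering only those four codes; describe_http_code() is a hypothetical helper name. To read just the status code from a handle, curl_getinfo( $ch, CURLINFO_HTTP_CODE ) can be used in place of the full array.

```php
<?php
/**
 * Describe the HTTP status codes mentioned in the text.
 * Hypothetical helper; any other code gets a generic message.
 */
function describe_http_code( $http_code )
{
    switch ( (int) $http_code )
    {
        case 200: return 'success';
        case 401: return 'authentication required';
        case 404: return 'web page not found';
        case 503: return 'web site is down';
        default:  return 'other status ' . (int) $http_code;
    }
}
```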

curl_close()

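The curl_close() function destroys the CURL handle, freeing memory. As of PHP 8.0 the function has no effect, and the handle is freed automatically when it is no longer referenced.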

Alternatives

The CURL library is one way to read pages from a web server. There are two main alternatives:

  • Fopen wrappers. The fopen wrappers extend PHP's file reading functions to handle getting files via HTTP and FTP protocols. They are less powerful than CURL, but they are included in all PHP distributions (a server's allow_url_fopen setting can disable them for URLs). See the PHP tip: How to get a web page using the fopen wrappers for code to use these functions.
  • HTTP functions. The HTTP extension for PHP 5 and later provides classes and functions to handle HTTP requests. The extension is built atop CURL, but provides a nicer interface. The extension is not included in standard PHP distributions, but it can be added easily.
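
For comparison, here is a rough sketch of the fopen wrapper approach using file_get_contents() with an HTTP stream context. It is not the code from the companion tip; the function name get_web_page_fopen() and the option values (copied from the CURL example) are assumptions made for this example.

```php
<?php
/**
 * Get a web page using the fopen wrappers instead of CURL.
 * Returns the page content, or false on failure.
 * Requires allow_url_fopen to be enabled for http:// URLs.
 */
function get_web_page_fopen( $url, $timeout = 120 )
{
    $context = stream_context_create( array(
        'http' => array(
            'user_agent'    => 'spider',   // who am i
            'timeout'       => $timeout,   // read timeout in seconds
            'max_redirects' => 10,         // stop after 10 redirects
        ),
    ) );

    // "@" hides the warning PHP raises on failure; the caller checks
    // for a false return value instead.
    return @file_get_contents( $url, false, $context );
}
```

Unlike the CURL version, this returns only the content, so error details such as the CURL error number are not available.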

Curl set up is very valuable

I spent a lot of time searching for the curllib file and installing it on my XP system, but it wouldn't work. During the search I found this site and read the valuable content. Thanks to your site, I now know it is very simple to activate the curl functions: just go to the php.ini file and remove the comment on the extension=php_curl.dll line. It is working fine. Note that your PHP version must be above 4.3x.

Thanks to the site content and site organizer.

channappagoudar, Bangalore, India

An Alternative for the OOP PHP Coders

I have an easy copy/paste alternative for the PHP OOP Coders: a PHP cUrl Class ready for use.
The best route is to study this page and understand the way it all works and then use my script and enjoy the ease of Object Oriented Programming.
If you understand how it works you can make changes to my class easily, if you don't, it's very easy to just use it in your projects to achieve your goals.

After all, if a script serves its purpose, some may not even care how it works ;)

Is there a way to get all

Is there a way to get all text on a page, such as text dynamically placed on the page through DIV tags? Is there a way to just do a "select all" and copy all of the text on a page through PHP?

Getting "all" of the text

There are several ways to interpret your question...

  • Can you get all HTML text on a page?
    Yes. Use the CURL code explained in this article to download the full HTML web page from a server.
  • Can you get all plain text on a page?
    Yes. The plain text on a page is basically the HTML page with the HTML tags properly stripped away. See the first part of this article on How to extract keywords from a web page for the steps needed to get the plain text on a page.
  • Can you get all dynamic text on a page, as added by scripts run by the server?
    Yes. The page you download with CURL is the full web page assembled by the server. If the server uses scripts to dynamically generate pages from database content, those scripts will be run by the server and you'll get the result.
  • Can you get all dynamic text on a page, as added by JavaScript run by the browser?
    No. A web page with embedded JavaScript is actually a program. CURL gives you the program's source code (HTML and JavaScript), but doesn't run that program. To run a page's embedded JavaScript you need (1) a JavaScript interpreter, and (2) the Document Object Model (DOM) for the page. Browsers have these, but PHP does not. People are working on PHP versions of these, but developing them is a big task. If this is what you need, you might skip PHP and instead look at writing C++ code using WebKit.
  • Can you do a "select all" on a web browser's displayed page?
    Possibly, but I haven't done it. Page readers for the visually impaired read the page text shown by a web browser. They do this as a browser plug-in, so it is conceivable that you could write (or find) a plug-in to help you get the page text. My guess, however, is that this would be a very involved task.

    It's possible you could use remote-management/screen-sharing/testing tools that let you send fake keyboard and mouse events to an application. Sending a "select all" and "copy" to a browser might get you the page text you want. But this too seems pretty involved and probably isn't a robust solution.

    Finally, I should point out that JavaScript can run throughout the time a page is being displayed. A "select all" and "copy" scheme only gets you a snapshot of the page at the current moment.


Nadeau software consulting