PHP tip: How to get a web page using CURL

Technologies: PHP 4.0.3+

The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.

This article stands on its own and is also part of an article series on How to extract keywords from a web page. The rest of the series looks at handling web page character encodings, extracting URLs, stripping away HTML syntax, punctuation, symbol characters, and numbers, and breaking a page down into a keyword list.

Code

The CURL (Client URL) library is a standard PHP extension, for PHP 4.0.3 and later, that is used to issue web page requests and handle web server responses. Use the curl_* functions to set options, execute a page request, check for errors, and return a page and server header.

Download: get_web_page_curl.zip

/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}

This sample function takes a URL argument and returns an associative array containing the server header and content. The CURL functions automatically handle DNS lookups, redirects, cookies, and file decompression.

Header values are described in the curl_getinfo() manual. Highlights include:

"url" the final page URL after redirects
"content_type" the content type (e.g. "text/html; charset=utf-8")
"http_code" the page status code (e.g. "200" on success)
"filetime" the date stamp on the remote file

This function adds:

"errno" the CURL error number (0 on success)
"errmsg" the CURL error message for the error number
"content" the page content (e.g. HTML text, image bytes, etc.)

On success, "errno" is 0, "http_code" is 200, and "content" contains the web page.

On an error with a bad URL, unknown host, timeout, or redirect loop, "errno" has a non-zero error code and "errmsg" has an error message (see the CURL error code list).

On an error with a missing web page or insufficient permissions, "errno" is 0, "http_code" has a non-200 HTTP status code, and "content" contains the site’s error message page (see the Wikipedia List of HTTP status codes).

This function can be extended to support GET and POST for web forms, file uploads, logins, SSL for encrypted web pages, and access through proxy servers.
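For example, options like these could be merged into the $options array above (a sketch only; the form fields and the proxy address are hypothetical placeholders, not part of the original code):

// A sketch of extra options that could be merged into $options above.
// The form fields and the proxy address are hypothetical placeholders.
$extra = array(
    CURLOPT_POST           => true,                       // send a POST instead of a GET
    CURLOPT_POSTFIELDS     => array( 'q' => 'keyword' ),  // hypothetical form fields
    CURLOPT_SSL_VERIFYPEER => true,                       // verify the server's SSL certificate
    CURLOPT_PROXY          => "proxy.example.com:8080",   // hypothetical proxy server
);
$options = $extra + $options;   // array union keeps the existing options and adds the new ones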

Example

Read a web page and check for errors:

$result = get_web_page( $url );
if ( $result['errno'] != 0 ) {
    // ... error: bad url, timeout, redirect loop ...
}

if ( $result['http_code'] != 200 ) {
    // ... error: no page, no permissions, no service ...
}

$page = $result['content'];

Explanation

The CURL functions are a standard extension for PHP 4.0.3 and later. They are included in most PHP installations. You can check your installation by calling phpinfo() and looking for --with-curl in the Configure Command section at the start of its output. The CURL library manual has more installation information.
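A quicker check from code, rather than reading phpinfo() output, is to test for the extension directly (a small sketch):

// Stop early if the CURL extension is not available in this PHP installation.
if ( !extension_loaded( 'curl' ) )
    die( "The CURL extension is not installed or not enabled." );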

The CURL functions are built atop libcurl, a cross-platform library also available for Perl, Python, Ruby, C/C++, Java, and other languages. The PHP documentation is an abbreviated form of that available at the libcurl web site.

Getting a web page using CURL always includes these steps:

  1. Create a CURL handle using curl_init().
  2. Set up the request using curl_setopt() or curl_setopt_array().
  3. Request the page using curl_exec().
  4. Check if an error occurred using curl_errno().
  5. Get the HTTP header using curl_getinfo().
  6. Close the CURL handle using curl_close().
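Stripped of the wrapper function above, the bare sequence looks like this (a minimal sketch; the URL is a placeholder and only one option is set):

$ch = curl_init( "http://www.example.com/" );       // 1. create a CURL handle
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );   // 2. set up the request
$content = curl_exec( $ch );                        // 3. request the page
$err     = curl_errno( $ch );                       // 4. check if an error occurred
$header  = curl_getinfo( $ch );                     // 5. get the HTTP header
curl_close( $ch );                                  // 6. close the CURL handle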

curl_init()

The curl_init() function creates a CURL handle to manage the web page request. The only function argument is a URL, which is saved until a later curl_exec() call executes the request.

curl_setopt() or curl_setopt_array()

The curl_setopt() and curl_setopt_array() functions configure the page request by setting options and their values. These two PHP functions are equivalent: curl_setopt() sets one option at a time, while curl_setopt_array() sets a list of options all at once (PHP 5.1.3+ required).
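For example, these two fragments configure a handle $ch identically (a sketch using two of the options from the code above):

// One option at a time with curl_setopt()...
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );

// ...or all at once with curl_setopt_array() (PHP 5.1.3+).
curl_setopt_array( $ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
) );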

The manual has a long list of available options for different uses of CURL. Here are the basics for getting a web page:

Essential options:

  • CURLOPT_RETURNTRANSFER. When true, CURL returns the web page content from curl_exec(). When false, CURL prints the page to the screen (which is hardly ever useful).
  • CURLOPT_HEADER. When true, CURL includes the web server’s HTTP response header in the page. When false, it is excluded, making the returned page easier to parse later. The header is still available by calling curl_getinfo().

Important options:

  • CURLOPT_FOLLOWLOCATION. When true, CURL follows web page redirects automatically. When false, CURL stops at the first redirect, and the caller gets the redirect response (with its non-200 status code) instead of the final page.
  • CURLOPT_ENCODING. When empty, CURL handles uncompressed and compressed transfers. Compressed pages are automatically decompressed before they’re returned by curl_exec().

Web etiquette options:

  • CURLOPT_USERAGENT. The user agent is the name of the application making the request (such as a web browser). If left empty, some web servers will reject the request.
  • CURLOPT_AUTOREFERER. The referer is a URL for the web page that linked to the requested web page. When following redirects, set this to true and CURL automatically fills in the URL of the page being redirected away from.

Error handling options:

  • CURLOPT_CONNECTTIMEOUT. Fail if a web server doesn’t respond to a connection within a time limit (seconds).
  • CURLOPT_TIMEOUT. Fail if a web server doesn’t return the web page within a time limit (seconds).
  • CURLOPT_MAXREDIRS. If a web page redirect leads to another redirect, and another, stop after a maximum number of redirects.

curl_exec()

The curl_exec() function issues the request and returns the web page (HTML, XHTML, XML, image, etc.).

curl_errno() and curl_error()

There are two kinds of errors:

  • System errors result from a bad URL, bad protocol, unknown host, connect timeout, response timeout, or redirect loop. On a system error, curl_errno() returns a non-zero error code (see the CURL error code list) and curl_error() returns a short error message suitable for debugging output.
  • Site errors result from an unknown web page, insufficient permissions, or a web server that is down for maintenance. On a site error or success, curl_errno() returns zero and the web server’s HTTP status code is available in the server header returned by curl_getinfo() (see below).

curl_getinfo()

The curl_getinfo() function returns values from the server header, such as the web page’s content type and the HTTP status code. See the manual for a list of header fields returned. Highlights include:

"url" the final web page URL after redirects
"content_type" the content type (e.g. "text/html; charset=utf-8")
"http_code" the web page status code (e.g. "200" on success)
"filetime" the date stamp on the remote file

The "http_code" entry has the HTTP status code (see Wikipedia’s List of HTTP status codes). On success, the status is 200. The code is 404 if the web page is not found, 401 if authentication is required, 503 if the web site is down, etc.

curl_close()

The curl_close() function destroys the CURL handle, freeing memory.

Alternatives

The CURL library is one way to read pages from a web server. There are two main alternatives:

  • Fopen wrappers. The fopen wrappers extend PHP's file reading functions to handle getting files via HTTP and FTP protocols. They are less powerful than CURL, but they are included in all PHP distributions. See the PHP tip: How to get a web page using the fopen wrappers for code to use these functions. A one-line sketch follows this list.
  • HTTP functions. The HTTP extension for PHP 5 and later provides classes and functions to handle HTTP requests. The extension is built atop CURL, but provides a nicer interface. The extension is not included in standard PHP distributions, but it can be added easily.
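For the fopen wrappers alternative above, the simplest form is a one-line fetch (a sketch only; it assumes allow_url_fopen is enabled and does no error handling or header inspection):

// Fetch a page with the fopen wrappers; returns false on failure.
$page = file_get_contents( "http://www.example.com/" );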


Comments

CURL setup is very valuable

I spent a lot of time searching for the CURL library file and installing it on my XP system, but it wouldn't work. During the search I found this site and read the valuable content. From your site I learned that it is very simple to activate the CURL functions: just go to the php.ini file and remove the comment on the extension=php_curl.dll line. It is working fine now. Note that your PHP version must be above 4.3.x.
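For reference, that php.ini change looks like this on a Windows installation (assuming the bundled php_curl.dll is present in PHP's extension directory):

; php.ini: remove the leading ";" from this line, then restart the web server
extension=php_curl.dll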

Thanks to the site content and site organizer.

channappagoudar, Bangalore, India

An Alternative for the OOP PHP Coders

I have an easy copy/paste alternative for the PHP OOP Coders: a PHP cUrl Class ready for use.
The best route is to study this page and understand the way it all works and then use my script and enjoy the ease of Object Oriented Programming.
If you understand how it works you can make changes to my class easily, if you don't, it's very easy to just use it in your projects to achieve your goals.

After all, if a script serves its purpose, some may not even care how it works ;)

Is there a way to get all

Is there a way to get all text on a page, such as text dynamically placed on the page through DIV tags? Is there a way to just do a "select all" and copy of all the text on a page through PHP?

Getting "all" of the text

There are several ways to interpret your question...

  • Can you get all HTML text on a page?
    Yes. Use the CURL code explained in this article to download the full HTML web page from a server.
  • Can you get all plain text on a page?
    Yes. The plain text on a page is basically the HTML page with the HTML tags properly stripped away. See the first part of this article on How to extract keywords from a web page for the steps needed to get the plain text on a page.
  • Can you get all dynamic text on a page, as added by scripts run by the server?
    Yes. The page you download with CURL is the full web page assembled by the server. If the server uses scripts to dynamically generate pages from database content, those scripts will be run by the server and you'll get the result.
  • Can you get all dynamic text on a page, as added by JavaScript run by the browser?
    No. A web page with embedded JavaScript is actually a program. CURL gives you the program's source code (HTML and JavaScript), but doesn't run that program. To run a page's embedded JavaScript you need (1) a JavaScript interpreter, and (2) the Document Object Model (DOM) for the page. Browsers have these, but PHP does not. People are working on PHP versions of these, but developing them is a big task. If this is what you need, you might skip PHP and instead look at writing C++ code using WebKit.
  • Can you do a "select all" on a web browser's displayed page?
    Possibly, but I haven't done it. Page readers for the visually impaired read the page text shown by a web browser. They do this as a browser plug-in, so it is conceivable that you could write (or find) a plug-in to help you get the page text. My guess, however, is that this would be a very involved task.

    It's possible you could use remote-management/screen-sharing/testing tools that let you send fake keyboard and mouse events to an application. Sending a "select all" and "copy" to a browser might get you the page text you want. But this too seems pretty involved and probably isn't a robust solution.

    Finally, I should point out that JavaScripts can run throughout the time a page is being displayed. A "select all" and "copy" scheme only gets you a snapshot of the page at the current moment.

That was great, helped me a lot

This is a great article.
I am new to PHP and this helped me a lot to set up some functionality
at my site.

Thanks.

i would like to catch the title of a url

hi

I am now developing a search engine model site.

That is why I need to catch the title of the URL I am given,

and some text from the URL, like a Google search result,

and store it in my database for future searches.

Can you help me to do this?

regards

sakthivel

Re: i would like to catch the title of a url

First, I suggest you browse the beginning of my article on How to extract keywords from a web page. It explains how to get the page, extract its content type, and convert the text to UTF-8 before further processing.

Second, getting the page title is easy. After converting to UTF-8, use preg_match( ) to look for text between <title> and </title>.
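For example, a minimal sketch (it assumes $page already holds the page text converted to UTF-8, and that the <title> tag has no attributes):

// Extract the text between <title> and </title>; $title stays "" if none is found.
$title = "";
if ( preg_match( '!<title>(.*?)</title>!is', $page, $matches ) )
    $title = trim( $matches[1] );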

Third, creating synopsis text is hard. Early search engines used authored text in <meta> description tags. Spam web sites abused this by adding bogus text. So, search engines dropped <meta> tags (nobody uses them any more) and now create synopsis text themselves. Different search engines do this differently. Some may use the first paragraph on the page. Others, like Google, save the whole page and extract matching pieces of the text for each search. You can use one of these methods, or come up with something better. In any case, it's going to require some time to develop.

for search engine development

When I give it a URL, it should return the first paragraph of text, about three lines, into a variable. Suppose I give it www.php.net and click add; it should return something like the following, and I will add it to the database for future searches.

PHP: Downloads Regular source and binary snapshots are available from snaps.php.net. These are not intended for production use! To download the very latest development ... www.php.net/downloads.php

That is the Google search result; code like that is what I need. But I am using preg_match on the title and <p> tags for crawling, and for some sites it is not useful.

Thanks in advance for your guidance.

regards, sakthivel

Re: for search engine development

What I believe you are looking for is a definitive way to get useful synopsis text from a web page. The approach you've tried is to get the first paragraph. As you note, this won't work for all web sites. There may not be a <p> tag at all. And if there is, the first paragraph of text might not be the best choice for a synopsis anyway.

Unfortunately, there is no simple answer. Creating good synopsis text requires some way of deciding which words, sentences, or paragraphs are "best" on a page. And that's a very subjective idea that doesn't lend itself well to an automatic parser.

The current solution used by Google, and others, is to create a custom synopsis for each page shown for each search query. The synopsis includes the text before and after the best occurrence of the search words or phrase in the page's text. This means that the search engine database does not include a synopsis at all. It contains the full page text instead, and the synopsis is generated on-the-fly for every search query.
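As an illustration only (not the article's code, and certainly not how any real search engine does it), an on-the-fly snippet around the first occurrence of a query word could be built something like this, assuming $text holds the stored plain page text:

// Build a rough snippet of about $width characters around the first match.
function make_snippet( $text, $query, $width = 150 )
{
    // Find the first occurrence of the query, case-insensitively.
    $pos = strpos( strtolower( $text ), strtolower( $query ) );
    if ( $pos === false )
        return substr( $text, 0, $width );          // no match: fall back to the page start
    $start = max( 0, $pos - (int)($width / 2) );
    return '...' . substr( $text, $start, $width ) . '...';
}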

great article

Didn't really know anything about the curl library but needed to pull information from several pages. A nice summary like this is exactly what I was looking for! Thanks a lot!

Andrew

nexus.site90.net

nexus.site90.net

Curl is a great way to import MSN or GMAIL contacts. It is used in online transaction verification systems too.

PHP, cURL, CURLOPT_FOLLOWLOCATION and open_basedir/Safe Mode

While using the CURL functions with CURLOPT_FOLLOWLOCATION you may get an error. Please check that safe_mode is off and that open_basedir is not set in your configuration files.

For more information, I have posted a reply on the forum site below.

http://www.ezeeforum.com/viewtopic.php?f=8&t=22&p=164#p164
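One common workaround, when those settings cannot be changed, is to skip CURL's automatic redirect handling (a sketch only):

// CURLOPT_FOLLOWLOCATION is refused when safe_mode is on or open_basedir is set,
// so only request it when neither setting is in effect.
$restricted = ini_get( 'safe_mode' ) || ini_get( 'open_basedir' );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, !$restricted );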

Regards
Branze

How bout loading the pages but displaying other sections

What I mean by the subject line is: is there any way that, while CURL gets the web pages, the "main" content of the PHP page still loads, and the user can see some information that the page is still "loading" other content / reading other external web pages?

Thanks before!

Re: How bout loading the pages but displaying other sections

CURL does not provide intermediate results while it is reading the page content from a remote web server. If this is something you need to do, then you might use PHP's fopen wrappers instead. I discuss these in How to get a web page using the fopen wrappers.

But before going there, consider the overall design. If the idea is to give the user something to see in their browser while you're using CURL to get more content, then there are several ways you could do this without abandoning CURL.

One approach uses a frame. When the user clicks a link, it invokes server-side code that sends back a page with an empty frame and a "Loading..." message. The frame includes a page on your server that invokes your PHP code to use CURL to get content and return it as the content of that frame. When that framed content arrives at the user's browser, use a bit of JavaScript to take down the "Loading..." message.

Another approach does the same thing, but uses AJAX instead of a frame. The first page includes JavaScript that asks the server for content. The server runs your PHP and CURL code and returns the content. The JavaScript then inserts that content into the page and takes down a "Loading..." message.

There are probably several schemes like this that cause a second page's content to load while the user is looking at the first page.

Of course, I'm assuming that your PHP code using CURL is adding value somehow to the page it is retrieving from some other web server. If it is just going to forward it as-is to the user's browser, then skip CURL-ing and send that page's URL to the browser directly.

Also please note that there are legal issues. If you are scraping somebody else's web site to display their content as if it is your own, you can be sued for copyright infringement. There is a lot of legal precedent now and you'll probably lose. So be sure you're doing all this for legal and ethical reasons.

THANKS!

Thanks a lot for the great information on using cURL to fetch the contents of a website. After my webhost disabled the use of the file_get_contents function, I needed a quick fix... and this worked perfectly.

fantastic site

Clear, precise, and useful. I bookmarked your site and will go through it with a fine-tooth comb.

Get value

Thanks a lot, I was searching for this code for a long time. I need to get selected content from a web page using CURL. How can it be done? That is, I have to parse the value.

Without cURL

I have been using cURL for years, and it's safe to say it can do any HTTP fetching you need to do. Sometimes it can be disabled on cheap/free hosting servers though, which is a drag for some. It tends to be disabled alongside file_get_contents and fopen for remote resources (i.e. web based).

If anyone is reading and unable to use cool cURL functions, I have a small class using socket functions that can perform basic HTTP fetching here.

