PHP tip: How to extract URLs from a web page

Technologies: HTML 2+, XHTML 1+, WML 1+, PHP 4.3+, UTF-8

URL extraction is at the core of link checkers, search engine spiders, and a variety of web page analysis tools. While <a> and <img> elements are primary sources of URLs, there are more than 70 element attributes with URLs in HTML, XHTML, WML, and assorted HTML extensions. This tip shows how to extract URLs from all of these.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Introduction

URL extraction from web pages is useful for:

  • Link checkers to find dead links.
  • Search engine spiders to find paths between web pages.
  • Site map generators to construct hierarchical lists of site contents.
  • Content filters to check for inappropriate links to spam sites or adult content.
  • Anti-phishing filters to check that links go to the real site.
  • Web page analysis tools to collect file statistics and help optimize CDN use.

The <a> and <img> elements are often the principal focus for URL extraction, but there are more than 70 elements and attributes that contain URLs. For instance, HTML 4.01 includes URLs for the content of a <frame>, <iframe>, <script>, or <link>, the action for a <form>, the code for an <object>, the image for a <body> background, <input> icon, or image map <area>, the citation source for <blockquote>, <del>, <ins>, and <q>, and several more. There are more URLs in extensions to HTML, such as the background image on a <table>, <td>, or <th>, or the sound for a <bgsound>. HTML 5.0, still in the draft stage, adds elements for audio and video URLs, and the Web Forms 2.0 draft adds URLs for form data.

The code below handles URL extraction for all of these cases.

Code

The extract_html_urls( ) function below uses regular expressions to find HTML, XHTML, and WML element attributes that include URLs. The URLs are returned in an associative array of associative arrays of arrays. Keys to the outer associative array are element names, such as "a" for <a>. Keys to the inner associative array are attribute names, such as "href" for an <a> element.

For embedded CSS styles, the function calls extract_css_urls(), which is separately discussed in a companion article on How to extract URLs from a CSS file.

Usage examples and detailed explanations follow in the next sections.

Download: extract_html_urls.zip.

/**
 * Extract URLs from a web page.
 */
function extract_html_urls( $text )
{
    $match_elements = array(
        // HTML
        array('element'=>'a',       'attribute'=>'href'),       // 2.0
        array('element'=>'a',       'attribute'=>'urn'),        // 2.0
        array('element'=>'base',    'attribute'=>'href'),       // 2.0
        array('element'=>'form',    'attribute'=>'action'),     // 2.0
        array('element'=>'img',     'attribute'=>'src'),        // 2.0
        array('element'=>'link',    'attribute'=>'href'),       // 2.0
 
        array('element'=>'applet',  'attribute'=>'code'),       // 3.2
        array('element'=>'applet',  'attribute'=>'codebase'),   // 3.2
        array('element'=>'area',    'attribute'=>'href'),       // 3.2
        array('element'=>'body',    'attribute'=>'background'), // 3.2
        array('element'=>'img',     'attribute'=>'usemap'),     // 3.2
        array('element'=>'input',   'attribute'=>'src'),        // 3.2
 
        array('element'=>'applet',  'attribute'=>'archive'),    // 4.01
        array('element'=>'applet',  'attribute'=>'object'),     // 4.01
        array('element'=>'blockquote','attribute'=>'cite'),     // 4.01
        array('element'=>'del',     'attribute'=>'cite'),       // 4.01
        array('element'=>'frame',   'attribute'=>'longdesc'),   // 4.01
        array('element'=>'frame',   'attribute'=>'src'),        // 4.01
        array('element'=>'head',    'attribute'=>'profile'),    // 4.01
        array('element'=>'iframe',  'attribute'=>'longdesc'),   // 4.01
        array('element'=>'iframe',  'attribute'=>'src'),        // 4.01
        array('element'=>'img',     'attribute'=>'longdesc'),   // 4.01
        array('element'=>'input',   'attribute'=>'usemap'),     // 4.01
        array('element'=>'ins',     'attribute'=>'cite'),       // 4.01
        array('element'=>'object',  'attribute'=>'archive'),    // 4.01
        array('element'=>'object',  'attribute'=>'classid'),    // 4.01
        array('element'=>'object',  'attribute'=>'codebase'),   // 4.01
        array('element'=>'object',  'attribute'=>'data'),       // 4.01
        array('element'=>'object',  'attribute'=>'usemap'),     // 4.01
        array('element'=>'q',       'attribute'=>'cite'),       // 4.01
        array('element'=>'script',  'attribute'=>'src'),        // 4.01
 
        array('element'=>'audio',   'attribute'=>'src'),        // 5.0
        array('element'=>'command', 'attribute'=>'icon'),       // 5.0
        array('element'=>'embed',   'attribute'=>'src'),        // 5.0
        array('element'=>'event-source','attribute'=>'src'),    // 5.0
        array('element'=>'html',    'attribute'=>'manifest'),   // 5.0
        array('element'=>'source',  'attribute'=>'src'),        // 5.0
        array('element'=>'video',   'attribute'=>'src'),        // 5.0
        array('element'=>'video',   'attribute'=>'poster'),     // 5.0
 
        array('element'=>'bgsound', 'attribute'=>'src'),        // Extension
        array('element'=>'body',    'attribute'=>'credits'),    // Extension
        array('element'=>'body',    'attribute'=>'instructions'),//Extension
        array('element'=>'body',    'attribute'=>'logo'),       // Extension
        array('element'=>'div',     'attribute'=>'href'),       // Extension
        array('element'=>'div',     'attribute'=>'src'),        // Extension
        array('element'=>'embed',   'attribute'=>'code'),       // Extension
        array('element'=>'embed',   'attribute'=>'pluginspage'),// Extension
        array('element'=>'html',    'attribute'=>'background'), // Extension
        array('element'=>'ilayer',  'attribute'=>'src'),        // Extension
        array('element'=>'img',     'attribute'=>'dynsrc'),     // Extension
        array('element'=>'img',     'attribute'=>'lowsrc'),     // Extension
        array('element'=>'input',   'attribute'=>'dynsrc'),     // Extension
        array('element'=>'input',   'attribute'=>'lowsrc'),     // Extension
        array('element'=>'table',   'attribute'=>'background'), // Extension
        array('element'=>'td',      'attribute'=>'background'), // Extension
        array('element'=>'th',      'attribute'=>'background'), // Extension
        array('element'=>'layer',   'attribute'=>'src'),        // Extension
        array('element'=>'xml',     'attribute'=>'src'),        // Extension
 
        array('element'=>'button',  'attribute'=>'action'),     // Forms 2.0
        array('element'=>'datalist','attribute'=>'data'),       // Forms 2.0
        array('element'=>'form',    'attribute'=>'data'),       // Forms 2.0
        array('element'=>'input',   'attribute'=>'action'),     // Forms 2.0
        array('element'=>'select',  'attribute'=>'data'),       // Forms 2.0
 
        // XHTML
        array('element'=>'html',    'attribute'=>'xmlns'),
 
        // WML
        array('element'=>'access',  'attribute'=>'path'),       // 1.3
        array('element'=>'card',    'attribute'=>'onenterforward'),// 1.3
        array('element'=>'card',    'attribute'=>'onenterbackward'),// 1.3
        array('element'=>'card',    'attribute'=>'ontimer'),    // 1.3
        array('element'=>'go',      'attribute'=>'href'),       // 1.3
        array('element'=>'option',  'attribute'=>'onpick'),     // 1.3
        array('element'=>'template','attribute'=>'onenterforward'),// 1.3
        array('element'=>'template','attribute'=>'onenterbackward'),// 1.3
        array('element'=>'template','attribute'=>'ontimer'),    // 1.3
        array('element'=>'wml',     'attribute'=>'xmlns'),      // 2.0
    );
 
    $match_metas = array(
        'content-base',
        'content-location',
        'referer',
        'location',
        'refresh',
    );
 
    // Extract all elements
    if ( !preg_match_all( '/<([a-z][^>]*)>/iu', $text, $matches ) )
        return array( );
    $elements = $matches[1];
    $value_pattern = '=(("([^"]*)")|([^\s]*))';
 
    // Match elements and attributes
    foreach ( $match_elements as $match_element )
    {
        $name = $match_element['element'];
        $attr = $match_element['attribute'];
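        // The attribute name must follow whitespace so that, for example,
        // "src" is not found inside "dynsrc". The /s modifier lets "." match
        // newlines when attributes are spread over several lines.
        $pattern = '/^' . $name . '\s(?:.*\s)?' . $attr . $value_pattern . '/siu';
        if ( $name == 'object' && $attr == 'archive' )
            $split_pattern = '/\s+/u';  // Space-separated URL list
        else if ( $name == 'applet' && $attr == 'archive' )
            $split_pattern = '/,\s*/u'; // Comma-separated URL list
        else
            unset( $split_pattern );    // Single URL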
        foreach ( $elements as $element )
        {
            if ( !preg_match( $pattern, $element, $match ) )
                continue;
            $m = empty($match[3]) ? $match[4] : $match[3];
            if ( !isset( $split_pattern ) )
                $urls[$name][$attr][] = $m;
            else
            {
                $msplit = preg_split( $split_pattern, $m, -1, PREG_SPLIT_NO_EMPTY );
                foreach ( $msplit as $ms )
                    $urls[$name][$attr][] = $ms;
            }
        }
    }
 
    // Match meta http-equiv elements
    foreach ( $match_metas as $match_meta )
    {
        $attr_pattern    = '/http-equiv="?' . $match_meta . '"?/iu';
        $content_pattern = '/content'  . $value_pattern . '/iu';
        $refresh_pattern = '/\d*;\s*(url=)?(.*)$/iu';
        foreach ( $elements as $element )
        {
            if ( !preg_match( '/^meta/iu', $element ) ||
                !preg_match( $attr_pattern, $element ) ||
                !preg_match( $content_pattern, $element, $match ) )
                continue;
            $m = empty($match[3]) ? $match[4] : $match[3];
            if ( $match_meta != 'refresh' )
                $urls['meta']['http-equiv'][] = $m;
            else if ( preg_match( $refresh_pattern, $m, $match ) )
                $urls['meta']['http-equiv'][] = $match[2];
        }
    }
 
    // Match style attributes
    $urls['style'] = array( );
    $style_pattern = '/style' . $value_pattern . '/iu';
    foreach ( $elements as $element )
    {
        if ( !preg_match( $style_pattern, $element, $match ) )
            continue;
        $m = empty($match[3]) ? $match[4] : $match[3];
        $style_urls = extract_css_urls( $m );
        if ( !empty( $style_urls ) )
            $urls['style'] = array_merge_recursive(
                $urls['style'], $style_urls );
    }
 
    // Match style bodies
    if ( preg_match_all( '/<style[^>]*>(.*?)<\/style>/siu', $text, $style_bodies ) )
    {
        foreach ( $style_bodies[1] as $style_body )
        {
            $style_urls = extract_css_urls( $style_body );
            if ( !empty( $style_urls ) )
                $urls['style'] = array_merge_recursive(
                    $urls['style'], $style_urls );
        }
    }
    if ( empty($urls['style']) )
        unset( $urls['style'] );
 
    return $urls;
}

Examples

Read a web page using file_get_contents( ), extract its URLs, and print them:

$text = file_get_contents( $url );
$urls = extract_html_urls( $text );
print_r( $urls );

Only print the anchor URLs:

if ( !empty( $urls['a']['href'] ) )
    print_r( $urls['a']['href'] );

Only print the image URLs:

if ( !empty( $urls['img']['src'] ) )
    print_r( $urls['img']['src'] );

Collect all the extracted URLs into one long list, ignoring the element and attribute they came from:

$all_urls = array( );
foreach ( $urls as $element_entry )
    foreach ( $element_entry as $attr_entry )
        $all_urls = array_merge( $all_urls, $attr_entry );

Explanation

The code for URL extraction is mostly driven by a table of element and attribute names. The table's content is derived from specifications for HTML (2.0, 3.2, 4.01, 5.0 draft), XHTML (1.0, 1.1), WML (1.3, 2.0), Web Forms (2.0 draft), and browser-specific HTML extensions. Meta tags have an additional table of conventional attributes derived from the HTTP 1.1 specification and common use. All of these are discussed in the Appendices of this article.

The code proceeds through the following steps:

  1. Extract all elements from the text.
  2. Match all standard and extension elements that contain attribute URLs.
  3. Match all <meta> elements that contain HTTP header fields that contain URLs.
  4. Match all "style" attributes and parse their CSS for URLs.
  5. Match all <style>...</style> blocks and parse their CSS for URLs.

CSS text is passed to extract_css_urls(), discussed in the article How to extract URLs from a CSS file.
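
Because extract_css_urls() lives in the companion article, extract_html_urls() cannot run without it. For quick experiments, a minimal stand-in along the following lines may be enough. This is not the companion article's implementation: the name sketch_css_urls(), its regular expressions, and the "import" and "property" result keys are all assumptions made here for illustration.

```php
/**
 * Hypothetical stand-in for extract_css_urls(); a rough sketch only.
 * Collects @import targets and url(...) values from CSS text.
 */
function sketch_css_urls( $css )
{
    $urls = array( );

    // @import "file.css";  or  @import url(file.css) screen;
    $import_pattern =
        '/@import\s+(?:url\(\s*)?["\']?([^"\')\s;]+)["\']?\s*\)?[^;{}]*;?/iu';
    if ( preg_match_all( $import_pattern, $css, $m ) )
        $urls['import'] = $m[1];

    // Remove the @import statements so their url(...) is not counted twice
    $rest = preg_replace( $import_pattern, '', $css );

    // url(file), url("file"), or url('file'), with optional spaces
    if ( preg_match_all( '/url\(\s*["\']?([^"\')\s]+)["\']?\s*\)/iu', $rest, $m ) )
        $urls['property'] = $m[1];

    return $urls;
}

$css = '@import "a.css"; @import url(b.css) screen; '
     . 'body { background: url( "img/bg.png" ) }';
print_r( sketch_css_urls( $css ) );
```

The sketch does not understand CSS comments or escaped characters, so it is only suitable for trying out the HTML side of the code.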

Finding HTML elements

PHP's preg_match_all() is used to find all text strings that start with "<" and end with ">". Technically, neither of these characters may appear in HTML except as part of an element. In practice, browsers are more lenient and allow stand-alone "<" and ">", such as in a code example like "if ( a < b )". To mimic this behavior, extract_html_urls() recognizes the start of an element only if "<" is followed by a letter.
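
As a small illustration of this rule, the sketch below runs the same element pattern over a made-up sample string that contains a stand-alone "<". The sample text is an assumption chosen for the demonstration.

```php
// Sample text with a stand-alone "<" that is not part of an element
$text = '<p>if ( a < b ) then <A HREF="x.html">go</A></p>';

// Same pattern as in extract_html_urls(): "<", a letter, anything up to ">"
preg_match_all( '/<([a-z][^>]*)>/iu', $text, $matches );
print_r( $matches[1] );
```

Only the <p> and <A> start tags are captured. The stand-alone "<" is skipped because a space follows it, and the closing tags are skipped because "/" is not a letter.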

Finding elements and attributes

PHP's preg_match() is used to match elements and attributes by name. While XHTML specifications require that element and attribute names be in lower case, HTML is more lenient and real-world content mixes case. So, expression matches always add the /i pattern modifier to do caseless matches.
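
The sketch below shows caseless matching with a pattern of the same shape as the one the function builds for the src attribute of an <img> element. The element strings are made-up samples, written the way they look after the "<" and ">" have been stripped.

```php
$value_pattern = '=(("([^"]*)")|([^\s]*))';
$pattern       = '/^img\s.*src' . $value_pattern . '/iu';

$elements = array(
    'IMG SRC="logo.png"',           // Upper case, quoted value
    'img alt="" src=photo.jpg',     // Lower case, unquoted value
    'a href="x.html"',              // Not an <img>, so it is skipped
);

$found = array( );
foreach ( $elements as $element )
{
    if ( !preg_match( $pattern, $element, $match ) )
        continue;
    // Group 3 holds a quoted value, group 4 an unquoted one
    $found[] = empty($match[3]) ? $match[4] : $match[3];
}
print_r( $found );
```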

Handling character encodings

HTML, XHTML, and WML text defaults to the US-ASCII character encoding. This may be overridden:

  • For individual files, the content-type directive in the HTTP header when downloading the file may specify an alternative character encoding. This is very common.
  • HTML and XHTML text may include a <meta> element to specify a character encoding. This is fairly common.

While extract_html_urls() could look for a <meta> element and adapt, it cannot handle the more common case where the encoding is set in the HTTP header. Instead, it is up to the application to determine the encoding of the text and use PHP's iconv() to convert to UTF-8 first. Thereafter, extract_html_urls() uses the /u pattern modifier to handle Unicode character matching.
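
A conversion step along these lines could run before URL extraction. It is a sketch only: sketch_to_utf8() is a hypothetical helper name, and $charset is assumed to have already been found by the application, for example from the HTTP content-type header.

```php
/**
 * Convert page text to UTF-8 given its declared character encoding.
 * Returns false if iconv() cannot convert the text.
 */
function sketch_to_utf8( $text, $charset )
{
    if ( strcasecmp( $charset, 'UTF-8' ) == 0 )
        return $text;                       // Already UTF-8
    return iconv( $charset, 'UTF-8', $text );
}

// "café" in ISO-8859-1, where the accented e is the single byte 0xE9
$latin1 = "caf\xE9";
$utf8   = sketch_to_utf8( $latin1, 'ISO-8859-1' );

// In UTF-8 the same character takes two bytes, 0xC3 0xA9
print bin2hex( $utf8 ) . "\n";
```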

For more information on character encodings, see the article on How to get a web page's content type.

Using returned URLs

All URLs are returned in an associative array of associative arrays of arrays. The outermost array's keys are element names, while inner array keys are attribute names.
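
To make the nesting concrete, the sketch below walks a small hand-written array that has the same shape as the function's return value. The URLs in it are made-up sample data, not output from a real page.

```php
// Hypothetical result, shaped like the return value of extract_html_urls()
$urls = array(
    'a'   => array( 'href' => array( 'index.html', 'http://example.com/' ) ),
    'img' => array( 'src'  => array( 'logo.png' ) ),
);

// Outer keys are element names, inner keys are attribute names,
// and the innermost arrays are plain lists of URLs.
$lines = array( );
foreach ( $urls as $element => $attributes )
    foreach ( $attributes as $attribute => $list )
        foreach ( $list as $url )
            $lines[] = "<$element> $attribute: $url";

print implode( "\n", $lines ) . "\n";
```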

Returned URLs may be absolute or relative, depending upon how they were entered in the web page. Applications will need to use the page's base URL to convert relative URLs into absolute URLs if they intend to use the URLs for link checking or other analysis.
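
One possible way to do that conversion is sketched below. The function name sketch_absolute_url() is hypothetical and the code is a simplification: it assumes an http-style base URL with a host, ignores user names and passwords in the base, and does not honor a <base> element unless the application passes its href as $base.

```php
/**
 * Resolve a possibly relative URL against an absolute base URL.
 * A simplified sketch, not a complete URL resolver.
 */
function sketch_absolute_url( $base, $rel )
{
    if ( $rel === '' )
        return $base;
    // Already absolute: it has its own scheme (http:, mailto:, ...)
    if ( preg_match( '/^[a-z][a-z0-9+.\-]*:/i', $rel ) )
        return $rel;

    $b      = parse_url( $base );
    $scheme = $b['scheme'];
    $host   = $b['host'] . ( isset( $b['port'] ) ? ':' . $b['port'] : '' );
    $path   = isset( $b['path'] ) ? $b['path'] : '/';

    // Network-path reference, such as //cdn.example.com/x.js
    if ( substr( $rel, 0, 2 ) == '//' )
        return $scheme . ':' . $rel;

    // Fragment-only or query-only reference: keep the base page
    $nofrag = preg_replace( '/#.*$/s', '', $base );
    if ( $rel[0] == '#' )
        return $nofrag . $rel;
    if ( $rel[0] == '?' )
        return preg_replace( '/\?.*$/s', '', $nofrag ) . $rel;

    // Rooted path, or a path relative to the base page's directory
    if ( $rel[0] == '/' )
        $full = $rel;
    else
        $full = substr( $path, 0, strrpos( $path, '/' ) + 1 ) . $rel;

    // Set any query or fragment aside while the path is cleaned up
    $suffix = '';
    if ( preg_match( '/^([^?#]*)([?#].*)$/s', $full, $m ) )
    {
        $full   = $m[1];
        $suffix = $m[2];
    }

    // Remove "." segments and apply ".." segments
    $out = array( );
    foreach ( explode( '/', $full ) as $seg )
    {
        if ( $seg === '..' )
        {
            if ( count( $out ) > 1 )
                array_pop( $out );
        }
        else if ( $seg !== '.' )
            $out[] = $seg;
    }
    if ( $seg === '.' || $seg === '..' )
        $out[] = '';                // Keep the trailing slash

    return $scheme . '://' . $host . implode( '/', $out ) . $suffix;
}

$base = 'http://example.com/dir/page.html';
print sketch_absolute_url( $base, '../b.html?x=1' ) . "\n";
```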

Appendix A: HTML standards

Today's HTML is the result of three formal specifications: HTML 2.0 from 1995, HTML 3.2 from 1997, and HTML 4.01 from 1999.

HTML 2.0

HTML 2.0 defines rudimentary HTML, including the familiar <a>, <form>, <img>, and <link> elements and their URL attributes. One historical oddity is the urn attribute of <a>, which was never fully specified and was dropped in later specifications (but several browsers support it anyway).

HTML 2.0 elements and attributes with URLs
Element   Attribute   URL links to...
<a>       href        Destination for the anchor
<a>       urn         Destination for the anchor (not fully specified)
<base>    href        Absolute URL for resolving page relative URLs
<form>    action      Action to take when the form is submitted
<img>     src         Image to include on page
<link>    href        Content linked into the page

HTML 3.2

HTML 3.2 added more elements with URL attributes. The <applet> element supported Java applets, but it has since been deprecated in favor of <object> introduced in HTML 4.01.

HTML 3.2 additional elements and attributes with URLs
Element    Attribute    URL links to...
<applet>   code         Class file for the applet
<applet>   codebase     Absolute URL for resolving applet relative URLs
<area>     href         Destination for the image map region
<body>     background   Background image for the body
<img>      usemap       Client-side image map name
<input>    src          Icon image for the input control

HTML 4.01

HTML 4.01 added even more element attributes with URLs. While <script>, <frame>, <iframe>, and <object> get wide use, there are several obscure elements with URL attributes, such as citation URLs for the <del>, <ins>, <blockquote>, and <q> elements.

HTML 4.01 additional elements and attributes with URLs
Element        Attribute   URL links to...
<applet>       archive     JAR files for the applet
<applet>       object      Serialized representation for the applet
<blockquote>   cite        Source material for the quote
<del>          cite        Explanation for deleted content
<frame>        longdesc    Long description for the frame
<frame>        src         Content for the frame
<head>         profile     Catalog of metadata types and values
<iframe>       longdesc    Long description for the frame
<iframe>       src         Content for the frame
<img>          longdesc    Long description for the image
<input>        usemap      Client-side image map name
<ins>          cite        Explanation for inserted content
<object>       archive     Archive files for the object
<object>       classid     Implementation for the object
<object>       codebase    Absolute URL for resolving object relative URLs
<object>       data        Data for the object
<object>       usemap      Client-side image map name
<q>            cite        Source material for the quote
<script>       src         Script linked into the page

HTML 5.0 (draft)

As of 2008, HTML 5 is still an early draft. It may take several more years to reach a final specification, but a few browsers have already begun to add support. HTML 5 adds more elements for multimedia content.

HTML 5.0 (draft) additional elements and attributes with URLs
Element          Attribute   URL links to...
<audio>          src         Audio content
<command>        icon        Icon image for input control
<embed>          src         Data for the object
<event-source>   src         Server-side event source
<html>           manifest    Manifest for the page's content
<source>         src         Multimedia content
<video>          src         Video content
<video>          poster      Image when no video

Web Forms 2.0 (draft)

Web Forms 2.0 is a working draft to extend form features to support data-driven forms and the use of form controls, like buttons, outside of a form. The draft's features are expected to be incorporated into HTML 5.0, but meanwhile the specification is reasonably stable and some browsers already support it. The specification adds a few more elements and attributes with URLs:

Web Forms 2.0 (draft) additional elements and attributes with URLs
Element      Attribute   URL links to...
<button>     action      Action to take on a button press
<datalist>   data        Data available for a list
<form>       data        Data for the form
<input>      action      Action to take on an input choice
<select>     data        Data for the selection list

HTML extensions

Over the years, browser makers have proposed new elements and attributes for HTML. Many of these have become part of the standards. Others have not, but remain in use.

Microsoft's HTML Elements documentation lists several non-standard elements that include URLs:

Microsoft Internet Explorer additional elements and attributes with URLs
Element     Attribute     URL links to...
<bgsound>   src           Audio content
<embed>     pluginspage   Plugin to be embedded
<img>       dynsrc        Video content
<img>       lowsrc        Low-resolution alternative image
<input>     dynsrc        Video content
<input>     lowsrc        Low-resolution alternative icon image
<table>     background    Background image for the table
<td>        background    Background image for the data cell
<th>        background    Background image for the header cell
<xml>       src           XML content

Microsoft's WebTV adds a few more URL attributes:

Microsoft WebTV additional elements and attributes with URLs
Element   Attribute      URL links to...
<body>    credits        Credits
<body>    instructions   Instructions
<body>    logo           Product logo

Netscape adds a few more (Netscape has been discontinued and its HTML extensions documentation is now gone):

Netscape additional elements and attributes with URLs
Element    Attribute    URL links to...
<html>     background   Background image for the page
<ilayer>   src          Content for the inline layer
<layer>    src          Content for the layer
<div>      src          Content for the layer
<div>      href         Destination for the div as an anchor!

Apple's Safari supports several elements anticipating features in HTML 5 and for compatibility with legacy extensions to HTML. Only one Safari-specific element includes a URL attribute:

Apple Safari additional elements and attributes with URLs
Element   Attribute   URL links to...
<embed>   code        Data for an embedded object

Meta elements

The <meta> element is a catch-all for generic information about the web page, such as who created it, when, and why. One important use stores HTTP header information so that a web page saved to a file by a user still has essential information, like the content type, character set encoding, etc.

The HTTP 1.1 specification lists the following HTTP header fields that contain URLs. While use of these varies from rare to very rare, they are possible in a <meta> element and are checked by extract_html_urls(). For all of these, the <meta> element's "http-equiv" attribute contains the HTTP field name and its "content" attribute contains the URL value.

HTTP 1.1 header fields with URLs
Field              URL links to...
content-base       Base URL for the page
content-location   URL for the page
referer            URL of the page that linked to the page
location           Redirection URL for the page

A <meta> refresh sets "http-equiv" to "refresh" (even though it isn't an HTTP header field) and the "content" attribute to a refresh time, in seconds, and an optional URL. While refreshes are not standardized, are deprecated, and are usually very bad form, they are in common use. The "content" attribute has one of these forms:

  • "123" - refresh the same page in 123 seconds.
  • "123; http://example.com" - refresh to example.com in 123 seconds.
  • "123; url=http://example.com" - refresh to example.com in 123 seconds.

The extract_html_urls() function handles all of these.
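
The three forms can be tried in isolation with a sketch like the one below. It reuses the refresh pattern from extract_html_urls(); the wrapper name sketch_refresh_url() and the choice to return null when no URL is present are assumptions made for this illustration.

```php
/**
 * Pull the URL, if any, out of a <meta> refresh "content" value.
 */
function sketch_refresh_url( $content )
{
    // Same pattern as $refresh_pattern in extract_html_urls():
    // a delay, a semicolon, an optional "url=", then the URL itself
    if ( preg_match( '/\d*;\s*(url=)?(.*)$/iu', $content, $match ) )
        return $match[2];
    return null;        // No ";" part, so the same page is refreshed
}

var_dump( sketch_refresh_url( '123' ) );
var_dump( sketch_refresh_url( '123; http://example.com' ) );
var_dump( sketch_refresh_url( '123; url=http://example.com' ) );
```

Because the pattern is caseless, a value written as "0;URL=next.html" is handled the same way.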

While there are other uses of <meta> element content, there are no formal standards and no widely used de facto standards. So, extract_html_urls() ignores these.

Special cases

There are a few special cases to handle:

  • The archive attribute of an <object> may contain a list of URLs, space separated.
  • The archive attribute of an <applet> may contain a list of URLs, comma separated (yup, not space separated like in an <object>).
  • Any element may include a style attribute that includes CSS code. That code may use @import or url() syntax that each contain a URL. See the article on How to extract URLs from a CSS file for further information.
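
The two archive list formats can be split as sketched below. The helper name sketch_split_archive() is hypothetical. Note that the space-separated case uses \s+ rather than \s*, since a pattern that can match the empty string would split the value between every character.

```php
/**
 * Split an "archive" attribute value into its individual URLs.
 */
function sketch_split_archive( $element, $value )
{
    // <object archive> is space separated; <applet archive> is comma separated
    if ( strtolower( $element ) == 'object' )
        $pattern = '/\s+/u';
    else
        $pattern = '/\s*,\s*/u';
    return preg_split( $pattern, trim( $value ), -1, PREG_SPLIT_NO_EMPTY );
}

print_r( sketch_split_archive( 'object', 'a.jar  b.jar' ) );
print_r( sketch_split_archive( 'applet', 'a.jar, b.jar,c.jar' ) );
```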

Appendix B: XHTML standards

XHTML 1.0 in 2000, and XHTML 1.1 in 2001, redefined HTML elements with a more rigorous XML syntax. Most of HTML 4.01's URL-using elements are also available in XHTML 1.1, plus one new element attribute to specify XHTML's XML namespace:

XHTML 1.1 additional elements and attributes with URLs
Element   Attribute   URL links to...
<html>    xmlns       XHTML name space

As of 2008, the XHTML 2.0 specification is in an early draft stage and does not yet have enough detail to list new elements and attributes that use URLs.

Appendix C: WML standards

In the late 1990s, when mobile device makers wanted to add dynamic content to their cell phones, they approached it with a hypertext style that defined multiple "cards" in a "deck" and links between the cards. Their Wireless Markup Language, WML, borrowed some of the syntax of HTML 2.0.

Mobile device "portals" on the web may use WML, and many web browsers support it.

WML 1.2 and 1.3

WML 1.2 and 1.3 support much of HTML 2.0, then add several more elements that may contain URLs or file paths:

WML 1.3 additional elements and attributes with URLs
Element      Attribute         URL links to...
<access>     path              Access limited to other decks with this path
<card>       onenterforward    Page to load going forward
<card>       onenterbackward   Page to load going backward
<card>       ontimer           Page to load after a timer expiration
<go>         href              Destination for an anchor
<option>     onpick            Page to load after an option is selected
<template>   onenterforward    Page to load going forward
<template>   onenterbackward   Page to load going backward
<template>   ontimer           Page to load after a timer expiration

DevGuru has a nice WML 1.2 summary.

WML 2.0

WML 2.0 redefined WML atop XHTML 1.0 and added one more element attribute with a URL:

WML 2.0 additional elements and attributes with URLs
Element   Attribute   URL links to...
<wml>     xmlns       WML name space

The WML 1.3 and 2.0 DTDs are free, but the specifications cost money.

Appendix D: Why not use DOCTYPE?

The DOCTYPE listed on the first line of a web page refers to a DTD (Document Type Definition) that gives a detailed specification of the HTML, XHTML, WML, or other syntax used by the page. Why not use this to find element attributes with URLs, instead of using a giant table of known elements?

  • The DOCTYPE is optional. Older content doesn't have it.
  • The DOCTYPE refers to an XML DTD that should match the page, but may not. It is common for content to refer to an HTML DTD, but include HTML extensions not found in that DTD.
  • DTDs define syntax, not semantics. They say which elements have which attributes, but not what those attributes mean. <meta> elements, for instance, have a content attribute that contains CDATA (character data). That's all the DTD says, but we know that certain <meta> elements contain HTTP header equivalents that include URLs. The DTD doesn't tell us enough to handle these, and many other cases.

Comments

Bugfix

I found this script very useful, but I found one bug. Sometimes attributes are broken onto multiple lines like the following:

To make this work, change the regex from:
$pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/iu';

to:
$pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/siu';

So that the dot can match newlines.

Thank you! I am working on a project where I need to extract urls from any given file, such as bookmark files exported from ie or ff as html. As simple as it may seem, I couldn't get this done despite endless attempts. I stumbled upon this post and tried your code, it worked right off the bat. Thank you for your detailed explanation as well!
