URL extraction is at the core of link checkers, search engine spiders, and a variety of web page analysis tools. While <a> and <img> elements are primary sources of URLs, there are more than 70 element attributes with URLs in HTML, XHTML, WML, and assorted HTML extensions. This tip shows how to extract URLs from all of these.
Table of Contents
- Introduction
- Code
- Examples
- Explanation
  - Finding HTML elements
  - Finding elements and attributes
  - Handling character encodings
  - Using returned URLs
- Appendix A: HTML standards
  - HTML 2.0
  - HTML 3.2
  - HTML 4.01
  - HTML 5.0 (draft)
  - Web Forms 2.0 (draft)
  - HTML extensions
  - Meta elements
  - Special cases
- Appendix B: XHTML standards
- Appendix C: WML standards
- Appendix D: Why not use DOCTYPE?
- Downloads
- Further reading
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.
Introduction
URL extraction from web pages is useful for:
- Link checkers to find dead links.
- Search engine spiders to find paths between web pages.
- Site map generators to construct hierarchical lists of site contents.
- Content filters to check for inappropriate links to spam sites or adult content.
- Anti-phishing filters to check that links go to the real site.
- Web page analysis tools to collect file statistics and help optimize CDN use.
The <a> and <img> elements are often the principal focus for URL extraction, but there are more than 70 elements and attributes that contain URLs. For instance, HTML 4.01 includes URLs for the content of a <frame>, <iframe>, <script>, or <link>, the action for a <form>, the code for an <object>, the image for a <body> background, <input> icon, or image map <area>, the citation source for <blockquote>, <del>, <ins>, and <q>, and several more. There are more URLs in extensions to HTML, such as the background image on a <table>, <td>, or <th>, or the sound for a <bgsound>. HTML 5.0, still in the draft stage, adds elements for audio and video URLs, and the Web Forms 2.0 draft adds URLs for form data.
The code below handles URL extraction for all of these cases.
Code
The extract_html_urls( ) function below uses regular expressions to find HTML, XHTML, and WML element attributes that include URLs. The URLs are returned in an associative array of associative arrays of arrays. Keys to the outer associative array are element names, such as "a" for <a>. Keys to the inner associative array are attribute names, such as "href" for an <a> element.
For embedded CSS styles, the function calls extract_css_urls(), which is separately discussed in a companion article on How to extract URLs from a CSS file.
Usage examples and detailed explanations follow in the next sections.
Download: extract_html_urls.zip.
/**
* Extract URLs from a web page.
*/
function extract_html_urls( $text )
{
$match_elements = array(
// HTML
array('element'=>'a', 'attribute'=>'href'), // 2.0
array('element'=>'a', 'attribute'=>'urn'), // 2.0
array('element'=>'base', 'attribute'=>'href'), // 2.0
array('element'=>'form', 'attribute'=>'action'), // 2.0
array('element'=>'img', 'attribute'=>'src'), // 2.0
array('element'=>'link', 'attribute'=>'href'), // 2.0
array('element'=>'applet', 'attribute'=>'code'), // 3.2
array('element'=>'applet', 'attribute'=>'codebase'), // 3.2
array('element'=>'area', 'attribute'=>'href'), // 3.2
array('element'=>'body', 'attribute'=>'background'), // 3.2
array('element'=>'img', 'attribute'=>'usemap'), // 3.2
array('element'=>'input', 'attribute'=>'src'), // 3.2
array('element'=>'applet', 'attribute'=>'archive'), // 4.01
array('element'=>'applet', 'attribute'=>'object'), // 4.01
array('element'=>'blockquote','attribute'=>'cite'), // 4.01
array('element'=>'del', 'attribute'=>'cite'), // 4.01
array('element'=>'frame', 'attribute'=>'longdesc'), // 4.01
array('element'=>'frame', 'attribute'=>'src'), // 4.01
array('element'=>'head', 'attribute'=>'profile'), // 4.01
array('element'=>'iframe', 'attribute'=>'longdesc'), // 4.01
array('element'=>'iframe', 'attribute'=>'src'), // 4.01
array('element'=>'img', 'attribute'=>'longdesc'), // 4.01
array('element'=>'input', 'attribute'=>'usemap'), // 4.01
array('element'=>'ins', 'attribute'=>'cite'), // 4.01
array('element'=>'object', 'attribute'=>'archive'), // 4.01
array('element'=>'object', 'attribute'=>'classid'), // 4.01
array('element'=>'object', 'attribute'=>'codebase'), // 4.01
array('element'=>'object', 'attribute'=>'data'), // 4.01
array('element'=>'object', 'attribute'=>'usemap'), // 4.01
array('element'=>'q', 'attribute'=>'cite'), // 4.01
array('element'=>'script', 'attribute'=>'src'), // 4.01
array('element'=>'audio', 'attribute'=>'src'), // 5.0
array('element'=>'command', 'attribute'=>'icon'), // 5.0
array('element'=>'embed', 'attribute'=>'src'), // 5.0
array('element'=>'event-source','attribute'=>'src'), // 5.0
array('element'=>'html', 'attribute'=>'manifest'), // 5.0
array('element'=>'source', 'attribute'=>'src'), // 5.0
array('element'=>'video', 'attribute'=>'src'), // 5.0
array('element'=>'video', 'attribute'=>'poster'), // 5.0
array('element'=>'bgsound', 'attribute'=>'src'), // Extension
array('element'=>'body', 'attribute'=>'credits'), // Extension
array('element'=>'body', 'attribute'=>'instructions'),//Extension
array('element'=>'body', 'attribute'=>'logo'), // Extension
array('element'=>'div', 'attribute'=>'href'), // Extension
array('element'=>'div', 'attribute'=>'src'), // Extension
array('element'=>'embed', 'attribute'=>'code'), // Extension
array('element'=>'embed', 'attribute'=>'pluginspage'),// Extension
array('element'=>'html', 'attribute'=>'background'), // Extension
array('element'=>'ilayer', 'attribute'=>'src'), // Extension
array('element'=>'img', 'attribute'=>'dynsrc'), // Extension
array('element'=>'img', 'attribute'=>'lowsrc'), // Extension
array('element'=>'input', 'attribute'=>'dynsrc'), // Extension
array('element'=>'input', 'attribute'=>'lowsrc'), // Extension
array('element'=>'table', 'attribute'=>'background'), // Extension
array('element'=>'td', 'attribute'=>'background'), // Extension
array('element'=>'th', 'attribute'=>'background'), // Extension
array('element'=>'layer', 'attribute'=>'src'), // Extension
array('element'=>'xml', 'attribute'=>'src'), // Extension
array('element'=>'button', 'attribute'=>'action'), // Forms 2.0
array('element'=>'datalist','attribute'=>'data'), // Forms 2.0
array('element'=>'form', 'attribute'=>'data'), // Forms 2.0
array('element'=>'input', 'attribute'=>'action'), // Forms 2.0
array('element'=>'select', 'attribute'=>'data'), // Forms 2.0
// XHTML
array('element'=>'html', 'attribute'=>'xmlns'),
// WML
array('element'=>'access', 'attribute'=>'path'), // 1.3
array('element'=>'card', 'attribute'=>'onenterforward'),// 1.3
array('element'=>'card', 'attribute'=>'onenterbackward'),// 1.3
array('element'=>'card', 'attribute'=>'ontimer'), // 1.3
array('element'=>'go', 'attribute'=>'href'), // 1.3
array('element'=>'option', 'attribute'=>'onpick'), // 1.3
array('element'=>'template','attribute'=>'onenterforward'),// 1.3
array('element'=>'template','attribute'=>'onenterbackward'),// 1.3
array('element'=>'template','attribute'=>'ontimer'), // 1.3
array('element'=>'wml', 'attribute'=>'xmlns'), // 2.0
);
$match_metas = array(
'content-base',
'content-location',
'referer',
'location',
'refresh',
);
// Extract all elements
if ( !preg_match_all( '/<([a-z][^>]*)>/iu', $text, $matches ) )
return array( );
$elements = $matches[1];
$value_pattern = '=(("([^"]*)")|([^\s]*))';
// Match elements and attributes
foreach ( $match_elements as $match_element )
{
$name = $match_element['element'];
$attr = $match_element['attribute'];
$pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/iu';
if ( $attr == 'archive' && $name == 'object' )
$split_pattern = '/\s+/u'; // Space-separated URL list
else if ( $attr == 'archive' && $name == 'applet' )
$split_pattern = '/,\s*/u'; // Comma-separated URL list
else
unset( $split_pattern ); // Single URL
foreach ( $elements as $element )
{
if ( !preg_match( $pattern, $element, $match ) )
continue;
$m = empty($match[3]) ? $match[4] : $match[3];
if ( !isset( $split_pattern ) )
$urls[$name][$attr][] = $m;
else
{
$msplit = preg_split( $split_pattern, $m );
foreach ( $msplit as $ms )
$urls[$name][$attr][] = $ms;
}
}
}
// Match meta http-equiv elements
foreach ( $match_metas as $match_meta )
{
$attr_pattern = '/http-equiv="?' . $match_meta . '"?/iu';
$content_pattern = '/content' . $value_pattern . '/iu';
$refresh_pattern = '/\d*;\s*(url=)?(.*)$/iu';
foreach ( $elements as $element )
{
if ( !preg_match( '/^meta/iu', $element ) ||
!preg_match( $attr_pattern, $element ) ||
!preg_match( $content_pattern, $element, $match ) )
continue;
$m = empty($match[3]) ? $match[4] : $match[3];
if ( $match_meta != 'refresh' )
$urls['meta']['http-equiv'][] = $m;
else if ( preg_match( $refresh_pattern, $m, $match ) )
$urls['meta']['http-equiv'][] = $match[2];
}
}
// Match style attributes
$urls['style'] = array( );
$style_pattern = '/style' . $value_pattern . '/iu';
foreach ( $elements as $element )
{
if ( !preg_match( $style_pattern, $element, $match ) )
continue;
$m = empty($match[3]) ? $match[4] : $match[3];
$style_urls = extract_css_urls( $m );
if ( !empty( $style_urls ) )
$urls['style'] = array_merge_recursive(
$urls['style'], $style_urls );
}
// Match style bodies
if ( preg_match_all( '/<style[^>]*>(.*?)<\/style>/siu', $text, $style_bodies ) )
{
foreach ( $style_bodies[1] as $style_body )
{
$style_urls = extract_css_urls( $style_body );
if ( !empty( $style_urls ) )
$urls['style'] = array_merge_recursive(
$urls['style'], $style_urls );
}
}
if ( empty($urls['style']) )
unset( $urls['style'] );
return $urls;
}
Examples
Read a web page using file_get_contents( ), extract its URLs, and print them:
$text = file_get_contents( $url );
$urls = extract_html_urls( $text );
print_r( $urls );
Only print the anchor URLs:
if ( !empty( $urls['a']['href'] ) )
print_r( $urls['a']['href'] );
Only print the image URLs:
if ( !empty( $urls['img']['src'] ) )
print_r( $urls['img']['src'] );
Collect all the extracted URLs into one long list, ignoring the element and attribute they came from:
$all_urls = array( );
foreach ( $urls as $element_entry )
foreach ( $element_entry as $attr_entry )
$all_urls = array_merge( $all_urls, $attr_entry );
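The flattened list can contain the same URL several times. As a minimal sketch, assuming a hand-written `$urls` array shaped like the function's return value (the sample values and the helper name `unique_urls()` are made up for this illustration), duplicates could be dropped like this:

```php
<?php
// Hypothetical result, shaped like the return value of extract_html_urls():
// element name => attribute name => list of URLs.
$urls = array(
    'a'   => array( 'href' => array( 'index.html', 'about.html', 'index.html' ) ),
    'img' => array( 'src'  => array( 'logo.png' ) ),
);

// Hypothetical helper: flatten the nested arrays into one list and
// remove duplicate URLs, keeping the first occurrence of each.
function unique_urls( $urls )
{
    $all_urls = array( );
    foreach ( $urls as $element_entry )
        foreach ( $element_entry as $attr_entry )
            $all_urls = array_merge( $all_urls, $attr_entry );
    return array_values( array_unique( $all_urls ) );
}

print_r( unique_urls( $urls ) );
```

Here array_values() only renumbers the keys left behind by array_unique().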
Explanation
The code for URL extraction is mostly driven by a table of element and attribute names. The table's content is derived from specifications for HTML (2.0, 3.2, 4.01, 5.0 draft), XHTML (1.0, 1.1), WML (1.3, 2.0), Web Forms (2.0 draft), and browser-specific HTML extensions. Meta tags have an additional table of conventional attributes derived from the HTTP 1.1 specification and common use. All of these are discussed in the Appendices of this article.
The code proceeds through the following steps:
- Extract all elements from the text.
- Match all standard and extension elements that contain attribute URLs.
- Match all <meta> elements that contain HTTP header fields that contain URLs.
- Match all "style" attributes and parse their CSS for URLs.
- Match all <style>...</style> blocks and parse their CSS for URLs.
CSS text is passed to extract_css_urls(), discussed in the article How to extract URLs from a CSS file.
Finding HTML elements
PHP's preg_match_all() is used to find all text strings that start with "<" and end with ">". Technically, neither of these characters may appear in HTML except as part of an element. In practice, browsers are more lenient and allow stand-alone "<" and ">", such as in a code example like "if ( a < b )". To mimic this behavior, extract_html_urls() recognizes the start of an element only if "<" is followed by a letter.
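As a small illustration of that rule, the sketch below applies the same pattern used in extract_html_urls() to a made-up snippet that contains a stand-alone "<". The helper name `find_elements()` and the sample text are assumptions for this example only.

```php
<?php
// Hypothetical helper: return the text between "<" and ">" for every
// element whose "<" is immediately followed by a letter.
function find_elements( $text )
{
    if ( !preg_match_all( '/<([a-z][^>]*)>/iu', $text, $matches ) )
        return array( );
    return $matches[1];
}

// The "<" in "a < b" is followed by a space, so it is skipped.
// Closing tags such as "</a>" are skipped too, since "/" is not a letter.
$sample = '<p>if ( a < b ) then <a href="x.html">go</a></p>';
print_r( find_elements( $sample ) );
```

For this sample the helper returns two entries: `p` and `a href="x.html"`.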
Finding elements and attributes
PHP's preg_match() is used to match elements and attributes by name. While XHTML specifications require that element and attribute names be in lower case, HTML is more lenient and real-world content mixes case. So, expression matches always add the /i pattern modifier to do caseless matches.
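A minimal sketch of the caseless match, built from the same pattern pieces as the code above. The helper name `match_attribute()` and the sample element strings are hypothetical.

```php
<?php
// Hypothetical helper: return the value of $attr on an element string
// (the text between "<" and ">"), or null if the element does not match.
function match_attribute( $name, $attr, $element )
{
    $value_pattern = '=(("([^"]*)")|([^\s]*))';
    $pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/iu';
    if ( !preg_match( $pattern, $element, $match ) )
        return null;
    // Group 3 holds a quoted value, group 4 an unquoted one
    return empty( $match[3] ) ? $match[4] : $match[3];
}

// Upper and mixed case names still match because of the /i modifier
echo match_attribute( 'a', 'href', 'A HREF="index.html"' ), "\n";
echo match_attribute( 'img', 'src', 'Img Alt="logo" Src=logo.png' ), "\n";
```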
Handling character encodings
HTML, XHTML, and WML text defaults to the US-ASCII character encoding. This may be overridden:
- For individual files, the content-type directive in the HTTP header when downloading the file may specify an alternative character encoding. This is very common.
- HTML and XHTML text may include a <meta> element to specify a character encoding. This is fairly common.
While extract_html_urls() could look for a <meta> element and adapt, it cannot handle the more common case where the encoding is set in the HTTP header. Instead, it is up to the application to determine the encoding of the text and use PHP's iconv() to convert to UTF-8 first. Thereafter, extract_html_urls() uses the /u pattern modifier to handle Unicode character matching.
For more information on character encodings, see the article on How to get a web page's content type.
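A minimal sketch of the conversion step, assuming the application has already found the character set name (here hard-coded as ISO-8859-1). The helper name `to_utf8()` and the sample text are hypothetical.

```php
<?php
// Hypothetical helper: convert page text to UTF-8, given the charset
// name the application found in the HTTP header or a <meta> element.
function to_utf8( $text, $charset )
{
    if ( strcasecmp( $charset, 'UTF-8' ) == 0 )
        return $text;                       // Nothing to convert
    $converted = iconv( $charset, 'UTF-8', $text );
    // iconv() returns false on failure; fall back to the original text
    return ( $converted === false ) ? $text : $converted;
}

// "\xE9" is an e-acute in ISO-8859-1; in UTF-8 it becomes "\xC3\xA9"
$latin1 = "<a href=\"caf\xE9.html\">Caf\xE9</a>";
$utf8   = to_utf8( $latin1, 'ISO-8859-1' );
echo bin2hex( $utf8 ), "\n";
```

The converted text could then be passed to extract_html_urls().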
Using returned URLs
All URLs are returned in an associative array of associative arrays of arrays. The outermost array's keys are element names, while inner array keys are attribute names.
Returned URLs may be absolute or relative, depending upon how they were entered in the web page. Applications will need to use the page's base URL to convert relative URLs into absolute URLs if they intend to use the URLs for link checking or other analysis.
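The sketch below shows one way an application might do that conversion. It is a simplified take on RFC 3986 reference resolution, not part of the downloadable code: the helper name `resolve_url()` is hypothetical, the base URL is assumed to have a scheme and host, and user info and other edge cases are ignored.

```php
<?php
// Hypothetical helper: resolve $rel against the absolute URL $base.
// Simplified sketch; assumes $base has a scheme and a host.
function resolve_url( $base, $rel )
{
    // Already absolute: starts with a scheme such as "http:"
    if ( preg_match( '/^[a-z][a-z0-9+.\-]*:/i', $rel ) )
        return $rel;

    $b = parse_url( $base );
    $root = $b['scheme'] . '://' . $b['host'];
    if ( isset( $b['port'] ) )
        $root .= ':' . $b['port'];
    $base_path = isset( $b['path'] ) ? $b['path'] : '/';

    if ( $rel === '' )                      // Empty: the page itself
        return $base;
    if ( substr( $rel, 0, 2 ) == '//' )     // Scheme-relative
        return $b['scheme'] . ':' . $rel;
    if ( $rel[0] == '#' )                   // Fragment only
        return preg_replace( '/#.*$/', '', $base ) . $rel;
    if ( $rel[0] == '?' )                   // Query only
        return $root . $base_path . $rel;

    // Separate the path from any query string or fragment
    preg_match( '/^([^?#]*)(.*)$/s', $rel, $m );
    $path = $m[1];
    $suffix = $m[2];

    // Relative paths start from the directory of the base page
    if ( $path[0] != '/' )
        $path = substr( $base_path, 0, strrpos( $base_path, '/' ) + 1 ) . $path;

    // Remove "." and ".." segments
    $trailing_slash = preg_match( '#/(\.\.?)?$#', $path );
    $segments = array( );
    foreach ( explode( '/', $path ) as $segment )
    {
        if ( $segment === '..' )
            array_pop( $segments );
        else if ( $segment !== '.' && $segment !== '' )
            $segments[] = $segment;
    }
    $path = '/' . implode( '/', $segments );
    if ( $trailing_slash && $path != '/' )
        $path .= '/';

    return $root . $path . $suffix;
}

$base = 'http://example.com/docs/guide/index.html';
echo resolve_url( $base, '../images/logo.png' ), "\n";
echo resolve_url( $base, '/about.html' ), "\n";
```

If the page has a <base> element, its href value (returned in $urls['base']['href']) would normally be used as the base instead of the page's own URL.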
Appendix A: HTML standards
Today's HTML is the result of three formal specifications: HTML 2.0 from 1995, HTML 3.2 from 1997, and HTML 4.01 from 1999.
HTML 2.0
HTML 2.0 defines rudimentary HTML, including the familiar <a>, <form>, <img>, and <link> elements and their URL attributes. One historical oddity is the urn attribute of <a>, which was never fully specified and was dropped in later specifications (though several browsers support it anyway).
| Element | Attribute | URL links to... |
|---|---|---|
| <a> | href | Destination for the anchor |
| <a> | urn | Destination for the anchor (not fully specified) |
| <base> | href | Absolute URL for resolving page relative URLs |
| <form> | action | Action to take when the form is submitted |
| <img> | src | Image to include on page |
| <link> | href | Content linked into the page |
HTML 3.2
HTML 3.2 added more elements with URL attributes. The <applet> element supported Java applets, but it has since been deprecated in favor of <object> introduced in HTML 4.01.
| Element | Attribute | URL links to... |
|---|---|---|
| <applet> | code | Class file for the applet |
| <applet> | codebase | Absolute URL for resolving applet relative URLs |
| <area> | href | Destination for the image map region |
| <body> | background | Background image for the body |
| <img> | usemap | Client-side image map name |
| <input> | src | Icon image for the input control |
HTML 4.01
HTML 4.01 added even more element attributes with URLs. While <script>, <frame>, <iframe>, and <object> get wide use, there are several obscure elements with URL attributes, such as citation URLs for the <del>, <ins>, <blockquote>, and <q> elements.
| Element | Attribute | URL links to... |
|---|---|---|
| <applet> | archive | JAR files for the applet |
| <applet> | object | Serialized representation for the applet |
| <blockquote> | cite | Source material for the quote |
| <del> | cite | Explanation for deleted content |
| <frame> | longdesc | Long description for the frame |
| <frame> | src | Content for the frame |
| <head> | profile | Catalog of metadata types and values |
| <iframe> | longdesc | Long description for the frame |
| <iframe> | src | Content for the frame |
| <img> | longdesc | Long description for the image |
| <input> | usemap | Client-side image map name |
| <ins> | cite | Explanation for inserted content |
| <object> | archive | Archive files for the object |
| <object> | classid | Implementation for the object |
| <object> | codebase | Absolute URL for resolving object relative URLs |
| <object> | data | Data for the object |
| <object> | usemap | Client-side image map name |
| <q> | cite | Source material for the quote |
| <script> | src | Script linked into the page |
HTML 5.0 (draft)
In 2008, HTML 5 is still an early draft. It may take several more years to reach a final specification, but a few browsers have begun to add support already. HTML 5 adds more elements for multimedia content.
| Element | Attribute | URL links to... |
|---|---|---|
| <audio> | src | Audio content |
| <command> | icon | Icon image for input control |
| <embed> | src | Data for the object |
| <event-source> | src | Server-side event source |
| <html> | manifest | Manifest for the page's content |
| <source> | src | Multimedia content |
| <video> | src | Video content |
| <video> | poster | Image when no video |
Web Forms 2.0 (draft)
Web Forms 2.0 is a working draft to extend form features to support data-driven forms and the use of form controls, like buttons, outside of a form. The draft's features are expected to be incorporated into HTML 5.0, but meanwhile the specification is reasonably stable and some browsers already support it. The specification adds a few more elements and attributes with URLs:
| Element | Attribute | URL links to... |
|---|---|---|
| <button> | action | Action to take on a button press |
| <datalist> | data | Data available for a list |
| <form> | data | Data for the form |
| <input> | action | Action to take on an input choice |
| <select> | data | Data for the selection list |
HTML extensions
Over the years, browser makers have proposed new elements and attributes for HTML. Many of these have become part of the standards. Others have not, but remain in use.
Microsoft's HTML Elements documentation lists several non-standard elements that include URLs:
| Element | Attribute | URL links to... |
|---|---|---|
| <bgsound> | src | Audio content |
| <embed> | pluginspage | Plugin to be embedded |
| <img> | dynsrc | Video content |
| <img> | lowsrc | Low-resolution alternative image |
| <input> | dynsrc | Video content |
| <input> | lowsrc | Low-resolution alternative icon image |
| <table> | background | Background image for the table |
| <td> | background | Background image for the data cell |
| <th> | background | Background image for the header cell |
| <xml> | src | XML content |
Microsoft's WebTV adds a few more URL attributes:
| Element | Attribute | URL links to... |
|---|---|---|
| <body> | credits | Credits |
| <body> | instructions | Instructions |
| <body> | logo | Product logo |
Netscape adds a few more (Netscape has been discontinued and its HTML extensions documentation is now gone):
| Element | Attribute | URL links to... |
|---|---|---|
| <html> | background | Background image for the page |
| <ilayer> | src | Content for the inline layer |
| <layer> | src | Content for the layer |
| <div> | src | Content for the layer |
| <div> | href | Destination for the div as an anchor! |
Apple's Safari supports several elements anticipating features in HTML 5 and for compatibility with legacy extensions to HTML. Only one Safari-specific element includes a URL attribute:
| Element | Attribute | URL links to... |
|---|---|---|
| <embed> | code | Data for an embedded object |
Meta elements
The <meta> element is a catch-all for generic information about the web page, such as who created it, when, and why. One important use stores HTTP header information so that a web page saved to a file by a user still has essential information, like the content type, character set encoding, etc.
The HTTP 1.1 specification lists the following HTTP header fields that contain URLs. While use of these varies from rare to very rare, they are possible in a <meta> element and are checked by extract_html_urls(). For all of these, the <meta> element's "http-equiv" attribute contains the HTTP field name and its "content" attribute contains the URL value.
| Field | URL links to... |
|---|---|
| content-base | Base URL for the page |
| content-location | URL for the page |
| referer | URL of the page that linked to the page |
| location | Redirection URL for the page |
A <meta> refresh sets "http-equiv" to "refresh" (even though it isn't an HTTP header field) and the "content" attribute to a refresh time, in seconds, and an optional URL. While refreshes are not standardized, are deprecated, and are usually very bad form, they are in common use. The "content" attribute has one of these forms:
- "123" - refresh the same page in 123 seconds.
- "123; http://example.com" - refresh to example.com in 123 seconds.
- "123; url=http://example.com" - refresh to example.com in 123 seconds.
The extract_html_urls() function handles all of these.
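As a quick check of the three forms, the sketch below runs the same refresh pattern used in the code above against each one. The helper name `refresh_url()` is hypothetical.

```php
<?php
// Hypothetical helper: return the URL in a <meta> refresh "content"
// value, or null when the value holds only a refresh time.
function refresh_url( $content )
{
    $refresh_pattern = '/\d*;\s*(url=)?(.*)$/iu';
    if ( !preg_match( $refresh_pattern, $content, $match ) )
        return null;
    return $match[2];
}

var_dump( refresh_url( '123' ) );                         // No ";", so no URL
var_dump( refresh_url( '123; http://example.com' ) );
var_dump( refresh_url( '123; url=http://example.com' ) );
```

The first form has no URL, so extract_html_urls() adds nothing to its result for it.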
While there are other uses of <meta> element content, there are no formal standards and no widely used de facto standards. So, extract_html_urls() ignores these.
Special cases
There are a few special cases to handle:
- The archive attribute of an <object> may contain a list of URLs, space separated.
- The archive attribute of an <applet> may contain a list of URLs, comma separated (yup, not space separated like in an <object>).
- Any element may include a style attribute that includes CSS code. That code may use @import or url() syntax that each contain a URL. See the article on How to extract URLs from a CSS file for further information.
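A minimal sketch of the two archive list formats, using preg_split(). The helper name `split_archive()` and the sample file names are hypothetical.

```php
<?php
// Hypothetical helper: split an archive attribute value into URLs.
// <applet> lists are comma separated; <object> lists are space separated.
function split_archive( $element_name, $value )
{
    if ( strtolower( $element_name ) == 'applet' )
        $split_pattern = '/\s*,\s*/u';
    else
        $split_pattern = '/\s+/u';
    // PREG_SPLIT_NO_EMPTY drops empty pieces from stray separators
    return preg_split( $split_pattern, trim( $value ), -1, PREG_SPLIT_NO_EMPTY );
}

print_r( split_archive( 'object', 'a.jar  b.jar' ) );
print_r( split_archive( 'applet', 'a.jar, b.jar,c.jar' ) );
```

Note that the space pattern uses `\s+` rather than `\s*`. A pattern that can match the empty string would split the value between every character.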
Appendix B: XHTML standards
XHTML 1.0 in 2000 and XHTML 1.1 in 2001 (with a revised draft in 2007) redefined HTML elements with a more rigorous XML syntax. Most of HTML 4.01's URL-using elements are also available in XHTML 1.1, plus one new element attribute to specify XHTML's XML namespace:
| Element | Attribute | URL links to... |
|---|---|---|
| <html> | xmlns | XHTML name space |
In 2008, an XHTML 2.0 specification is in an early draft stage and lacks sufficient detail yet to make a list of new elements and attributes that use URLs.
Appendix C: WML standards
In 2000, when mobile device makers wanted to add dynamic content to their cell phones, they approached it with a hypertext style that defined multiple "cards" in a "deck" and links between the cards. Their Wireless Markup Language, WML, borrowed some of the syntax of HTML 2.0.
Mobile device "portals" on the web may use WML, and many web browsers support it.
WML 1.2 and 1.3
WML 1.2 and 1.3 support much of HTML 2.0, then add several more elements that may contain URLs or file paths:
| Element | Attribute | URL links to... |
|---|---|---|
| <access> | path | Access limited to other decks with this path |
| <card> | onenterforward | Page to load going forward |
| <card> | onenterbackward | Page to load going backward |
| <card> | ontimer | Page to load after a timer expiration |
| <go> | href | Destination for an anchor |
| <option> | onpick | Page to load after an option is selected |
| <template> | onenterforward | Page to load going forward |
| <template> | onenterbackward | Page to load going backward |
| <template> | ontimer | Page to load after a timer expiration |
DevGuru has a nice WML 1.2 summary.
WML 2.0
WML 2.0 redefined WML atop XHTML 1.0 and added one more element attribute with a URL:
| Element | Attribute | URL links to... |
|---|---|---|
| <wml> | xmlns | WML name space |
The WML 1.3 and 2.0 DTDs are free, but the specifications cost money.
Appendix D: Why not use DOCTYPE?
The DOCTYPE listed on the first line of a web page refers to a DTD (Document Type Definition) that gives a detailed specification of the HTML, XHTML, WML, or whatever syntax is used by the page. Why not use this to find element attributes with URLs, instead of using a giant table of known elements?
- The DOCTYPE is optional. Older content doesn't have it.
- The DOCTYPE refers to an XML DTD that should match the page, but may not. It is common for content to refer to an HTML DTD, but include HTML extensions not found in that DTD.
- DTDs define syntax, not semantics. They say which elements have which attributes, but not what those attributes mean. <meta> elements, for instance, have a content attribute that contains CDATA (character data). That's all the DTD says, but we know that certain <meta> elements contain HTTP header equivalents that include URLs. The DTD doesn't tell us enough to handle these, and many other cases.
Downloads
- extract_html_urls.zip
  - Includes extract_html_urls.php and extract_css_urls.php. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
Further reading
Related articles at NadeauSoftware.com
- PHP tip: How to extract URLs from a CSS file. Extract URLs from CSS text from style sheet files or embedded styles in HTML web pages.
- PHP tip: How to get a web page using CURL. Use PHP’s CURL (Client URL) functions to get a web file, handling web server redirects, compressed content, cookies, and user-agent strings.
- PHP tip: How to get a web page using the fopen wrappers. Use PHP's file reading functions to get a web page, handling web server redirects and user-agent strings.
- PHP tip: How to get a web page’s content type. Get the MIME type and character set from an HTTP header or from the web page content.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.
Web articles and specifications
- HTML 2.0 Specification. The IETF's specification (RFC 1866) from 1995.
- HTML 3.2 Specification. The W3C's specification from 1997.
- HTML 4.01 Specification. The W3C's specification from 1999.
- HTML 5.0 Draft Specification. The W3C's current draft specification.
- Web Forms 2.0 Draft Specification. The WhatWG's current draft specification, though the intent is to roll this into the HTML 5.0 specification.
- XHTML 1.0 Specification. The W3C's specification from 2000, revised in 2002.
- XHTML 1.1 Specification. The W3C's draft specification from 2007. Though technically a draft, the specification is stable.
- XHTML 2.0 Draft Specification. The W3C's current draft specification.
- WML 1.2, 1.3, and 2.0 DTDs. The Open Mobile Alliance (OMA) DTDs. The specifications are only available to OMA members, at a cost, but the DTDs are free.
- Extract All URLs on a Page. This code searches text for "http", "file", and "ftp" URL syntax. This is much simpler than the code above, but it doesn't handle relative URLs and it will pick up URLs in the body text that are not in HTML elements. This makes it of limited use for link checking, spidering, etc.
- PCRE Pattern - Extracting URLs. This article uses PHP regular expressions to search for "http", "https", and "ftp" URL syntax. Like the article above, it will not handle relative URLs and it won't distinguish between URLs in body text vs. in HTML element attributes.

Comments
Bugfix
I found this script very useful, but I found one bug. Sometimes attributes are broken onto multiple lines like the following:
To make this work, change the regex from:
$pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/iu';
to:
$pattern = '/^' . $name . '\s.*' . $attr . $value_pattern . '/siu';
So that the dot can match newlines.
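A small sketch of the difference the /s modifier makes, using a made-up element whose href sits on a second line. The helper name `attr_value()` and the sample element are hypothetical and are not the commenter's original example.

```php
<?php
// Hypothetical helper: match href on an <a> element string using the
// article's pattern with the given modifiers; returns the value or null.
function attr_value( $element, $modifiers )
{
    $value_pattern = '=(("([^"]*)")|([^\s]*))';
    $pattern = '/^a\s.*href' . $value_pattern . '/' . $modifiers;
    if ( !preg_match( $pattern, $element, $match ) )
        return null;
    return empty( $match[3] ) ? $match[4] : $match[3];
}

// Hypothetical element text with the attribute on its own line
$element = "a class=\"nav\"\n   href=\"page.html\"";

var_dump( attr_value( $element, 'iu' ) );   // Dot stops at the newline
var_dump( attr_value( $element, 'siu' ) );  // Dot matches the newline too
```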
Thank you! I am working on a project where I need to extract urls from any given file, such as bookmark files exported from ie or ff as html. As simple as it may seem, I couldn't get this done despite endless attempts. I stumbled upon this post and tried your code, it worked right off the bat. Thank you for your detailed explanation as well!