Though HTML is usually the focus for extracting URLs for a link checker or analysis tool, CSS files also include URLs. The CSS @import rule uses a URL to include another CSS file, and many style properties include a URL to load an image or other content. This tip shows how to scan a CSS file and extract its URLs.
This article stands on its own, but it is also a companion to the PHP tip on How to extract URLs from a web page. Both articles are part of a series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.
Introduction
URL extraction from style sheets is useful for:
- Link checkers to find missing background images and imported files.
- Content filters to check for access to inappropriate content.
- Web page analysis tools to collect file statistics and help optimize CDN use.
Link checking can find missing files and content filters can find files that shouldn't be used. CSS file analysis can highlight when a web site design is using too many files or when those files are not optimally spread across multiple domains or hosted by a Content Distribution Network (CDN).
The code below handles URL extraction for CSS code contained in style sheet files, or embedded within HTML or XHTML content.
Code
The extract_css_urls() function below uses regular expressions to find CSS rules and properties that include URLs. The URLs are returned in an associative array of arrays. Keys to the associative array are "import" for URLs in @import rules and "property" for URLs in properties.
Usage examples and detailed explanations follow in the next sections.
Download: extract_css_urls.zip.
/**
 * Extract URLs from CSS text.
 */
function extract_css_urls( $text )
{
    $urls = array( );

    $url_pattern     = '(([^\\\\\'", \(\)]*(\\\\.)?)+)';
    $urlfunc_pattern = 'url\(\s*[\'"]?' . $url_pattern . '[\'"]?\s*\)';
    $pattern         = '/(' .
        '(@import\s*[\'"]' . $url_pattern . '[\'"])' .
        '|(@import\s*' . $urlfunc_pattern . ')' .
        '|(' . $urlfunc_pattern . ')' . ')/iu';
    if ( !preg_match_all( $pattern, $text, $matches ) )
        return $urls;

    // @import '...'
    // @import "..."
    foreach ( $matches[3] as $match )
        if ( !empty($match) )
            $urls['import'][] =
                preg_replace( '/\\\\(.)/u', '\\1', $match );

    // @import url(...)
    // @import url('...')
    // @import url("...")
    foreach ( $matches[7] as $match )
        if ( !empty($match) )
            $urls['import'][] =
                preg_replace( '/\\\\(.)/u', '\\1', $match );

    // url(...)
    // url('...')
    // url("...")
    foreach ( $matches[11] as $match )
        if ( !empty($match) )
            $urls['property'][] =
                preg_replace( '/\\\\(.)/u', '\\1', $match );

    return $urls;
}
Examples
Read a CSS file using file_get_contents(), extract its URLs, and print them:
$text = file_get_contents( $url );
$urls = extract_css_urls( $text );
print_r( $urls );
Only print the @import URLs:
if ( !empty( $urls['import'] ) )
    print_r( $urls['import'] );
Only print property URLs:
if ( !empty( $urls['property'] ) )
    print_r( $urls['property'] );
Explanation
In CSS 2.0, URLs may be given in just two ways:
- On properties using the url() function.
- In an @import rule to include another style sheet.
Finding CSS url() functions
The url() function takes a single URL argument enclosed in single or double quotes, or no quotes at all. An unquoted URL may not directly include white-space or the characters ( ) , ' and ". To include these, precede them with a backslash or use percent-hex encoding (such as %28 for an open parenthesis).
Examples:
background-image: url(file.gif);
background: url('file.gif') no-repeat;
background: #FF0000 url( "file.gif" ) no-repeat ;
background: #FF0000 url( "file%20space.gif" ) no-repeat;
The extract_css_urls() function above uses preg_match_all() to match "url", an open parenthesis, optional white-space, an optional open quote, a URL, an optional close quote, optional white-space, and a close parenthesis. The URL itself is any sequence of characters except a space, ( ) , ' and ", unless preceded by a backslash.
Backslash escaped characters are un-escaped.
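As a rough illustration of the matching described above, the sketch below uses a simplified pattern (not the exact one in extract_css_urls()) to find url() arguments and un-escape them. The name find_url_functions() is hypothetical and exists only for this example.

```php
<?php
// Simplified sketch: match url( ... ) with optional quotes and
// backslash escapes. find_url_functions() is a hypothetical name.
function find_url_functions( $css )
{
    // A URL is a run of ordinary characters or backslash-escaped
    // characters. The /i modifier accepts URL( and Url( as well.
    $pattern = '/url\(\s*[\'"]?((?:[^\\\\\'"\s(),]|\\\\.)*)[\'"]?\s*\)/i';
    preg_match_all( $pattern, $css, $matches );

    $urls = array( );
    foreach ( $matches[1] as $match )
        if ( $match !== '' )
            // Un-escape: turn "\(" into "(" and so on.
            $urls[] = preg_replace( '/\\\\(.)/', '$1', $match );
    return $urls;
}

$css = 'a { background: url( "file.gif" ) no-repeat; }' . "\n" .
       'b { background: URL(img\(1\).png); }';

// Finds two URLs: file.gif and img(1).png
print_r( find_url_functions( $css ) );
```

The second declaration shows why un-escaping matters: the style sheet stores the parentheses as \( and \), but the file name to request is img(1).png.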
Finding CSS @import rules
An @import rule can have one of two forms:
@import url("file.css") media-list;
@import "file.css" media-list;
The first form uses the url() function, while the second omits it. Both forms may include an optional media list (such as "print" or "screen").
Examples:
@import url(file.css);
@import url('file.css');
@import url( "file.css" ) print;
@import 'file.css' ;
@import "file.css" print, screen;
The extract_css_urls() function above uses preg_match_all() to match "@import", optional white-space, and either a url() function or a URL within quotes. The media list, if any, is ignored.
Backslash escaped characters are un-escaped.
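The two @import forms can be seen side by side in a small sketch. This is not the pattern used by extract_css_urls(): it is a simplified one that skips backslash escapes and, as an assumed extension, also captures the media list. The name find_imports() is hypothetical.

```php
<?php
// Simplified sketch of matching both @import forms.
// find_imports() is a hypothetical helper, not part of extract_css_urls().
// It does not handle backslash escapes inside the URL.
function find_imports( $css )
{
    // @import, then an optional "url(", an optional quote, the URL,
    // an optional quote, an optional ")", and a media list up to ";".
    $pattern = '/@import\s*(?:url\(\s*)?[\'"]?([^\'"\s()]+)[\'"]?(?:\s*\))?\s*([^;]*);/i';
    preg_match_all( $pattern, $css, $matches, PREG_SET_ORDER );

    $imports = array( );
    foreach ( $matches as $m )
        $imports[] = array( 'url' => $m[1], 'media' => trim( $m[2] ) );
    return $imports;
}

$css = "@import url( \"file.css\" ) print;\n" .
       "@import 'other.css' ;\n" .
       "@import \"third.css\" print, screen;";
print_r( find_imports( $css ) );
```

Each result pairs a URL with its media list, which is an empty string when the rule names no media.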
Handling upper and lower case
CSS syntax is case-insensitive, so url() and @import may legitimately appear in upper or mixed case. The extract_css_urls() function uses the /i pattern modifier to match them regardless of case.
Handling character encodings
CSS text defaults to the US-ASCII character encoding. This may be overridden:
- For individual CSS files, the content-type directive in the HTTP header to download the file may specify an alternative character encoding. This is very rare.
- CSS text may include an @charset rule to specify a character encoding. This is very rare.
- CSS text embedded in an X/HTML file adopts the character encoding of that file. The most common HTML encodings are ISO-8859-1 (Latin 1) and UTF-8 (Unicode).
While extract_css_urls() could look for an @charset rule and adapt, it cannot handle the other two cases. Instead, it is up to the application to determine the encoding of the CSS text and use PHP's iconv() to convert to UTF-8. Thereafter, extract_css_urls() uses the /u pattern modifier to handle Unicode character matching. For the US-ASCII default, no conversion is needed since US-ASCII is a subset of UTF-8. Latin-1 text is only safe as-is when it contains nothing but ASCII characters: bytes above 0x7F are not valid UTF-8, and with the /u modifier preg_match_all() fails on invalid input, so extract_css_urls() returns an empty array.
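The conversion step an application performs before URL extraction might look like the following sketch. The function name css_to_utf8(), its $http_charset parameter, and the ISO-8859-1 fallback are all assumptions made for this example; an application should substitute whatever encoding it has actually determined.

```php
<?php
// Sketch: convert CSS text to UTF-8 before URL extraction.
// css_to_utf8() and $http_charset are hypothetical names.
function css_to_utf8( $text, $http_charset = null )
{
    if ( !empty( $http_charset ) )
        $encoding = $http_charset;      // from the HTTP content-type
    else if ( preg_match( '/^@charset\s+"([^"]+)"\s*;/i', $text, $m ) )
        $encoding = $m[1];              // from a leading @charset rule
    else
        $encoding = 'ISO-8859-1';       // assumed fallback, not a standard

    if ( strcasecmp( $encoding, 'UTF-8' ) === 0 )
        return $text;                   // nothing to do

    // iconv() returns false for an unknown encoding name or bad input.
    // In that case this sketch hands back the text unchanged.
    $converted = @iconv( $encoding, 'UTF-8', $text );
    return ( $converted === false ) ? $text : $converted;
}

// The Latin-1 byte 0xE9 (e with acute accent) becomes 0xC3 0xA9 in UTF-8.
$latin1 = "a { background: url(caf\xE9.gif); }";
$utf8   = css_to_utf8( $latin1, 'ISO-8859-1' );
```

Embedded CSS would pass the encoding of the enclosing X/HTML file as the first choice, since the style text itself carries no encoding information in that case.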
Using returned URLs
All URLs are returned in an associative array of arrays. Outer array keys are "import" for @import URLs, and "property" for all others.
Returned URLs may be absolute or relative, depending upon how they were entered in the CSS text. Applications will need to use the file's base URL to convert relative URLs into absolute URLs if they intend to use them for link checking or other analysis.
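A minimal sketch of that conversion follows. It is a simplified resolver, not a full implementation of the URL resolution rules: it assumes the base URL has a scheme and host, and it ignores query strings and fragments. The name resolve_css_url() and the example.com URLs are hypothetical.

```php
<?php
// Simplified sketch: resolve a URL found in CSS against the style
// sheet's base URL. resolve_css_url() is a hypothetical helper.
function resolve_css_url( $base, $rel )
{
    // Already absolute (has a scheme such as http: or data:).
    if ( preg_match( '/^[a-z][a-z0-9+.\-]*:/i', $rel ) )
        return $rel;

    $b    = parse_url( $base );
    $root = $b['scheme'] . '://' . $b['host'] .
            ( isset( $b['port'] ) ? ':' . $b['port'] : '' );

    // Protocol-relative: //host/path
    if ( substr( $rel, 0, 2 ) === '//' )
        return $b['scheme'] . ':' . $rel;

    // Root-relative paths start at the host. All others start at the
    // directory that holds the style sheet.
    if ( substr( $rel, 0, 1 ) === '/' )
        $path = $rel;
    else
    {
        $base_path = isset( $b['path'] ) ? $b['path'] : '/';
        $dir  = substr( $base_path, 0, strrpos( $base_path, '/' ) + 1 );
        $path = $dir . $rel;
    }

    // Collapse "." and ".." segments without climbing above the root.
    $out = array( );
    foreach ( explode( '/', $path ) as $seg )
    {
        if ( $seg === '..' )
        {
            if ( count( $out ) > 1 )
                array_pop( $out );
        }
        else if ( $seg !== '.' )
            $out[] = $seg;
    }
    return $root . implode( '/', $out );
}

$base = 'http://example.com/css/site.css';
echo resolve_css_url( $base, '../img/bg.gif' ), "\n"; // http://example.com/img/bg.gif
echo resolve_css_url( $base, 'print.css' ), "\n";     // http://example.com/css/print.css
echo resolve_css_url( $base, '/a/b.png' ), "\n";      // http://example.com/a/b.png
```

Relative URLs in a style sheet are resolved against the location of the style sheet itself, not the page that links to it, so the base passed in should be the CSS file's URL. For CSS embedded in an X/HTML page, the base is the page's URL.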
Downloads
- extract_css_urls.zip
  - Includes extract_css_urls.php. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
Further reading
Related articles at NadeauSoftware.com
- PHP tip: How to extract URLs from a web page. Extract URLs from HTML and XHTML web pages and use the extract_css_urls() function in this article to extract URLs from embedded CSS styles.
- PHP tip: How to get a web page using CURL. Use PHP's CURL (Client URL) functions to get a web file, handling web server redirects, compressed content, cookies, and user-agent strings.
- PHP tip: How to get a web page using the fopen wrappers. Use PHP's file reading functions to get a web page, handling web server redirects and user-agent strings.
- PHP tip: How to get a web page's content type. Get the MIME type and character set from an HTTP header or from the web page content.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.
Web articles and specifications
- Cascading Style Sheets Level 2, CSS 2 Specification. The W3C specification for CSS 2.0.
