PHP tip: How to extract URLs from a CSS file

Technologies: CSS 1+, PHP 4.3+, UTF-8

Though HTML is usually the focus for extracting URLs for a link checker or analysis tool, CSS files also include URLs. The CSS @import rule uses a URL to include another CSS file, and many style properties include a URL to load an image or other content. This tip shows how to scan a CSS file and extract its URLs.

This article stands alone, but it is also a companion to the PHP tip on How to extract URLs from a web page. Both are part of a series on How to extract keywords from a web page. The rest of the series covers how to get a web page from a web server, get the page's content type, convert it to UTF-8, strip away HTML syntax, punctuation, symbol characters, and numbers, and break the page down into a keyword list.

Introduction

URL extraction from style sheets is useful for:

  • Link checkers to find missing background images and imported files.
  • Content filters to check for access to inappropriate content.
  • Web page analysis tools to collect file statistics and help optimize CDN use.

Link checking can find missing files and content filters can find files that shouldn't be used. CSS file analysis can highlight when a web site design is using too many files or when those files are not optimally spread across multiple domains or hosted by a Content Distribution Network (CDN).

The code below handles URL extraction for CSS code contained in style sheet files, or embedded within HTML or XHTML content.

Code

The extract_css_urls() function below uses regular expressions to find CSS rules and properties that include URLs. The URLs are returned in an associative array of arrays. Keys to the associative array are "import" for URLs in @import rules and "property" for URLs in properties.

Usage examples and detailed explanations follow in the next sections.

Download: extract_css_urls.zip.

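/**
 * Extract URLs from CSS text.
 *
 * @param  string $text  CSS text, encoded as UTF-8.
 * @return array         URLs grouped under the keys "import" and
 *                       "property"; a key is absent if no such URLs
 *                       were found.
 */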
function extract_css_urls( $text )
{
    $urls = array( );
 
    $url_pattern     = '(([^\\\\\'", \(\)]*(\\\\.)?)+)';
    $urlfunc_pattern = 'url\(\s*[\'"]?' . $url_pattern . '[\'"]?\s*\)';
    $pattern         = '/(' .
         '(@import\s*[\'"]' . $url_pattern     . '[\'"])' .
        '|(@import\s*'      . $urlfunc_pattern . ')'      .
        '|('                . $urlfunc_pattern . ')'      .  ')/iu';
    if ( !preg_match_all( $pattern, $text, $matches ) )
        return $urls;
 
    // @import '...'
    // @import "..."
    foreach ( $matches[3] as $match )
        if ( !empty($match) )
            $urls['import'][] = 
                preg_replace( '/\\\\(.)/u', '\\1', $match );
 
    // @import url(...)
    // @import url('...')
    // @import url("...")
    foreach ( $matches[7] as $match )
        if ( !empty($match) )
            $urls['import'][] = 
                preg_replace( '/\\\\(.)/u', '\\1', $match );
 
    // url(...)
    // url('...')
    // url("...")
    foreach ( $matches[11] as $match )
        if ( !empty($match) )
            $urls['property'][] = 
                preg_replace( '/\\\\(.)/u', '\\1', $match );
 
    return $urls;
}

Examples

Read a CSS file using file_get_contents(), extract its URLs, and print them. Here $url is assumed to hold the path or URL of the CSS file:

$text = file_get_contents( $url );
$urls = extract_css_urls( $text );
print_r( $urls );

Only print the @import URLs:

if ( !empty( $urls['import'] ) )
    print_r( $urls['import'] );

Only print property URLs:

if ( !empty( $urls['property'] ) )
    print_r( $urls['property'] );

Explanation

In CSS 2.0, URLs may be given in just two ways:

  • On properties using the url() function.
  • In an @import rule to include another style sheet.

Finding CSS url() functions

The url() function takes a single URL argument enclosed in single quotes, double quotes, or no quotes at all. An unquoted URL may not include a literal parenthesis, comma, white-space character, single quote, or double quote. To include one of these, precede it with a backslash or use percent-hex encoding (such as %28 for an open parenthesis).

Examples:

background-image: url(file.gif);
background: url('file.gif') no-repeat;
background: #FF0000 url( "file.gif" ) no-repeat  ;
background: #FF0000 url( "file%20space.gif" ) no-repeat;

The extract_css_urls() function above uses preg_match_all() to match "url", an open parenthesis, optional white-space, an optional open quote, a URL, an optional close quote, optional white-space, and a close parenthesis. The URL itself is any sequence of characters except ( ) , ' " and space, unless preceded by a backslash. The function applies this rule to quoted and unquoted URLs alike.

Backslash escaped characters are un-escaped.
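The matching and un-escaping steps described above can be sketched in isolation. This is a simplified illustration, not the pattern used by extract_css_urls(): the function name find_url_functions() is hypothetical, the code assumes PHP 7 or later, and it treats @import url(...) the same as any other url().

```php
<?php
// Simplified sketch (hypothetical helper, not the article's function):
// find url(...) occurrences and un-escape the captured URLs.
function find_url_functions(string $css): array
{
    // "url(", optional white-space, optional quote, the URL,
    // optional quote, optional white-space, ")".
    // The URL is a run of ordinary characters or backslash-escaped ones.
    $pattern = '/url\(\s*[\'"]?((?:[^\\\\\'", ()]|\\\\.)*)[\'"]?\s*\)/iu';

    if (!preg_match_all($pattern, $css, $matches)) {
        return [];
    }

    $urls = [];
    foreach ($matches[1] as $raw) {
        if ($raw !== '') {
            // Turn "\(" into "(", "\," into ",", and so on.
            $urls[] = preg_replace('/\\\\(.)/u', '$1', $raw);
        }
    }
    return $urls;
}

$css = 'a { background: url( "file.gif" ) no-repeat } '
     . 'b { background: url(my\\(1\\).png) }';
print_r(find_url_functions($css));
```

The second rule in the sample writes its parentheses as \( and \), which the un-escape step converts back to plain parentheses.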

Finding CSS @import rules

An @import rule can have one of two forms:

  • @import url("file.css") media-list;
  • @import "file.css" media-list;

The first form uses the url() function, while the second omits it. Both forms may include an optional media list (such as "print" or "screen").

Examples:

@import url(file.css);
@import url('file.css');
@import url(  "file.css"  ) print;
@import 'file.css'  ;
@import   "file.css"   print, screen;

The extract_css_urls() function above uses preg_match_all() to match "@import", optional white-space, and either a url() function or a URL within quotes.

Backslash escaped characters are un-escaped.
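The two @import forms can also be matched on their own. The sketch below is a simplified illustration under the same assumptions as the article (UTF-8 input, media lists ignored); find_import_urls() is a hypothetical name and the code assumes PHP 7 or later.

```php
<?php
// Simplified sketch (hypothetical helper): collect URLs from both
// @import forms, ignoring any media list that follows.
function find_import_urls(string $css): array
{
    // A URL: ordinary characters or backslash-escaped characters.
    $url = '((?:[^\\\\\'", ()]|\\\\.)*)';

    // Group 1: @import url(...) form. Group 2: @import "..." form.
    $pattern = '/@import\s*(?:url\(\s*[\'"]?' . $url . '[\'"]?\s*\)'
             . '|[\'"]' . $url . '[\'"])/iu';

    if (!preg_match_all($pattern, $css, $matches, PREG_SET_ORDER)) {
        return [];
    }

    $urls = [];
    foreach ($matches as $m) {
        // Only one of the two groups is non-empty for a given match.
        $raw = ($m[1] ?? '') !== '' ? $m[1] : ($m[2] ?? '');
        if ($raw !== '') {
            $urls[] = preg_replace('/\\\\(.)/u', '$1', $raw);
        }
    }
    return $urls;
}

$css = "@import url(a.css);\n"
     . "@import 'b.css' print;\n"
     . "@IMPORT   \"c.css\"   print, screen;";
print_r(find_import_urls($css));
```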

Handling upper and lower case

CSS syntax is case-insensitive, so url() and @import may be written in upper, lower, or mixed case. The extract_css_urls() function uses the /i pattern modifier to accept all of these. The case of the extracted URLs themselves is left unchanged.

Handling character encodings

CSS text defaults to the US-ASCII character encoding. This may be overridden:

  • For individual CSS files, the charset parameter of the Content-Type HTTP header sent with the file may specify an alternative character encoding. This is very rare.
  • CSS text may include an @charset rule to specify a character encoding. This is very rare.
  • CSS text embedded in an X/HTML file adopts the character encoding of that file. The most common HTML encodings are ISO-8859-1 (Latin 1) and UTF-8 (Unicode).

While extract_css_urls() could look for an @charset rule and adapt, it cannot handle the other two cases. Instead, it is up to the application to determine the encoding of the CSS text and use PHP's iconv() to convert to UTF-8. Thereafter, extract_css_urls() uses the /u pattern modifier to handle Unicode character matching. For the US-ASCII default, no special handling is needed since US-ASCII is a subset of UTF-8. Latin-1 is not: any Latin-1 text containing bytes above 0x7F must be converted first, because the /u modifier makes preg_match_all() fail on invalid UTF-8 and the function then returns an empty array.
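One way an application might prepare CSS text before calling extract_css_urls() is sketched below. The function name css_text_to_utf8() and the Latin-1 fallback are assumptions made for illustration, and the code requires PHP 7.1 or later with the iconv extension. The optional $charset argument stands for whatever the application learned from the HTTP header or the enclosing HTML file.

```php
<?php
// Hypothetical helper: return CSS text converted to UTF-8.
// $charset is the encoding the application already knows about
// (for example from the HTTP Content-Type header), or null.
function css_text_to_utf8(string $text, ?string $charset = null): string
{
    // No charset supplied: look for an @charset rule at the very start.
    if ($charset === null
        && preg_match('/^@charset\s+"([^"]+)"\s*;/i', $text, $m)) {
        $charset = $m[1];
    }

    if ($charset === null) {
        // An empty pattern with /u matches only if the subject is valid
        // UTF-8 (pure US-ASCII is always valid), so nothing needs doing.
        if (preg_match('//u', $text) === 1) {
            return $text;
        }
        // Assumption for this sketch: fall back to Latin-1.
        $charset = 'ISO-8859-1';
    }

    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $text;
    }

    // iconv() returns false (with a warning) on an unknown charset;
    // in that case the text is returned unchanged.
    $converted = @iconv($charset, 'UTF-8', $text);
    return $converted === false ? $text : $converted;
}

// A Latin-1 "é" (byte 0xE9) becomes the two-byte UTF-8 sequence.
echo bin2hex(css_text_to_utf8("caf\xE9.gif")), "\n";
```

Guessing Latin-1 when nothing else is known is a design choice of this sketch only; an application that knows the encoding of the enclosing page should pass it in explicitly.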

Using returned URLs

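All URLs are returned in an associative array of arrays. Outer array keys are "import" for @import URLs, and "property" for all others. A key is present only when at least one URL of that kind was found, so check with !empty() before using it, as in the examples above.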
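When the distinction between the two groups does not matter, the result can be flattened into a single list. The helper below is a hypothetical convenience, not part of the article's download; it assumes PHP 7 or later and tolerates either key being absent.

```php
<?php
// Hypothetical helper: merge the "import" and "property" lists into one
// list of unique URLs, preserving first-seen order.
function flatten_css_urls(array $urls): array
{
    $all = array_merge($urls['import'] ?? [], $urls['property'] ?? []);

    // array_unique() keeps the first occurrence of each value;
    // array_values() renumbers the keys from zero.
    return array_values(array_unique($all));
}

// Sample data in the shape returned by extract_css_urls().
$urls = [
    'import'   => ['base.css'],
    'property' => ['bg.gif', 'base.css', 'bg.gif'],
];
print_r(flatten_css_urls($urls));
```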

Returned URLs may be absolute or relative, depending upon how they were entered in the CSS text. Applications will need to use the file's base URL to convert relative URLs into absolute URLs if they intend to use them for link checking or other analysis.
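A minimal sketch of that conversion follows. The function resolve_css_url() is a hypothetical name, the example.com addresses are placeholders, and the code assumes PHP 7 or later. It covers the common cases (absolute, protocol-relative, root-relative, and path-relative URLs with "." and ".." segments) but is not a complete implementation of the RFC 3986 resolution algorithm.

```php
<?php
// Hypothetical helper: resolve a URL found in a style sheet against the
// URL of the style sheet itself. Simplified; not full RFC 3986.
function resolve_css_url(string $base, string $url): string
{
    // Already absolute (has a scheme such as http: or data:).
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url)) {
        return $url;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;    // Unusable base: leave the URL as it is.
    }

    // Protocol-relative: borrow only the scheme from the base.
    if (strpos($url, '//') === 0) {
        return $parts['scheme'] . ':' . $url;
    }

    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Set aside any query string or fragment so only the path is changed.
    $cut    = strcspn($url, '?#');
    $suffix = substr($url, $cut);
    $url    = substr($url, 0, $cut);

    if ($url !== '' && $url[0] === '/') {
        $path = $url;                                   // Root-relative.
    } else {
        $basePath = $parts['path'] ?? '/';
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $url;                            // Path-relative.
    }

    // Remove "." segments and apply ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $origin . '/' . ltrim(implode('/', $out), '/') . $suffix;
}

$base = 'http://example.com/css/site.css';
echo resolve_css_url($base, '../img/bg.gif?v=2'), "\n";
```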

Nadeau software consulting