PHP tip: How to convert a relative URL to an absolute URL

Technologies: PHP 5.2+ (the code uses mb_strrchr( ) from the multibyte string extension)

An absolute URL is complete and ready to use to download a web file. But web pages often include incomplete relative URLs with missing parts, such as an "http" or host name, or the first part of a file path. These parts need to be filled in by copying them from a base absolute URL. This article shows how and includes code to do it.

Introduction

An absolute URL like "http://example.com/index.htm" tells a web browser what file to get ("index.htm"), where to get it (from "example.com"), and how to get it (via an HTTP web server). But a relative URL like "logo.png" is missing important parts, like "http", the host name, and the first part of a path. Without these parts, the URL can't be used to get a file.

To use a relative URL, you have to fill in the missing parts by copying them from another URL. This other URL, or base URL, must be absolute and it is often the URL of the web page containing the relative URL. The base URL also can be set by a <base> tag within the page, by the Content-Location field in the web server's HTTP response header, and (rarely) by a special <meta> tag.

The URL specification defines an "absolutize" algorithm for combining an absolute base URL with a relative URL to create a new absolute URL. This article explains the algorithm's steps and implements it in PHP.

The code below requires the split_url( ) and join_url( ) functions from a companion article:

PHP tip: How to parse and build URLs. Split URLs into their component parts, and reassemble those parts into a complete URL.

Code

Let's go straight to the code first. Explanations follow in the next sections.

Download url_to_absolute.zip.

This url_to_absolute( ) function's arguments include an absolute base URL and a relative URL. Missing parts of the relative URL are copied from the base URL to form a new absolute URL returned by the function. FALSE is returned if either URL can't be parsed or if the base URL isn't absolute.

function url_to_absolute( $baseUrl, $relativeUrl )
{
    // If relative URL has a scheme, clean path and return.
    $r = split_url( $relativeUrl );
    if ( $r === FALSE )
        return FALSE;
    if ( !empty( $r['scheme'] ) )
    {
        if ( !empty( $r['path'] ) && $r['path'][0] == '/' )
            $r['path'] = url_remove_dot_segments( $r['path'] );
        return join_url( $r );
    }
 
    // Make sure the base URL is absolute.
    $b = split_url( $baseUrl );
    if ( $b === FALSE || empty( $b['scheme'] ) || empty( $b['host'] ) )
        return FALSE;
    $r['scheme'] = $b['scheme'];
 
    // If relative URL has an authority, clean path and return.
    if ( isset( $r['host'] ) )
    {
        if ( !empty( $r['path'] ) )
            $r['path'] = url_remove_dot_segments( $r['path'] );
        return join_url( $r );
    }
    unset( $r['port'] );
    unset( $r['user'] );
    unset( $r['pass'] );
 
    // Copy base authority.
    $r['host'] = $b['host'];
    if ( isset( $b['port'] ) ) $r['port'] = $b['port'];
    if ( isset( $b['user'] ) ) $r['user'] = $b['user'];
    if ( isset( $b['pass'] ) ) $r['pass'] = $b['pass'];
 
    // If relative URL has no path, use base path
    if ( empty( $r['path'] ) )
    {
        if ( !empty( $b['path'] ) )
            $r['path'] = $b['path'];
        if ( !isset( $r['query'] ) && isset( $b['query'] ) )
            $r['query'] = $b['query'];
        return join_url( $r );
    }
 
    // If relative URL path doesn't start with /, merge with base path
    if ( $r['path'][0] != '/' )
    {
        $base = mb_strrchr( $b['path'], '/', TRUE, 'UTF-8' );
        if ( $base === FALSE ) $base = '';
        $r['path'] = $base . '/' . $r['path'];
    }
    $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

The url_remove_dot_segments( ) function is used as a last step above to filter out "." and ".." segments from the returned URL's path. The function's only argument is the path to filter. The filtered path is returned.

function url_remove_dot_segments( $path )
{
    // multi-byte character explode
    $inSegs  = preg_split( '!/!u', $path );
    $outSegs = array( );
    foreach ( $inSegs as $seg )
    {
        if ( $seg == '' || $seg == '.')
            continue;
        if ( $seg == '..' )
            array_pop( $outSegs );
        else
            array_push( $outSegs, $seg );
    }
    $outPath = implode( '/', $outSegs );
    if ( $path[0] == '/' )
        $outPath = '/' . $outPath;
    // compare last multi-byte character against '/'
    if ( $outPath != '/' &&
        (mb_strlen( $path, 'UTF-8' )-1) == mb_strrpos( $path, '/', 0, 'UTF-8' ) )
        $outPath .= '/';
    return $outPath;
}

Examples

Combine a base URL and a relative URL:

$newUrl = url_to_absolute(
    "http://example.com/products/index.htm",
    "./product.png" );
print( "$newUrl\n" );

Prints:

http://example.com/products/product.png

Extract URLs from a web page and convert each one to an absolute URL (see the article on How to extract URLs from a web page for the extract_html_urls( ) function):

// Get the web page text
$text = file_get_contents( $baseUrl );
 
// Extract URLs and convert to a single list
$groupedUrls = extract_html_urls( $text );
$pageUrls    = array( );
foreach ( $groupedUrls as $element_entry )
    foreach ( $element_entry as $attribute_entry )
        $pageUrls = array_merge( $pageUrls, $attribute_entry );
 
// Convert each URL to absolute
$n = count( $pageUrls );
for ( $i = 0; $i < $n; $i++ )
    $pageUrls[$i] = url_to_absolute( $baseUrl, $pageUrls[$i] );

Explanation

The familiar format of a URL is officially defined by the Internet Engineering Task Force in specification RFC3986. It goes like this (the parts within brackets are optional):

[scheme ":"] scheme-specific-part ["?" query] ["#" fragment]

For example, in this URL:

http://example.com/products/index.htm?sku=1234#section42

"http" is the scheme, "//example.com/products/index.htm" is the scheme-specific-part, "sku=1234" is the query, and "section42" is the fragment.

The scheme is a standard name, like "http", that indicates how to interpret and use the rest of the URL. The Internet Assigned Numbers Authority (IANA) has specifications for over 60 different URL schemes including "http", "ftp", "file", "mailto", "news", "pop", "snmp", and many more.

The scheme-specific-part is just that — it's scheme-specific. Different schemes expect different information here, such as a file path for the "http" scheme, an email address for the "mailto" scheme, or a news group name for the "news" scheme.

The query part of a URL usually contains parameters for a database query. The format is not standardized, so web sites can define queries any way they like.

A fragment at the end of a URL is usually the name of a subsection of the content. For web pages, this is the name of an anchor on the page and it's often used to mark sections of the page.

All schemes can be divided into two types: hierarchical and nonhierarchical (also called opaque). A hierarchical scheme's scheme-specific-part contains a path with a series of words separated by slashes. For schemes like "http", this path often selects a file, such as "/products/index.htm". A nonhierarchical scheme's URL, however, contains other information, such as an email address for the "mailto" scheme.

For this article, we only care about hierarchical schemes like "http", "ftp", and "file". Nonhierarchical schemes cannot have relative URLs.

So, for hierarchical schemes the scheme-specific-part contains a path, preceded by an authority like this:

[scheme ":"] ["//" authority] [path] ["?" query] ["#" fragment]

And the authority part's format is:

[user [":" pass] "@"] host [":" port]

The authority includes a host name or IP address (v4 or v6). Optionally, the host may be followed by a port number (80 for a web server). Some schemes support a user name and password in front, but this is rare (and including a password in a URL is not very secure either).
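
As a rough illustration of that layout, the sketch below pulls the user, password, host, and port out of an authority string with a single regular expression. The function name sketch_split_authority( ) is made up for this example and is not the split_url( ) function from the companion article. It assumes the authority is still percent-encoded and it does no validation.

```php
<?php
// Hypothetical helper for illustration only; not the companion
// article's split_url( ). Splits "[user[:pass]@]host[:port]".
function sketch_split_authority( $authority )
{
    // Optional "user[:pass]@", then a bracketed IPv6 literal or a
    // plain host, then an optional ":port".
    $pattern = '!^(?:([^:@]*)(?::([^@]*))?@)?(\[[^\]]*\]|[^:]*)(?::([0-9]*))?$!';
    if ( !preg_match( $pattern, $authority, $m ) )
        return FALSE;

    // Only the host is always present; add the other parts if found.
    $parts = array( 'host' => $m[3] );
    if ( isset( $m[1] ) && $m[1] !== '' ) $parts['user'] = $m[1];
    if ( isset( $m[2] ) && $m[2] !== '' ) $parts['pass'] = $m[2];
    if ( isset( $m[4] ) && $m[4] !== '' ) $parts['port'] = $m[4];
    return $parts;
}

print_r( sketch_split_authority( 'user:secret@example.com:8080' ) );
print_r( sketch_split_authority( '[::1]:80' ) );
```

The first call yields all four parts. The second yields only a host of "[::1]" and a port of "80", since the brackets keep the colons inside an IPv6 address from being mistaken for the port separator.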

An absolute URL always has a scheme and it usually has an authority and path. A relative URL does not have a scheme and it may be missing some or all of the other parts of a URL. If the path of a relative URL doesn't start with a slash, then it is also missing the first part of its path.
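
To see which parts a given URL actually has, here is a minimal sketch that splits a URL with the regular expression given in Appendix B of RFC3986. The function name sketch_split_generic( ) is hypothetical and this is not the split_url( ) function used elsewhere in this article: it does no percent-decoding, and for simplicity it treats an empty part the same as a missing one.

```php
<?php
// Hypothetical helper for illustration only. The pattern is the one
// given in RFC 3986, Appendix B.
function sketch_split_generic( $url )
{
    $pattern = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
    preg_match( $pattern, $url, $m );

    // Capture group numbers for each named part.
    $names = array( 2 => 'scheme', 4 => 'authority', 5 => 'path',
                    7 => 'query',  9 => 'fragment' );
    $parts = array( );
    foreach ( $names as $i => $name )
        if ( isset( $m[$i] ) && $m[$i] !== '' )
            $parts[$name] = $m[$i];
    return $parts;
}

// An absolute URL has a scheme; a relative URL does not.
$abs = sketch_split_generic( 'http://example.com/products/index.htm?sku=1234#section42' );
$rel = sketch_split_generic( '../logo.png' );
var_dump( isset( $abs['scheme'] ) );   // bool(true)
var_dump( isset( $rel['scheme'] ) );   // bool(false)
```

For the relative URL, the only part found is the path "../logo.png". Everything else has to come from the base URL.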

The point of this article, of course, is how to fill in the missing parts of a relative URL to make it an absolute URL.

Absolutizing a relative URL

When parts of a relative URL are missing, the RFC3986 specification explains how to copy them from an absolute base URL. Typically, the base URL is the URL of the web page containing the relative URL. There are several other ways to get a base URL:

  • In the web server's HTTP header when downloading the page:
    • The Content-Location field contains the base URL.

  • In the web page's <head> section:
    • The optional <base> tag's href field contains the base URL.
    • The optional <meta> tag's http-equiv attribute may contain a Content-Location field that contains the base URL.

  • Within the web page's body:
    • An <applet> tag's codebase attribute may contain the base URL for that applet only.
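
As a sketch of the most common of these cases, the code below looks for a <base> tag in the page text and falls back to the page's own URL when there isn't one. The function name sketch_find_base_url( ) is hypothetical. It assumes the href value is quoted, and it ignores the Content-Location header and the <meta> and <applet> cases listed above. A regular expression is a shortcut here, not a full HTML parser.

```php
<?php
// Hypothetical helper for illustration only: return the href of the
// first <base> tag in $html, or $pageUrl if there is none.
function sketch_find_base_url( $pageUrl, $html )
{
    // Match <base ... href="..."> or <base ... href='...'>.
    $pattern = '!<base\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1!is';
    if ( preg_match( $pattern, $html, $m ) && $m[2] !== '' )
        return $m[2];
    return $pageUrl;
}

$html = '<html><head><base href="http://example.com/docs/"></head></html>';
print( sketch_find_base_url( 'http://example.com/index.htm', $html ) . "\n" );
// http://example.com/docs/
```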

The URL specification's "absolutize" algorithm for relative URLs does the following:

  • Copies missing parts to the relative URL.
  • Concatenates the base and relative URL paths if needed.
  • Removes all "." path segments.
  • Collapses all ".." path segments.

This is pretty straightforward but there are some things to watch out for:

  • To copy and merge URL parts, you first have to split apart the base and relative URLs. PHP's standard parse_url( ) function can do this, but it has problems with complex and relative URLs. Instead, use the split_url( ) function from my article on How to parse and build URLs.
  • After URLs are split apart, you have to expand percent-encoding into actual characters. For example, this converts a %20 into a space. This has to be done after splitting the URL so that percent-encodings of special URL characters like @ : ? # don't confuse the URL splitting. If you use my split_url( ) function, decoding is done for you. If you use PHP's parse_url( ) instead, you'll need to decode the host, user, pass, path, query, and fragment parts yourself using PHP's standard rawurldecode( ). Do not use PHP's similar urldecode( ), which does not follow the URL specification and can garble some URLs.
  • When copying missing parts, the base URL's fragment is never copied. The base URL's query part is only copied when the relative URL has no path at all (which is rare).
  • A URL, after expanding percent-encoding, may contain multibyte UTF-8 characters (see RFC2718). String processing on the URL's parts must use PHP's multibyte character-aware functions. These all have names starting with "mb". For instance, this code uses mb_strrchr( ) to find the substring of the path up to the last slash. The standard preg_* functions also support UTF-8 when the "u" pattern modifier is included.
  • After you've got all the right URL parts, you have to percent-encode special characters. Only some parts of a URL can have percent-encoded characters, so you have to do this before reassembling the URL. Encode the host, user, pass, path, query, and fragment parts, but don't touch the scheme and port. Also, only encode the host part if it is a name, not an IPv4 or IPv6 address. Be sure to use PHP's rawurlencode( ) but not the incorrect urlencode( ) which does not follow the URL specification.
  • Finally, join the parts together again into a complete URL. If you use the join_url( ) function from my article on How to parse and build URLs, it also handles percent-encoding for you.
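
To make the decoding and encoding advice concrete, this short example compares the "raw" functions with their look-alikes on a path that contains "+" characters. The file name is made up for the example. Note that the encoding functions are applied to a single path segment, since encoding a whole path would also encode its slashes.

```php
<?php
// A "+" is an ordinary character in a URL path. Only rawurldecode( )
// leaves it alone; urldecode( ) turns it into a space.
$path = '/docs/c++%20notes.txt';
print( rawurldecode( $path ) . "\n" );     // /docs/c++ notes.txt
print( urldecode( $path ) . "\n" );        // /docs/c   notes.txt

// Going the other way, rawurlencode( ) writes a space as %20,
// while urlencode( ) writes it as "+".
$segment = 'c++ notes.txt';
print( rawurlencode( $segment ) . "\n" );  // c%2B%2B%20notes.txt
print( urlencode( $segment ) . "\n" );     // c%2B%2B+notes.txt
```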

So, here are the algorithm's steps:

Step 1: split the relative URL into an associative array of parts.

$r = split_url( $relativeUrl );
if ( $r === FALSE )
    return FALSE;

Step 2: check if the relative URL is already absolute. If it is, update its path to remove "." and ".." using the url_remove_dot_segments( ) function discussed later in this article. Then rebuild the URL and return it.

if ( !empty( $r['scheme'] ) )
{
    if (!empty( $r['path'] ) && $r['path'][0] == '/' )
        $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

Step 3: split the base URL into its parts and make sure it is absolute (it must have at least a scheme and host). If it is absolute, copy its scheme to the relative URL.

$b = split_url( $baseUrl );
if ( $b === FALSE || empty( $b['scheme'] ) || empty( $b['host'] ) )
    return FALSE;
$r['scheme'] = $b['scheme'];

Step 4: check if the relative URL has a host part. If it does, the rest of the relative URL is complete. There's nothing more to copy from the base URL. Update the relative URL's path to remove "." and "..", then rebuild the URL and return it.

if ( isset( $r['host'] ) )
{
    if ( !empty( $r['path'] ) )
        $r['path'] = url_remove_dot_segments( $r['path'] );
    return join_url( $r );
}

Step 5: clear any port, user, or password left on the relative URL, then copy the authority parts from the base URL to the relative URL.

unset( $r['port'] );
unset( $r['user'] );
unset( $r['pass'] );
$r['host'] = $b['host'];
if ( isset( $b['port'] ) ) $r['port'] = $b['port'];
if ( isset( $b['user'] ) ) $r['user'] = $b['user'];
if ( isset( $b['pass'] ) ) $r['pass'] = $b['pass'];

Step 6: if the relative URL doesn't have a path (rare), then use the base URL's path and query. Since that base URL's path is already absolute, it should already have "." and ".." removed. So, rebuild the URL and return it.

if ( empty( $r['path'] ) )
{
    if ( !empty( $b['path'] ) )
        $r['path'] = $b['path'];
    if ( !isset( $r['query'] ) && isset( $b['query'] ) )
        $r['query'] = $b['query'];
    return join_url( $r );
}

Step 7: if the relative URL's path doesn't start with a slash, then merge the first part of the base URL's path (up to the last slash) with the relative URL's path. Note the use of mb_strrchr( ) for multibyte character string handling.

if ( $r['path'][0] != '/' )
{
    $base = mb_strrchr( $b['path'], '/', TRUE, 'UTF-8' );
    if ( $base === FALSE ) $base = '';
    $r['path'] = $base . '/' . $r['path'];
}
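
To show what the merge produces before any dot segments are removed, here is the same logic wrapped in a small function with a few sample paths. The name sketch_merge_paths( ) is hypothetical, and the sketch assumes PHP's multibyte string extension is enabled.

```php
<?php
// Hypothetical wrapper around the Step 7 merge, for illustration only.
// Assumes the mbstring extension is enabled.
function sketch_merge_paths( $basePath, $relativePath )
{
    // Keep the base path up to, but not including, its last "/".
    $base = mb_strrchr( $basePath, '/', TRUE, 'UTF-8' );
    if ( $base === FALSE ) $base = '';
    return $base . '/' . $relativePath;
}

print( sketch_merge_paths( '/products/index.htm', './product.png' ) . "\n" );
// /products/./product.png
print( sketch_merge_paths( '/directory/sub_directory/index.htm', '../product.png' ) . "\n" );
// /directory/sub_directory/../product.png
print( sketch_merge_paths( '/index.htm', 'logo.png' ) . "\n" );
// /logo.png
```

The merged paths still contain their "." and ".." segments. Cleaning those up is the job of the next step.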

Step 8: update the path to remove "." and "..", rebuild the URL, and return it.

$r['path'] = url_remove_dot_segments( $r['path'] );
return join_url( $r );

Removing dot segments

Paths are built from a series of segments (often folder names) separated by slashes. Segments named "." and ".." have special meanings. If you think of a file path as a series of steps downward into folders, then:

  • A "." segment means "stay here".
  • A ".." segment means "go back one folder".

A "." is always redundant and can be safely removed. The path "/products/./index.htm" is the same as "/products/index.htm".

A ".." can be removed along with the segment before it. The paths "/products/../logo.png" and "/logo.png" are equivalent.

The URL specification explains how to remove "." and ".." by scanning the path character by character. You can do this more simply by splitting the path into segments at slashes and scanning through the path segment by segment. Afterwards, reassemble the path by adding slashes between the segments. This approach will handle paths with multiple "." and ".." segments and even paths with too many ".." segments, like "/one/two/../../../../..".

Dot segment removal must be careful to handle multibyte character strings in UTF-8. For instance, to split the path into an array of segments between slashes, it's tempting to use explode( ). However, it is not safe with multibyte characters (though implode( ) is). Instead, use preg_split( ) with the "u" pattern modifier.

Step 1: explode the path at every "/" to create an array of path segments.

$inSegs  = preg_split( '!/!u', $path );

Step 2: loop through the segments. Push non-dot segments onto a stack. Skip "." segments. And on a ".." segment, pop the stack to remove the previous segment.


$outSegs = array( );

foreach ( $inSegs as $seg )
{
    if ( $seg == '' || $seg == '.')
        continue;
    if ( $seg == '..' )
        array_pop( $outSegs );
    else
        array_push( $outSegs, $seg );
}

Step 3: implode the segment stack by adding a "/" between each segment to create a new path.


$outPath = implode( '/', $outSegs );

Step 4: if the original path started or ended with a "/", put that slash back in the new path. To get the last character of a multibyte character string, we can't safely use $path[strlen($path)-1]. Instead, search for the last "/" and see if it is at the end.


if ( $path[0] == '/' )
    $outPath = '/' . $outPath;
if ( $outPath != '/' &&
    (mb_strlen( $path, 'UTF-8' )-1) == mb_strrpos( $path, '/', 0, 'UTF-8' ) )
    $outPath .= '/';
return $outPath;
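
The difference between counting bytes and counting characters is easy to see with a short test. The helper name sketch_ends_with_slash( ) is made up for this example, which assumes the multibyte string extension is enabled. It passes an explicit offset of 0 to mb_strrpos( ) so that the encoding is the fourth argument, which is the argument order current PHP releases expect.

```php
<?php
// Hypothetical helper for illustration only; assumes the mbstring
// extension. TRUE if the last character of $path is "/".
function sketch_ends_with_slash( $path )
{
    $last = mb_strrpos( $path, '/', 0, 'UTF-8' );
    return $last !== FALSE && $last == mb_strlen( $path, 'UTF-8' ) - 1;
}

$path = "/caf\xC3\xA9/";                      // "/café/" encoded as UTF-8
print( strlen( $path ) . "\n" );              // 7 (bytes)
print( mb_strlen( $path, 'UTF-8' ) . "\n" );  // 6 (characters)
var_dump( sketch_ends_with_slash( $path ) );           // bool(true)
var_dump( sketch_ends_with_slash( "/caf\xC3\xA9" ) );  // bool(false)
```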

Comments

Maybe you did mean

Maybe you did mean parse_url() instead of split_url()...

Re: Maybe you did mean

Nope, I mean my split_url( ) function and definitely not the broken PHP parse_url( ) function. At the top of this article I say you need to get split_url( ) from another article, "How to parse and build URLs".

In that article I explain that PHP's old parse_url( ) function doesn't follow the URL specification and has problems with complex URLs and many relative URLs. That's a fatal problem if you're trying to parse and convert a relative URL to an absolute URL. So, that article provides code and explanations for a better URL parser encapsulated in the split_url( ) function.

Think I've found a problem

Hi, great code and thanks for sharing, I have however found one problem.

If the relative URL starts with a '..' segment there isn't anything in the array to be popped in the url_remove_dot_segments function with the line ...

if ( $seg == '..' )
            array_pop( $outSegs );

meaning the example ...

$newUrl = url_to_absolute(
    "http://example.com/directory/sub_directory/index.htm",
    "../product.png" );
print( "$newUrl\n" );

will print out: http://example.com/product.png

instead of: http://example.com/directory/product.png

Karl

Re: Think I've found a problem

When I run your example through the function, I get the correct answer: http://example.com/directory/product.png

Note that the url_remove_dot_segments function is called from three places:

  • The first call is for relative URLs that have a scheme. In this case, the base URL's path is ignored. A leading ".." in the relative URL's path is an illegal path and there is no correct action.
  • The second call is for relative URLs that have a host. Again, the base URL's path is ignored and a leading ".." in the relative URL is not legal.
  • The third call is the case your example tests. But note that the code immediately above the call has concatenated the base and relative URL paths. So the input to the function is "/directory/sub_directory/../product.png". Popping the stack on ".." doesn't pop an empty stack. It gets rid of "sub_directory" and produces the correct final path: "/directory/product.png".

mb_strrchr( )

I have a version of PHP that doesn't support this function, so I tried strrchr() but this stripped the wrong side of the path. I used a reversed function and it worked. Nice function!

Re: mb_strrchr

PHP's standard string functions are not multibyte character safe and they will garble text that uses anything more than ASCII. The 'mb' functions are multibyte safe and are what you should use for all text read from a web page.

All recent versions of PHP support the 'mb' functions. However, some sites exclude them from the build of PHP, or they do not enable them in PHP's 'php.ini' configuration file. See the Multibyte string page at the PHP web site for instructions on configuring PHP. If you'd rather not build PHP yourself, some web searching will no doubt find a pre-built PHP for your platform that already has multibyte string handling enabled. If you are using a web hosting service, ask that service to enable multibyte strings.

bug

In section commented as // Copy base authority.

Unset relative port,user,pass before copy.

Re: bug

You're right and I've updated the code above. Thanks for catching this bug.

bugfix for function url_remove_dot_segments

First of all, thank you very much for sharing these functions! I've included this in a thumbnail generator which gets images from external websites and it works very well.

While testing I've found a minor bug: Each /0/ in a URL will be stripped out. The cause of this is the function url_remove_dot_segments.

By changing:

if ( empty( $seg ) || $seg == '.' ) {

to:

if ((!isnumber($seg)) && ( empty( $seg ) || $seg == '.' )) {

it worked ok.

function isnumber($value) {
  return (($value != '') && ( (string)(float)$value == (string)$value));
}

Re: bugfix for function url_remove_dot_segments

You're right and I've updated the code above. Thanks for catching this bug.

It is sufficient to check if $seg is an empty string without going further to see if it is numeric. So the corrected code reads "if ( $seg == '' || $seg == '.' )".

Encode on query string

I realized one problem: when you set encode to TRUE in split_url() and join_url(), the query string becomes invalid.

For example:

http://www.website.com/go.php?where=home

becomes

http://www.website.com/go.php?where%3Dhome

Replacement for multibyte function

Hi Dave! Thanks for the function: just added it to a software I use on my website to provide snippets of code I paste in my posts. It works like a charm. Since my server still has PHP 4, which doesn't happen to have "mb_strrchr", I've added the following at the end of the file:
if (!function_exists ('mb_strrchr')) {
	function mb_strrchr ($haystack, $needle, $flag = null) {
		preg_match ('@(?P<ret>(.*)'.$needle.')[^'.$needle.']*$@u', $haystack, $matches);
		return $matches ['ret'];
	}
}
It doesn't do everything the original function is supposed to do, just what is required by your function (I hope so, at least...)

Update for PHP5, OOP

David,
It's been a while since this article was posted so I thought you might find this useful. I needed something that was 'classy' and 'OOP', so I combined the three files into one class file and modified the function calls to use class references i.e. $this->split_url, $this->join_url, and $this->url_remove_dot_segments.

Then I can call the function as e.g.

$U=new URL_Handler;
$newURL = $U->url_to_absolute($abs_path, $rel_path);

While this may seem to be adding needless complexity just to be 'objecty', it meets with our house rules on programming style, works better with documentor apps like PHPDoc and Doxygen, reduces the number of includes, and removes the functions from global scope. It seems to work fine in basic tests. I can forward the resulting class file to you if you want.

Re: Update for PHP5, OOP

Thanks. An OOP style is certainly appropriate. It groups together related functions and reduces the impact on the global name space. I don't use classes for these articles purely to keep the topic focused on a single task.

If you'll provide your class, or a link to it, I'll post it here for others to use.

PEAR's Net_URL2

The Net_URL2 package from PEAR includes such functionality already and is unit-tested:

require_once 'Net/URL2.php';
$base = new Net_URL2('http://example.org/foo.html');
$absolute = (string)$base->resolve('relative.html#bar');

Re: PEAR's Net_URL2

Note that this article was written in mid-2008, but the Net/URL2 package wasn't released until the end of 2011. I have not tried the package to verify that it does the same things. It may indeed be a better solution than my sample code.

Also note that the purpose of this article was to show how to approach the problem, but not necessarily provide the ultimate solution. This is not production code, unit tested with all possible variations, and maintained forever.
