PHP tip: How to parse and build URLs

Technologies: PHP 4+

Splitting apart and rebuilding URLs is essential for link checkers, phishing detectors, spiders, and so on. PHP's standard parse_url( ) function works pretty well to parse simple URLs, but it has problems with complex and relative URLs. Once split apart, there is no standard PHP function to reassemble the URL properly. This article reviews the official syntax of URLs, discusses URL parsing complexities, and provides new PHP functions to split apart a URL and join its parts together again.

Introduction

PHP's standard parse_url( ) looks useful. It splits apart a URL and returns an associative array containing the scheme, host, path, and so on. It works well on simple URLs like "http://example.com/index.htm". However, it has problems parsing complex URLs, like "http://example.com/redirect?url=http://elsewhere.com". It is confused by some relative URLs, such as "//example.com/index.htm". And it doesn't properly handle URLs using IPv6 addresses. The parser also is not as strict as it should be and will allow illegal characters and invalid URL structure. This makes it hard to use parse_url( ) reliably for validating links in link checkers and other tools.

Annoyingly, on some URLs, parse_url( ) issues an E_WARNING — something no common library function should do. Instead, it should just return an error so that the calling code can handle the problem gracefully without extra messages being added to the PHP log or blurted onto a generated web page.
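
As a stopgap on PHP versions where parse_url( ) does issue the warning, the call can be wrapped. The sketch below assumes a hypothetical helper named try_parse_url( ), which is not part of PHP or of the code in this article. It uses PHP's @ operator to silence the warning and reduces the result to either an array or FALSE:

```php
<?php
// Hypothetical wrapper: silence any warning from parse_url( ) and
// return either an array of parts or FALSE.
function try_parse_url( $url )
{
    $parts = @parse_url( $url );
    return is_array( $parts ) ? $parts : FALSE;
}

$parts = try_parse_url( 'http://example.com/index.htm' );
if ( $parts === FALSE )
    print( "Could not parse the URL\n" );
else
    print( $parts['host'] . "\n" );
```

This only hides the message. It does nothing about the parsing problems described above.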

These problems may be fixed in a future version of PHP. Until then, below is code for split_url( ). Like parse_url( ), this function parses and splits a URL into its component parts. It conforms to the current URL specification and it parses a wide variety of URL formats. It also returns FALSE on errors without issuing an E_WARNING.

Below is also code for join_url( ) — a function to combine URL parts together again to build a full URL.

Both of these functions automatically handle percent-encoding of the appropriate parts of a URL.

Code

Here's the code first. Detailed explanations of the parser's regular expressions follow in the next few sections.

Download split_url.zip.

The split_url( ) function below parses and splits an absolute or relative URL into an associative array of URL components. The array keys include:

  • "scheme": the type of URL, such as "http" or "mailto".
  • "user": the user name for hierarchical URLs.
  • "pass": the user password for hierarchical URLs.
  • "host": the host name or IP address for hierarchical URLs.
  • "port": the host port number for hierarchical URLs.
  • "path": the file path for hierarchical URLs, such as "/index.htm", or the scheme's arguments for other URL types, such as "user@example.com" for "mailto".
  • "query": the query arguments after the path.
  • "fragment": the fragment name after the path and query.

function split_url( $url, $decode=TRUE )
{
    $xunressub     = 'a-zA-Z\d\-._~\!$&\'()*+,;=';
    $xpchar        = $xunressub . ':@%';

    $xscheme       = '([a-zA-Z][a-zA-Z\d+.\-]*)';

    $xuserinfo     = '((['  . $xunressub . '%]*)' .
                     '(:([' . $xunressub . ':%]*))?)';

    $xipv4         = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})';

    $xipv6         = '(\[([a-fA-F\d.:]+)\])';

    $xhost_name    = '([a-zA-Z\d-.%]+)';

    $xhost         = '(' . $xhost_name . '|' . $xipv4 . '|' . $xipv6 . ')';
    $xport         = '(\d*)';
    $xauthority    = '((' . $xuserinfo . '@)?' . $xhost .
                     '?(:' . $xport . ')?)';

    $xslash_seg    = '(/[' . $xpchar . ']*)';
    $xpath_authabs = '((//' . $xauthority . ')((/[' . $xpchar . ']*)*))';
    $xpath_rel     = '([' . $xpchar . ']+' . $xslash_seg . '*)';
    $xpath_abs     = '(/(' . $xpath_rel . ')?)';
    $xapath        = '(' . $xpath_authabs . '|' . $xpath_abs .
                     '|' . $xpath_rel . ')';

    $xqueryfrag    = '([' . $xpchar . '/?' . ']*)';

    $xurl          = '^(' . $xscheme . ':)?' .  $xapath . '?' .
                     '(\?' . $xqueryfrag . ')?(#' . $xqueryfrag . ')?$';
 
 
    // Split the URL into components.
    if ( !preg_match( '!' . $xurl . '!', $url, $m ) )
        return FALSE;
 
    // Collect the parts. A part that is absent from the URL stays unset.
    $parts = array();

    if ( !empty($m[2]) )        $parts['scheme']  = strtolower($m[2]);

    if ( !empty($m[7]) ) {
        if ( isset( $m[9] ) )   $parts['user']    = $m[9];
        else                    $parts['user']    = '';
    }
    if ( !empty($m[10]) )       $parts['pass']    = $m[11];

    // Compare against '' here: empty( ) would also reject a host of "0".
    if ( isset($m[13]) && $m[13] !== '' )
                                $h=$parts['host'] = $m[13];
    else if ( !empty($m[14]) )  $parts['host']    = $m[14];
    else if ( !empty($m[16]) )  $parts['host']    = $m[16];
    else if ( !empty( $m[5] ) ) $parts['host']    = '';
    if ( !empty($m[17]) )       $parts['port']    = $m[18];

    if ( !empty($m[19]) )       $parts['path']    = $m[19];
    else if ( !empty($m[21]) )  $parts['path']    = $m[21];
    // A relative path may be just "0", which empty( ) would reject.
    else if ( isset($m[25]) && $m[25] !== '' )
                                $parts['path']    = $m[25];

    if ( !empty($m[27]) )       $parts['query']   = $m[28];
    if ( !empty($m[29]) )       $parts['fragment']= $m[30];
 
    if ( !$decode )
        return $parts;
    if ( !empty($parts['user']) )
        $parts['user']     = rawurldecode( $parts['user'] );
    if ( !empty($parts['pass']) )
        $parts['pass']     = rawurldecode( $parts['pass'] );
    if ( !empty($parts['path']) )
        $parts['path']     = rawurldecode( $parts['path'] );
    if ( isset($h) )
        $parts['host']     = rawurldecode( $parts['host'] );
    if ( !empty($parts['query']) )
        $parts['query']    = rawurldecode( $parts['query'] );
    if ( !empty($parts['fragment']) )
        $parts['fragment'] = rawurldecode( $parts['fragment'] );
    return $parts;
}

The join_url( ) function here assembles an associative array of URL components back into a full URL. It works with the output of split_url( ), PHP's standard parse_url( ) function, or with your own values.

function join_url( $parts, $encode=TRUE )
{
    if ( $encode )
    {
        if ( isset( $parts['user'] ) )
            $parts['user']     = rawurlencode( $parts['user'] );
        if ( isset( $parts['pass'] ) )
            $parts['pass']     = rawurlencode( $parts['pass'] );
        // Leave IPv4 and IPv6 addresses, bracketed or not, unencoded.
        if ( isset( $parts['host'] ) &&
            !preg_match( '!^(\[[\da-f.:]+\]|[\da-f.:]+)$!ui', $parts['host'] ) )
            $parts['host']     = rawurlencode( $parts['host'] );
        if ( !empty( $parts['path'] ) )
            $parts['path']     = preg_replace( '!%2F!ui', '/',
                rawurlencode( $parts['path'] ) );
        if ( isset( $parts['query'] ) )
            $parts['query']    = rawurlencode( $parts['query'] );
        if ( isset( $parts['fragment'] ) )
            $parts['fragment'] = rawurlencode( $parts['fragment'] );
    }
 
    $url = '';
    if ( !empty( $parts['scheme'] ) )
        $url .= $parts['scheme'] . ':';
    if ( isset( $parts['host'] ) )
    {
        $url .= '//';
        if ( isset( $parts['user'] ) )
        {
            $url .= $parts['user'];
            if ( isset( $parts['pass'] ) )
                $url .= ':' . $parts['pass'];
            $url .= '@';
        }
        if ( preg_match( '!^[\da-f]*:[\da-f.:]+$!ui', $parts['host'] ) )
            $url .= '[' . $parts['host'] . ']'; // IPv6
        else
            $url .= $parts['host'];             // IPv4 or name
        if ( isset( $parts['port'] ) )
            $url .= ':' . $parts['port'];
        if ( !empty( $parts['path'] ) && $parts['path'][0] != '/' )
            $url .= '/';
    }
    if ( !empty( $parts['path'] ) )
        $url .= $parts['path'];
    if ( isset( $parts['query'] ) )
        $url .= '?' . $parts['query'];
    if ( isset( $parts['fragment'] ) )
        $url .= '#' . $parts['fragment'];
    return $url;
}

Examples

Split an "http" URL into parts (actually, "http" URLs can't have a user name and password like the one below, but this generically illustrates all the parts that could be present in a URL):

$parts = split_url(
    "http://name:pass@example.com:80/products?sku=1234#price" );
print_r( $parts );

Prints:

Array
(
    [scheme] => http
    [user] => name
    [pass] => pass
    [host] => example.com
    [port] => 80
    [path] => /products
    [query] => sku=1234
    [fragment] => price
)

Join parts into an "http" URL:


$url = join_url( $parts );
print( $url );

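Prints:

http://name:pass@example.com:80/products?sku%3D1234#price

The "=" in the query comes back as "%3D" because join_url( ) percent-encodes the whole query by default. Pass FALSE as the second argument to assemble the parts exactly as given.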
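
Since rawurlencode( ) applied to a whole query also encodes the "=" and "&" separators, an application that wants to keep them can encode each key and value itself and then pass the finished query to join_url( ) with $encode set to FALSE (encoding the other parts itself as well). A minimal sketch, assuming a hypothetical helper named build_raw_query( ) that is not part of the article's code:

```php
<?php
// Hypothetical helper: encode each key and value separately so the
// "=" and "&" separators stay readable in the finished query.
function build_raw_query( $fields )
{
    $pairs = array();
    foreach ( $fields as $key => $value )
        $pairs[] = rawurlencode( $key ) . '=' . rawurlencode( $value );
    return implode( '&', $pairs );
}

print( build_raw_query( array( 'sku' => '1234', 'name' => 'a b&c' ) ) . "\n" );
// Prints: sku=1234&name=a%20b%26c
```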

Review of URL syntax

The familiar URL format dates from the 1994 RFC1738 specification from the Internet Engineering Task Force. Since then, there has been a stream of additional specifications to clarify and expand URLs for web file paths, email addresses, news groups, fax numbers, dictionary words, television channels, and lots more. The current version of the URL specification is RFC3986 from 2005.

The format of a URL goes like this (the parts within brackets are optional):

URL = [ scheme ":" ] scheme-specific-part [ "?" query ] [ "#" fragment ]

The scheme name indicates how to interpret and use the rest of the URL. For the "http" and "ftp" schemes, the rest of the URL selects a file from a server. For the "mailto" scheme, it includes one or more email addresses and for the "news" scheme it includes a news group name.

After the scheme-specific-part, some schemes support query and fragment parts. For "http", the query part often contains parameters for a database query, while for "mailto" it may include an email message's subject, body, and more. A fragment name usually selects a section of a web page.

There are 60+ different URL schemes registered with the Internet Assigned Numbers Authority (IANA). Each one has its own specification for parsing the scheme-specific-part of a URL.

The ultimate URL parsing code would handle all 60+ different URL schemes. However, for many uses this level of effort isn't required. Instead, generic URL parsers in Java, Perl, and in this PHP article, simply extract the URL parts common to many schemes. The scheme-specific handling of those parts is left to the application or other library functions.

Common URL parts

At a minimum, a generic URL parser can extract the scheme, scheme-specific-part, query, and fragment parts. But for some cases it can go a bit further.

All URL schemes are divided into two types: hierarchical and nonhierarchical (also called opaque).

  • A hierarchical URL's scheme-specific-part contains a path with a series of words separated by slashes. The "http" and "ftp" schemes are hierarchical.
  • A nonhierarchical URL uses the scheme-specific-part for something else. The "mailto" and "news" schemes are nonhierarchical.

For nonhierarchical URLs, the scheme-specific-part varies too much from scheme to scheme to be parsed by a generic URL parser. Instead, the scheme-specific-part is returned to the application as-is so that the application can figure out what to do.
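
For example, once a generic parser has handed back the parts of a "mailto" URL, the application might interpret them along these lines. This is a sketch only: the $parts array is typed in by hand to stand for undecoded parser output, and the comma-separated address list and "subject" field follow the usual "mailto" conventions.

```php
<?php
// Parts as a generic parser might return them, undecoded, for
// "mailto:ann@example.com,bob@example.com?subject=Hello%20there".
$parts = array(
    'scheme' => 'mailto',
    'path'   => 'ann@example.com,bob@example.com',
    'query'  => 'subject=Hello%20there',
);

// Scheme-specific handling, done by the application.
$addresses = explode( ',', $parts['path'] );

$fields = array();
foreach ( explode( '&', $parts['query'] ) as $pair )
{
    $kv = explode( '=', $pair, 2 );
    $fields[rawurldecode( $kv[0] )] =
        isset( $kv[1] ) ? rawurldecode( $kv[1] ) : '';
}

print( count( $addresses ) . " addresses\n" );
print( $fields['subject'] . "\n" );
```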

For hierarchical URLs, however, the scheme-specific-part has a standard generically parsable format:

scheme-specific-part = [ "//" authority ] [ path ]

The authority part contains the host name or IP address for the server to contact. This part may include a port number at the end (such as 80 for web servers) and sometimes a user name and password in front:

authority = [ user [ ":" pass ] "@" ] host [ ":" port ]
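
To make the layout concrete, here is a loose sketch that pulls an authority string apart along those delimiters. The function name split_authority_loosely( ) is hypothetical, and the code does not check which characters are legal in each piece:

```php
<?php
// Hypothetical helper: loosely split an authority string such as
// "name:pass@example.com:80" into user, pass, host, and port.
function split_authority_loosely( $authority )
{
    $parts = array();

    // Everything before the last "@" is the user information.
    $at = strrpos( $authority, '@' );
    if ( $at !== FALSE )
    {
        $userinfo  = substr( $authority, 0, $at );
        $authority = (string) substr( $authority, $at + 1 );
        $colon     = strpos( $userinfo, ':' );
        if ( $colon === FALSE )
            $parts['user'] = $userinfo;
        else
        {
            $parts['user'] = substr( $userinfo, 0, $colon );
            $parts['pass'] = (string) substr( $userinfo, $colon + 1 );
        }
    }

    // The host is a bracketed IPv6 address or everything up to a colon.
    if ( preg_match( '!^(\[[^\]]*\]|[^:]*)(:(\d*))?$!', $authority, $m ) )
    {
        $parts['host'] = $m[1];
        if ( !empty( $m[2] ) )
            $parts['port'] = $m[3];
    }
    return $parts;
}

print_r( split_authority_loosely( 'name:pass@example.com:80' ) );
```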

Schemes like "telnet" and "pop" accept a user and password. Most other schemes do not. When a password is allowed, it is strongly discouraged for security reasons and the syntax has been deprecated.

The host is usually a name like "www.example.com", but a URL may include an IP address instead. For a classic IPv4 address, the address is four groups of digits separated by dots, as in "http://127.0.0.1/index.php". For an IPv6 address, the address is a series of hexadecimal digits separated by colons (and sometimes dots) and surrounded by square brackets, as in "http://[2001:0db8:0000:0000:0000:0000:1428:57ab]/index.php". The square brackets here are only used when an IPv6 address is in a URL. They are not normally part of an IPv6 address. See RFC3986 for the IPv4 and IPv6 address rules; it replaces the earlier RFC2396 and RFC2732.
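
An application that needs to know which of the three forms it has can lean on PHP's own address validation. The sketch below assumes PHP 5.2 or later for filter_var( ), and the function name classify_host( ) is hypothetical:

```php
<?php
// Hypothetical helper: classify a URL host as 'ipv4', 'ipv6', 'name',
// or 'invalid' (for brackets around something that is not IPv6).
function classify_host( $host )
{
    // Strip the URL-only square brackets before testing for IPv6.
    if ( strlen( $host ) > 1 && $host[0] == '[' &&
        substr( $host, -1 ) == ']' )
    {
        $inner = substr( $host, 1, -1 );
        if ( filter_var( $inner, FILTER_VALIDATE_IP,
            FILTER_FLAG_IPV6 ) !== FALSE )
            return 'ipv6';
        return 'invalid';
    }
    if ( filter_var( $host, FILTER_VALIDATE_IP,
        FILTER_FLAG_IPV4 ) !== FALSE )
        return 'ipv4';
    return 'name';
}

print( classify_host( '127.0.0.1' ) . "\n" );
print( classify_host( '[2001:0db8:0000:0000:0000:0000:1428:57ab]' ) . "\n" );
print( classify_host( 'www.example.com' ) . "\n" );
```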

Absolute and relative URLs

Hierarchical URLs have two forms: absolute and relative:

  • An absolute URL includes a scheme.
  • A relative URL doesn't have a scheme.

The path of a hierarchical URL can be absolute or relative too:

  • An absolute path starts with a slash. Also, any URL with an authority part has an absolute path.
  • A relative path doesn't start with a slash.

This gives you several possible combinations. You can have an absolute URL with an absolute path ("http://example.com/images/logo.png"), a relative URL with an absolute path ("/images/logo.png"), or a relative URL with a relative path ("../images/logo.png"). Though very unusual, the URL rules even allow an absolute URL with a relative path ("http:../images/logo.png") or a relative URL with an absolute path and an authority ("//example.com/images/logo.png").

A relative URL doesn't provide enough information for a web browser to get the selected file. The relative URL must be converted to an absolute URL first by copying missing parts from an absolute base URL. In most cases, the base URL is that for the web page containing the relative URL. For example, say a web page has an absolute URL like "http://example.com/products/index.htm". That page includes an <img> tag with a relative URL like "logo.png". To get this image, a browser copies the scheme, authority, and the first part of the path from the page's URL to create a new absolute URL like "http://example.com/products/logo.png".
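
The example above can be sketched in a couple of lines. This handles only the simplest case, a bare file name resolved against a base URL that has no query or fragment:

```php
<?php
// Simplest case only: the relative URL is a bare file name, so it
// replaces everything after the last slash of the base URL.
$base     = 'http://example.com/products/index.htm';
$relative = 'logo.png';

$absolute = substr( $base, 0, strrpos( $base, '/' ) + 1 ) . $relative;
print( $absolute . "\n" );   // http://example.com/products/logo.png
```

Anything beyond a bare file name (a leading slash, "..", or a query in the base URL) needs the fuller set of rules.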

Relative-to-absolute URL conversion has several rules to follow. I cover these in a separate article:

PHP tip: How to convert a relative URL to an absolute URL. Combine parts of a base absolute URL and a relative URL to form a new absolute URL for a page, image, style sheet, or script.

Putting this all together, the parts to parse from a URL are (in order):

  • scheme — letters, digits, and some punctuation up to the first colon.
  • authority — two slashes and all characters up to the next slash, question mark, hash symbol, or end of string.
    • user — all characters up to a colon or at-sign within the authority.
    • pass — all characters after a colon and up to an at-sign within the authority.
    • host — all characters up to a colon or the end of the authority.
    • port — all digits after a colon and to the end of the authority.
  • path — all characters up to the next question mark, hash symbol, or end of string.
  • query — all characters from the first question mark up to a hash symbol or end of string.
  • fragment — all characters from the first hash symbol to the end of string.

For nonhierarchical URLs, there will be no authority and the scheme-specific-part will be interpreted generically as the path.

Legal URL characters

URLs only allow a limited set of ASCII characters, including digits, letters, and a few punctuation characters:

a-z A-Z 0-9 - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =

Spaces and some ASCII punctuation characters are not allowed, including:

{ } < > | \ ^ ` "

In a URL, you can include these special ASCII characters, and other single-byte Latin-1 (ISO 8859-1) characters, by using percent-encoding. This is done by replacing each special character with a percent sign (%) and a two-digit hexadecimal number for the Latin-1 character. For example, to include a space within a URL, replace it with %20. A vertical bar | is %7C, the copyright symbol © is %A9, and the Yen symbol ¥ is %A5.
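
These values are easy to check with rawurlencode( ). In this small sketch the Latin-1 characters are written as explicit byte values so the result does not depend on the encoding of the PHP file itself:

```php
<?php
// Single-byte characters and their percent-encoded forms.
print( rawurlencode( ' ' )    . "\n" );   // %20
print( rawurlencode( '|' )    . "\n" );   // %7C
print( rawurlencode( "\xA9" ) . "\n" );   // %A9 (Latin-1 copyright sign)
print( rawurlencode( "\xA5" ) . "\n" );   // %A5 (Latin-1 Yen sign)
```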

For characters beyond those in ASCII or ISO-Latin, RFC2718 recommends that characters be encoded first as Unicode's UTF-8 multibyte characters. Then, each byte of those characters is percent-encoded. This is pretty awkward, but it does allow a URL to be built exclusively using ASCII while still supporting the full breadth of characters in Unicode. The implication for a URL parser is that an encoded URL is strictly ASCII, but a decoded URL is in UTF-8 and may include multibyte characters. The parser will have to take care to use string functions that are safe with multibyte characters.
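
A short sketch of the difference: the same copyright sign, this time as UTF-8, becomes two percent-encoded bytes, and the decoded string is two bytes long even though it holds one character.

```php
<?php
// The copyright sign in UTF-8 is the two bytes C2 A9.
$utf8    = "\xC2\xA9";
$encoded = rawurlencode( $utf8 );
print( $encoded . "\n" );             // %C2%A9

$decoded = rawurldecode( $encoded );
print( strlen( $decoded ) . "\n" );   // 2, because strlen( ) counts bytes
```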

Percent-encoding is only allowed in certain parts of a URL, though. You can't percent-encode within the scheme, an IP address, or a host port number. You can use percent-encoding within the user, password, host, path, query, and fragment.

When handling a URL, the URL must be parsed first, and then the relevant parts decoded. This ensures that a percent-encoded / : @ # or ? does not get converted into one of those characters and then confuse the parser. Similarly, when building a URL, the relevant parts need to be percent-encoded first and then assembled into a URL.
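
Here is a small illustration of why the order matters, using a made-up URL whose file name contains an encoded question mark:

```php
<?php
$url = 'http://example.com/docs/a%3Fb.htm?page=2';

// Wrong order: decoding first turns %3F into a real "?", so the
// split lands in the middle of the file name.
$wrong = explode( '?', rawurldecode( $url ), 2 );
print( $wrong[0] . "\n" );                   // http://example.com/docs/a

// Right order: split on the delimiter first, then decode each piece.
$right = explode( '?', $url, 2 );
print( rawurldecode( $right[0] ) . "\n" );   // http://example.com/docs/a?b.htm
print( rawurldecode( $right[1] ) . "\n" );   // page=2
```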

PHP has four functions involved in percent-encoding: urlencode( ), rawurlencode( ), urldecode( ), and rawurldecode( ). Despite the names, urlencode( ) and urldecode( ) should never be used for URL encoding and decoding. These legacy functions follow the rules for HTML form data instead: urlencode( ) turns a space into a plus sign, and urldecode( ) turns a plus sign back into a space. That does not conform to the URL specification and it can produce incorrect results for some URLs. Instead, always use rawurlencode( ) and rawurldecode( ), which work properly and conform to the specification.
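
The difference is easy to see with a string that contains a space and a plus sign:

```php
<?php
$text = 'a b+c';
print( urlencode( $text )    . "\n" );   // a+b%2Bc
print( rawurlencode( $text ) . "\n" );   // a%20b%2Bc

// Decoding a literal plus sign:
print( urldecode( 'a+b' )    . "\n" );   // a b
print( rawurldecode( 'a+b' ) . "\n" );   // a+b
```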

Explanation of split_url( )

There are two parsing approaches we can use: loose and strict.

A loose parser simply divides a URL string at boundaries delimited by the : @ / ? and # characters. There are no restrictions about the format of text in between these delimiters. This is the parsing approach used by many URL parsers and it is roughly the way PHP's parse_url( ) behaves. The URL specification even recommends a regular expression to do this:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

scheme = $2, authority = $4, path = $5, query = $7, fragment = $9

The parentheses above are numbered to show you where the scheme, authority, path, query, and fragment start. The authority can be further parsed in the same way.
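
As a quick sketch, here is that loose expression run through preg_match( ) on one of the URLs from the introduction, with a fragment added for illustration:

```php
<?php
// The loose expression from the specification, using "!" as the
// pattern delimiter because the pattern itself contains "/" and "#".
$loose = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';

$url = 'http://example.com/redirect?url=http://elsewhere.com#top';
if ( preg_match( $loose, $url, $m ) )
{
    print( 'scheme    = ' . $m[2] . "\n" );   // http
    print( 'authority = ' . $m[4] . "\n" );   // example.com
    print( 'path      = ' . $m[5] . "\n" );   // /redirect
    print( 'query     = ' . $m[7] . "\n" );   // url=http://elsewhere.com
    print( 'fragment  = ' . $m[9] . "\n" );   // top
}
```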

A drawback of this parsing approach is that it is easily confused by URLs that can be parsed several different ways. The only way to choose the right way is to get more specific about the structure and legal characters for the individual URL parts. This is the core problem with parse_url( ): its parser is too loose to handle complex or ambiguous URLs.

To do better, a strict parser is needed that more closely follows the formal syntax of a URL. This is the approach taken by split_url( ) here. Let's walk through the code step by step to build a large regular expression that splits the URL into the parts we need.

Step 1. Define the allowed characters. The specification divides characters into three groups: gen-delims, sub-delims, and unreserved. The gen-delims are the delimiter characters between major URL parts, like : / ? # [ ] @. The sub-delims are characters sometimes used within a URL part to delimit smaller pieces, such as a keyword=value piece of a query. And the unreserved characters are everything else, such as letters and digits. For our purposes, we can combine the unreserved and sub-delims into one set.

$xunressub     = 'a-zA-Z\d\-._~\!$&\'()*+,;=';

The specification's pchar (path characters) includes this same set of characters, plus : and @. The latter two characters are only delimiters at the start of a URL, after which they are legal to use within a path. The code here also adds % so that percent-encoded characters are accepted.

$xpchar        = $xunressub . ':@%';

Step 2. Define the parts. Now, let's go through the parts of a URL and write a regular expression to parse each one.

A scheme is a string of letters, digits, +, -, and ., but it must start with a letter:

$xscheme       = '([a-zA-Z][a-zA-Z\d+.\-]*)';
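
One detail to watch in a character class like this: a hyphen placed between two characters forms a range, so writing the punctuation as +-. would also admit a comma. Escaping the hyphen and placing it last avoids that. A small check:

```php
<?php
// "+-." inside brackets is the range from "+" to ".", which also
// contains "," and "-".
print( preg_match( '!^[+-.]$!', ',' ) . "\n" );    // 1: comma accepted

// With the hyphen escaped and placed last, only + . and - match.
print( preg_match( '!^[+.\-]$!', ',' ) . "\n" );   // 0: comma rejected
print( preg_match( '!^[+.\-]$!', '-' ) . "\n" );   // 1: hyphen accepted
```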

The user:password part of a URL's authority is an optional user name followed by an optional colon and password. The user name and password may include the unreserved and sub-delims characters, plus the % character for percent-encoding.

$xuserinfo     = '((['  . $xunressub . '%]*)' .
                 '(:([' . $xunressub . ':%]*))?)';

The host part of an authority may be an IPv4 address, an IPv6 address, or a name. An IPv4 address is four groups of one to three digits, separated by dots. An IPv6 address is a series of hexadecimal digit groups, separated by colons or dots, and surrounded by square brackets. And a host name is a sequence of letters, digits, and - and ., plus % for percent-encoded characters.

$xipv4         = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})';
$xipv6         = '(\[([a-fA-F\d.:]+)\])';
$xhost_name    = '([a-zA-Z\d-.%]+)';
$xhost         = '(' . $xhost_name . '|' . $xipv4 . '|' . $xipv6 . ')';

The port part of an authority is a string of digits. And the authority is an optional user:password, host, and optional port.

$xport         = '(\d*)';
$xauthority    = '((' . $xuserinfo . '@)?' . $xhost .
                 '?(:' . $xport . ')?)';

A hierarchical path is a series of segments that each start with a slash:

$xslash_seg    = '(/[' . $xpchar . ']*)';

A path has three forms: an authority and absolute path, an absolute path without an authority, or a relative path without an authority. The authority always starts with two slashes. An absolute path always starts with one slash, optionally followed by a relative path, so it can never begin with two slashes. And a relative path never starts with a slash.

$xpath_authabs = '((//' . $xauthority . ')((/[' . $xpchar . ']*)*))';
$xpath_rel     = '([' . $xpchar . ']+' . $xslash_seg . '*)';
$xpath_abs     = '(/(' . $xpath_rel . ')?)';
$xapath        = '(' . $xpath_authabs . '|' . $xpath_abs .
                 '|' . $xpath_rel . ')';

The query and fragment parts of a URL both allow the same set of legal characters: a pchar plus a slash and question mark (so a query can include another question mark, like "http://example.com/products?sku=?").

$xqueryfrag    = '([' . $xpchar . '/?' . ']*)';

And finally, a URL is an optional scheme, an optional authority and path, an optional query, and an optional fragment.

$xurl          = '^(' . $xscheme . ':)?' .  $xapath . '?' .
                 '(\?' . $xqueryfrag . ')?(#' . $xqueryfrag . ')?$';

Step 3. Parse the URL. Use preg_match( ) to parse the URL into an array of parenthesized parts. Return FALSE if the URL does not match the expression, for example because it contains an illegal character such as a space.

if ( !preg_match( '!' . $xurl . '!', $url, $m ) )
        return FALSE;

Step 4. Collect the parts. Run through the array of parsed parts and collect the relevant ones. Watch for cases where the URL has delimited a part, but left the part empty. For instance, in "ftp://user:@example.com", the password is delimited by the ":" after "user", but it's left empty. In "file:///volumes/home", the authority between the first "//" and the "/volumes/home" is empty. In "http://example.com/products?" the query is introduced by a question mark, but left empty. And in "http://example.com/products#" the hash sign starts an empty fragment. These are all legal URLs and the parser must return values that let the application distinguish between a nonexistent part and an empty part. Here, this is done by leaving a part unset if it is nonexistent, and including an empty string if it exists but is empty.

// Collect the parts. A part that is absent from the URL stays unset.
$parts = array();

if ( !empty($m[2]) )        $parts['scheme']  = strtolower($m[2]);

if ( !empty($m[7]) ) {
    if ( isset( $m[9] ) )   $parts['user']    = $m[9];
    else                    $parts['user']    = '';
}
if ( !empty($m[10]) )       $parts['pass']    = $m[11];

// Compare against '' here: empty( ) would also reject a host of "0".
if ( isset($m[13]) && $m[13] !== '' )
                            $h=$parts['host'] = $m[13];
else if ( !empty($m[14]) )  $parts['host']    = $m[14];
else if ( !empty($m[16]) )  $parts['host']    = $m[16];
else if ( !empty( $m[5] ) ) $parts['host']    = '';
if ( !empty($m[17]) )       $parts['port']    = $m[18];

if ( !empty($m[19]) )       $parts['path']    = $m[19];
else if ( !empty($m[21]) )  $parts['path']    = $m[21];
// A relative path may be just "0", which empty( ) would reject.
else if ( isset($m[25]) && $m[25] !== '' )
                            $parts['path']    = $m[25];

if ( !empty($m[27]) )       $parts['query']   = $m[28];
if ( !empty($m[29]) )       $parts['fragment']= $m[30];
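
On the calling side, the unset-versus-empty distinction is tested with isset( ) rather than empty( ). A short sketch, with the parts for "http://example.com/products?" typed in by hand:

```php
<?php
// Parts as they might come back for "http://example.com/products?"
// (written out by hand for illustration).
$parts = array(
    'scheme' => 'http',
    'host'   => 'example.com',
    'path'   => '/products',
    'query'  => '',
);

var_dump( isset( $parts['query'] ) );      // bool(true):  present but empty
var_dump( isset( $parts['fragment'] ) );   // bool(false): not present

// empty( ) cannot tell these apart, and it also treats "0" as empty.
var_dump( empty( $parts['query'] ) );      // bool(true)
var_dump( empty( $parts['fragment'] ) );   // bool(true)
$zero = '0';
var_dump( empty( $zero ) );                // bool(true)
```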

Step 5. Decode percent-encoding. Use rawurldecode( ) to decode percent-encoding on the user, pass, path, query, and fragment parts, and on the host part if it was a host name, and not an IP address.

if ( !$decode )
    return $parts;
if ( !empty($parts['user']) )
    $parts['user']     = rawurldecode( $parts['user'] );
if ( !empty($parts['pass']) )
    $parts['pass']     = rawurldecode( $parts['pass'] );
if ( !empty($parts['path']) )
    $parts['path']     = rawurldecode( $parts['path'] );
if ( isset($h) )
    $parts['host']     = rawurldecode( $parts['host'] );
if ( !empty($parts['query']) )
    $parts['query']    = rawurldecode( $parts['query'] );
if ( !empty($parts['fragment']) )
    $parts['fragment'] = rawurldecode( $parts['fragment'] );
return $parts;

That's it. The URL is parsed.

This set of regular expressions closely follows the URL specification, but does diverge slightly:

  • IPv4 addresses must be four groups of numbers in the range 0 to 255. This parser doesn't make sure the numbers are in this range. IPv4 address validity checking is left to the application and network functions elsewhere in PHP.
  • IPv6 addresses have many forms with groups of numbers separated by colons and dots. This parser doesn't look at IPv6 address structure. Instead, IPv6 address validity checking is left to networking functions.
  • IPv6 addresses can have a "scope id" appended to the address. The URL specification doesn't allow this and it isn't clear that it is useful in a URL context. So, this parser doesn't allow IPv6 scope ids.
  • The specification optimistically defines an "IPvFuture" syntax for an undefined future form of IP address. Since that form isn't defined yet, there's little point in parsing it. So, this parser doesn't.
  • Host names have a hierarchical format defined by the Domain Name System (DNS) RFC1035 specification, such as "www.example.com". The URL syntax, and this parser, just treat a host name as a string of characters without structure. Host name validity checking is left to DNS functions.
  • Host names are supposed to start with a letter under RFC1035 (RFC1123 later relaxed this to allow a leading digit). Again, host name validity checking is left to DNS functions.
  • Port numbers are just a string of digits in the URL specification. In practice, they are limited to five digits and the range 0 to 65535 (16 bits). Port number validity checking is left to networking functions.
  • A percent-encoded byte starts with a percent character followed by two hexadecimal digits. This parser just allows a percent character in URL parts that allow percent-encoding. It doesn't check that two hex digits follow each percent character, and neither does rawurldecode( ), which leaves a malformed sequence unchanged. Encoding validity checking is left to the application.
  • For a relative URL without an authority, the first path segment may not contain a colon. This parser doesn't enforce that restriction. It would have made the parser's code even more complicated.
  • The user:password format in a URL has been deprecated since it exposes a password as clear text within the URL. Nevertheless, this parser recognizes the format.
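
Two of the checks left to the application are short enough to sketch here. Both function names are hypothetical, and both are meant for parts that have not been decoded yet (that is, split_url( ) called with $decode set to FALSE):

```php
<?php
// Hypothetical helpers for checks the parser leaves to the caller.

// A port must be 1 to 5 digits and no larger than 65535.
function is_valid_port( $port )
{
    return preg_match( '!^\d{1,5}$!', $port ) == 1 &&
        (int) $port <= 65535;
}

// Every "%" must be followed by two hexadecimal digits.
function has_valid_percent_encoding( $text )
{
    return preg_match( '!%(?![\da-fA-F]{2})!', $text ) == 0;
}

var_dump( is_valid_port( '80' ) );                    // bool(true)
var_dump( is_valid_port( '65536' ) );                 // bool(false)
var_dump( has_valid_percent_encoding( 'a%20b' ) );    // bool(true)
var_dump( has_valid_percent_encoding( '100%' ) );     // bool(false)
```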

Explanation of join_url( )

Reversing the parsing process is considerably easier. Concatenate the scheme, user, pass, host, port, path, query, and fragment. Optionally percent-encode everything except the scheme, host IP addresses, and the port. For an IPv6 address, add square brackets around the address.

Step 1. Encode special characters using percent-encoding. Only encode the host if it is a name, and not an IPv4 or IPv6 address. Technically, percent-encoding a slash in a path is fine, but conventionally slashes are left as "/" instead of "%2F". So, the code here uses preg_replace( ) to put the slashes back.

if ( $encode )
{
    if ( isset( $parts['user'] ) )
        $parts['user']     = rawurlencode( $parts['user'] );
    if ( isset( $parts['pass'] ) )
        $parts['pass']     = rawurlencode( $parts['pass'] );
    // Leave IPv4 and IPv6 addresses, bracketed or not, unencoded.
    if ( isset( $parts['host'] ) &&
        !preg_match( '!^(\[[\da-f.:]+\]|[\da-f.:]+)$!ui', $parts['host'] ) )
        $parts['host']     = rawurlencode( $parts['host'] );
    if ( !empty( $parts['path'] ) )
        $parts['path']     = preg_replace( '!%2F!ui', '/',
            rawurlencode( $parts['path'] ) );
    if ( isset( $parts['query'] ) )
        $parts['query']    = rawurlencode( $parts['query'] );
    if ( isset( $parts['fragment'] ) )
        $parts['fragment'] = rawurlencode( $parts['fragment'] );
}
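
To see the effect of the path handling in isolation, here is a small sketch on a made-up path. It uses str_replace( ), since rawurlencode( ) emits uppercase hex digits, and also shows an equivalent approach that encodes one segment at a time:

```php
<?php
$path = '/my docs/a&b.htm';

// Encode everything, then put the slashes back.
$encoded = str_replace( '%2F', '/', rawurlencode( $path ) );
print( $encoded . "\n" );                    // /my%20docs/a%26b.htm

// Equivalent: encode one segment at a time and rejoin with slashes.
$segments = array_map( 'rawurlencode', explode( '/', $path ) );
$rejoined = implode( '/', $segments );
print( $rejoined . "\n" );                   // /my%20docs/a%26b.htm
```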

Step 2. Add the scheme if present.

$url = '';
if ( !empty( $parts['scheme'] ) )
    $url .= $parts['scheme'] . ':';

Step 3. Add the authority if present. The authority must have a host, but the user, pass, and port are optional. Add square brackets around IPv6 addresses if they aren't there already. When there is an authority, prepend a slash to the URL's path if the path doesn't already have one.


if ( isset( $parts['host'] ) )
{
    $url .= '//';
    if ( isset( $parts['user'] ) )
    {
        $url .= $parts['user'];
        if ( isset( $parts['pass'] ) )
            $url .= ':' . $parts['pass'];
        $url .= '@';
    }
    if ( preg_match( '!^[\da-f]*:[\da-f.:]+$!ui', $parts['host'] ) )
        $url .= '[' . $parts['host'] . ']'; // IPv6
    else
        $url .= $parts['host'];             // IPv4 or name
    if ( isset( $parts['port'] ) )
        $url .= ':' . $parts['port'];
    if ( !empty( $parts['path'] ) && $parts['path'][0] != '/' )
        $url .= '/';
}
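
The bracket test can be tried on its own. This sketch wraps the same pattern in a hypothetical helper named bracket_ipv6( ):

```php
<?php
// Hypothetical helper using the same pattern as the step above:
// optional hex digits, a colon, then more hex digits, colons, or dots.
function bracket_ipv6( $host )
{
    if ( preg_match( '!^[\da-f]*:[\da-f.:]+$!i', $host ) )
        return '[' . $host . ']';   // bare IPv6 address
    return $host;                   // IPv4, name, or already bracketed
}

print( bracket_ipv6( '::1' ) . "\n" );           // [::1]
print( bracket_ipv6( '[::1]' ) . "\n" );         // [::1]
print( bracket_ipv6( '127.0.0.1' ) . "\n" );     // 127.0.0.1
print( bracket_ipv6( 'example.com' ) . "\n" );   // example.com
```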

Step 4. Finally, add the path, query, and fragment, if present.


if ( !empty( $parts['path'] ) )
    $url .= $parts['path'];
if ( isset( $parts['query'] ) )
    $url .= '?' . $parts['query'];
if ( isset( $parts['fragment'] ) )
    $url .= '#' . $parts['fragment'];
return $url;

Comments

Bug in url_remove_dot_segments()

Thanks for building such a robust URL conversion utility. I have however found a small bug in your url_remove_dot_segments() method.

The line:
if ( empty( $seg ) || $seg == '.' )
ruins some URLs.

According to http://ca.php.net/manual/en/function.empty.php empty will assume that 0 and '0' are 'false' and therefore don't deserve to be output as part of the cleaned URL.

It should be replaced with this line instead:
if ( $seg === '' || $seg == '.' )
Using the triple equals operator will make sure that the correct test for an empty string will take place.
