PHP tip: How to get a web page content type

Technologies: PHP 4+

A web page’s content type tells you the page's MIME type (such as “text/html” or “image/png”) and the character set used by the page text. You need the character set to interpret the page's characters correctly when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag or an XML file’s <?xml> tag. This tip shows how to get the page's content type and extract the MIME type and character set.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, extract URLs on the page, strip away HTML syntax, punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Code

If you are using CURL...

After calling curl_exec() to get a web page, call curl_getinfo() to get the content type string from the HTTP header, such as:

text/html; charset=utf-8

Use preg_match() to get the MIME type and (optional) character set:

/* Get the content type from CURL */
$content_type = curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );

/* Get the MIME type and character set (the space after ';' is optional) */
preg_match( '@([\w/+.-]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
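
Real servers vary in how they write the header value: the space after the semicolon may be missing, the character set may be quoted, and other parameters may appear. The following is a minimal sketch of a more tolerant parser, not a definitive implementation. The helper name parse_content_type() and its return format are assumptions made for this example; they are not part of PHP or CURL.

```php
<?php
/**
 * Split a Content-Type value into MIME type and character set.
 * parse_content_type() is a hypothetical helper, not a PHP built-in.
 * Missing parts are returned as null.
 */
function parse_content_type( $content_type )
{
    $content_type = (string) $content_type;
    $result = array( 'mime' => null, 'charset' => null );

    /* MIME type: "type/subtype" at the start of the value */
    if ( preg_match( '@^\s*([\w.+-]+/[\w.+-]+)@', $content_type, $m ) )
        $result['mime'] = strtolower( $m[1] );

    /* charset parameter: optional spaces, optional quotes around the value */
    if ( preg_match( '@;\s*charset\s*=\s*["\']?([^\s"\';]+)@i', $content_type, $m ) )
        $result['charset'] = strtolower( $m[1] );

    return $result;
}
```

With CURL, the value returned by curl_getinfo( $ch, CURLINFO_CONTENT_TYPE ) could be passed straight to this function.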

If you are using the fopen wrappers...

After reading the web page using file_get_contents(), or one of the other file functions, use the $http_response_header variable to get the HTTP header as an array of strings. PHP creates this variable in the local scope of the code that made the request, so read it in the same function that read the page. Look for the last entry that starts with “Content-Type:” (there will be multiple entries like this if the page was redirected – the last one is for the returned page):

Content-Type: text/html; charset=utf-8

Use preg_match() to get the MIME type and (optional) character set:

/* Get the content type from the HTTP response */
$content_type = '';
$nlines = count( $http_response_header );
for ( $i = $nlines-1; $i >= 0; $i-- ) {
    $line = $http_response_header[$i];
    /* strncasecmp() also works on PHP 4; substr_compare() needs PHP 5 */
    if ( strncasecmp( $line, 'Content-Type:', 13 ) == 0 ) {
        $content_type = $line;
        break;
    }
}

/* Get the MIME type and character set (the space after ';' is optional) */
preg_match( '@Content-Type:\s*([\w/+.-]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
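
The backward search can be wrapped in a small function so it is easy to try on sample data. This is only a sketch: the name last_content_type() is hypothetical, and the header lines below are made-up sample data showing what a redirected request might produce.

```php
<?php
/**
 * Return the value of the last Content-Type header in a list of raw
 * header lines, or null if there is none.
 * last_content_type() is a hypothetical helper name.
 */
function last_content_type( $headers )
{
    for ( $i = count( $headers ) - 1; $i >= 0; $i-- ) {
        /* Header names are case-insensitive */
        if ( strncasecmp( $headers[$i], 'Content-Type:', 13 ) == 0 )
            return trim( substr( $headers[$i], 13 ) );
    }
    return null;
}

/* Made-up headers: a redirect response followed by the final response */
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Content-Type: text/html; charset=iso-8859-1',
    'Location: http://www.example.com/',
    'HTTP/1.1 200 OK',
    'content-type: text/html;charset=utf-8',
);
$value = last_content_type( $sample );
```

Here $value holds the header value of the final response, not the one sent with the redirect.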

If you have an HTML or XHTML file, but not the HTTP header...

A web page stored in a local file has no HTTP header. Read the file into a string and look for a <meta> tag for (X)HTML where the http-equiv attribute reads “Content-Type”.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Use preg_match() to find the tag and get the MIME type and (optional) character set:

/* Get the MIME type and character set (the space after ';' is optional) */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $page, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];
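
The pattern above expects the attributes in one fixed order with double quotes. Pages also use the reverse order, single quotes, or the shorter HTML5 form <meta charset="utf-8">. The sketch below is one way to accept those variants when only the character set is needed. The function name meta_charset() and the 4096-byte cutoff are assumptions for illustration.

```php
<?php
/**
 * Find a character set declared in a <meta> tag, or return null.
 * meta_charset() is a hypothetical helper; the 4096-byte limit is an
 * arbitrary cutoff chosen for this sketch.
 */
function meta_charset( $page )
{
    /* Declarations belong near the top of <head>, so a prefix is enough */
    $head = substr( $page, 0, 4096 );

    if ( !preg_match_all( '@<meta\b[^>]*>@i', $head, $tags ) )
        return null;

    foreach ( $tags[0] as $tag ) {
        /* Matches charset="utf-8" and content="text/html; charset=utf-8" */
        if ( preg_match( '@charset\s*=\s*["\']?([^\s"\';>/]+)@i', $tag, $m ) )
            return strtolower( $m[1] );
    }
    return null;
}
```

Because it looks for any charset= text inside a <meta> tag, this sketch can be fooled by unusual attribute values; a real HTML parser would be more robust.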

If you have an XML file, but not the HTTP header...

XML data stored in a local file has no HTTP header. Read the file into a string and look for an <?xml> tag:

<?xml version="1.0" encoding="UTF-8" ?>

Use preg_match() to find the tag and get the character set (the tag carries no MIME type; “application/xml” is the generic MIME type for XML):

/* Get the character set (search only inside the <?xml> tag) */
preg_match( '@<\?xml[^>]+encoding=["\']([^\s"\']+)@i', $page, $matches );
$mime = 'application/xml';
if ( isset( $matches[1] ) )
    $charset = $matches[1];
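
An XML file may omit the encoding attribute, or may start with a byte order mark. A minimal sketch that covers both cases follows. The helper name xml_encoding() is hypothetical. The UTF-8 fallback reflects what the XML specification assumes when there is neither an encoding declaration nor a byte order mark; UTF-16 files are not handled here.

```php
<?php
/**
 * Read the encoding from an XML declaration at the start of a string.
 * xml_encoding() is a hypothetical helper. Returns $default when the
 * declaration is missing or names no encoding.
 */
function xml_encoding( $page, $default = 'UTF-8' )
{
    /* Skip a UTF-8 byte order mark if present */
    if ( substr( $page, 0, 3 ) === "\xEF\xBB\xBF" )
        $page = substr( $page, 3 );

    /* The declaration must come first; [^>] keeps the search inside it */
    if ( preg_match( '@^<\?xml\s[^>]*?encoding\s*=\s*["\']([\w.-]+)["\']@i', $page, $m ) )
        return $m[1];

    return $default;
}
```

Anchoring the pattern at the start of the string avoids picking up an attribute named encoding on some later element.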

Example

Read an HTML file, get the character set, and convert to UTF-8:

$text = file_get_contents( $filename );

preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/+.-]+)(;\s*charset=([^\s"]+))?@i',
    $text, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];

/* Convert only when a character set was found */
if ( isset( $charset ) )
    $utf8_text = iconv( $charset, "utf-8", $text );

See the PHP iconv() manual for how to use the function to convert from a web page’s original character set to UTF-8 (or any other character set).
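
The conversion step can fail when the page names no character set or names one that iconv() does not know. The following sketch wraps the call with those checks. The helper name to_utf8() and its choice to return false on failure are assumptions made for this example.

```php
<?php
/**
 * Convert text to UTF-8. Returns false if the character set is
 * missing or the conversion fails.
 * to_utf8() is a hypothetical helper built on iconv().
 */
function to_utf8( $text, $charset )
{
    if ( $charset === null || $charset === '' )
        return false;

    /* Nothing to do if the page already claims UTF-8 */
    if ( strcasecmp( $charset, 'utf-8' ) == 0 )
        return $text;

    /* @ hides the message iconv() raises for unknown names or bad bytes */
    return @iconv( $charset, 'UTF-8', $text );
}
```

A caller might fall back to treating the text as-is, or skip the page, when the function returns false.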

Explanation

The content type of a page gives the file’s MIME type and character set:

  • MIME type: the file’s type, such as text, HTML, an image, a sound, etc. (see Wikipedia on MIME types and a list of standard Internet Media Types). Some common types include:
    MIME type               Meaning
    text/plain              Plain text file
    text/html               HTML web page
    application/xml         XML data file
    application/xhtml+xml   XHTML web page
    image/jpeg              JPEG image
    image/png               PNG image
    image/gif               GIF image
  • Character set: the character encoding of the file (see Wikipedia on Character encoding). UTF-8 is widely used for web pages because it can represent characters in any of the world’s languages. Some sites may still use language-specific encodings, such as Big5 for traditional Chinese characters, ISO 8859-1 for Western European languages, or any of several old Microsoft Windows-specific character sets.

When building a PHP search engine or page analysis tool, the MIME type tells you if you’ve got an HTML page, an image, a sound, or whatever. The character set tells you the encoding used by the page. For most tasks, you’ll need to convert the page to UTF-8 using iconv() before parsing the text.
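
As a rough illustration of using the MIME type to decide what to parse, the sketch below accepts text types and XML-based types and rejects everything else. The function name is_text_mime() and the exact set of accepted types are assumptions; adjust them to the content your tool cares about.

```php
<?php
/**
 * Decide whether a MIME type is text worth parsing for keywords.
 * is_text_mime() and its rules are illustrative assumptions.
 */
function is_text_mime( $mime )
{
    $mime = strtolower( trim( $mime ) );

    /* text/plain, text/html, and other text types */
    if ( strncmp( $mime, 'text/', 5 ) == 0 )
        return true;

    /* XML-based types such as application/xhtml+xml */
    if ( $mime == 'application/xml' || substr( $mime, -4 ) == '+xml' )
        return true;

    /* Images, sounds, and other binary content */
    return false;
}
```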

Content type in the HTTP header

The content type for a page is specified by a text string included in the HTTP response header from the web server, such as:

Content-Type: text/html; charset=utf-8

After the string “Content-Type:”, the first part of the value is the MIME type and the second part the character set. The character set is optional but it should always be included for text-based content. For images, sounds, and other non-text content, there will be no character set specified.

Content type in an HTML/XHTML <meta> tag

If the HTTP header doesn’t include the content type (which may indicate a misconfigured web server), the content type may be included within a <meta> tag in HTML and XHTML files, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

As with the HTTP header, the first part of the type is the MIME type, and the second part is the character set. These values are supposed to match the HTTP header content type, if given. If they don’t, the HTTP header has precedence over the <meta> tag.
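
The precedence rule can be sketched as a small function that takes whatever each source provided. The name choose_charset() is hypothetical, and the UTF-8 default is an arbitrary assumption for pages that declare nothing at all.

```php
<?php
/**
 * Pick a character set: the HTTP header wins, then the in-page
 * declaration, then a default.
 * choose_charset() and its default value are assumptions.
 */
function choose_charset( $header_charset, $page_charset, $default = 'utf-8' )
{
    foreach ( array( $header_charset, $page_charset ) as $candidate ) {
        /* Skip sources that gave no usable value */
        if ( is_string( $candidate ) && $candidate !== '' )
            return strtolower( $candidate );
    }
    return strtolower( $default );
}
```

For example, a header value of “ISO-8859-1” is used even if the <meta> tag says “utf-8”, and the <meta> value is used only when the header gave none.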

The content type <meta> tag is optional, but if it is given it should be as close to the top of the <head> section as possible. This enables web browsers, and PHP code, to find the content type quickly and then use the character set name to guide handling of the rest of the page text.

Content type in an XML <?xml> tag

For XML files, the character set may be within a <?xml> tag near the top of the file, such as:

<?xml version="1.0" encoding="UTF-8" ?>

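XML doesn’t specify a MIME type in the <?xml> tag. Without an HTTP header, use “application/xml”, the generic MIME type for XML. A header, when present, may give a more specific XML-based type such as “application/xhtml+xml”.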

Comments

Great website!

Thank you for the information. I could not find this on any website in Poland

It is true to the Polish

It is true that on Polish pages you cannot find the kind of information that can be found on this page.

Great website!

Brilliant!!!!

I have used this tutorial for my website and it works perfect.

http://blueoo.com

Regards

Very nice and clean

Thanks for these regex ;)

Awesome

This is very helpful!
I use this preg_match for my php script.
You have the greatest manual for.

I must learn regex

Very usefull article, thx for write this

Minor Fix

Thank you. I'm using your regex's in my curl_exec_utf8 function : http://stackoverflow.com/questions/2510868/php-convert-web-page-to-utf8/...

I've noticed many people don't put a space between the semicolon and the charset, so I made that optional. Other than that, thank you.

a BIG "thank you" + a small correction

David, I have NEVER found such a valuable code treasure like in your website!
It is unbelievable! You rock!!!
Then.... I read your resume... and I bow myself! :-)

Now about the code to detect the correct charset; your regular expression will not match some headers (ill behaved?) like those contained in this page:

http://www.cisco.com/web/JP/news/pr/2010/049.html?vs_f=News+RSS+Feeds

returned "Content-type" is "text/html;charset=shift_jis"

In the case above regular expression will not match charset, because of the missing space between ';' and 'c' of 'charset'.

I have modified the regex to:

'@([\w/+]+)(;\s*+charset=(\S+))?@i'

adding '*' right after '\s'

See if it works, in your opinion.

Thanks & kudos again to your valuable work

Nadeau software consulting