PHP tip: How to decode HTML entities on a web page

Technologies: PHP 4.3.0+, UTF-8

HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean parsable text. PHP's standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML tags, punctuation, symbol characters, and numbers, and break a page down into a keyword list.

Code

HTML's character reference syntax enables a web page to use special characters that aren't supported by the page's normal character encoding. For example, if you use the ISO-8859-1 Latin-1 encoding common for English, German, Spanish, and other European languages, then the encoding only supports 191 printable characters. That gets you basic punctuation, numbers, and letters with and without some diacritical marks. But there is no character for the Thai Baht currency ฿, the trademark symbol ™, a right arrow →, the Greek letter pi π, the planet Mercury ☿, or a check mark ✓. To add these or thousands of other symbols to a page, you either need to switch to a richer character encoding, like UTF-8, or use HTML character references.

There are three forms for an HTML character reference:

  • Name. Enter an ampersand, the name of a character, and a semi-colon. Example: &copy; gives ©
  • Decimal. Enter an ampersand, a #, the decimal number of a Unicode character, and a semi-colon. Example: &#169; gives ©
  • Hexadecimal. Enter an ampersand, a #, an x, the hexadecimal number of a Unicode character, and a semi-colon. Example: &#xA9; gives ©
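
As a quick check, all three forms of the copyright reference should decode to the same character. This is a minimal sketch using the html_entity_decode() call covered below; the $forms and $decoded variable names are mine, chosen only for illustration.

```php
<?php
/* The same character written in each of the three reference forms */
$forms = array(
    'named'       => '&copy;',
    'decimal'     => '&#169;',
    'hexadecimal' => '&#xA9;',
);

/* Decode each form into UTF-8 */
$decoded = array();
foreach ( $forms as $form => $reference ) {
    $decoded[$form] = html_entity_decode( $reference, ENT_QUOTES, 'UTF-8' );
}

/* Each entry now holds the two-byte UTF-8 sequence for the copyright symbol */
foreach ( $decoded as $form => $character ) {
    echo $form, ': ', bin2hex( $character ), "\n";
}
```

Printing the bytes in hex avoids depending on how your terminal or browser renders the character.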

The named form of a character reference is called an HTML entity. There are a few hundred of these entities defined in HTML 4. But there are over 100,000 characters in Unicode available with decimal or hexadecimal character references. However, font support for all of these characters is still incomplete on today's Windows, Mac, and Linux systems. Before you use Unicode characters on a web page, do some testing on different browsers for different operating systems.

To do text processing on a web page, you need to convert HTML entities and numeric character references into normal characters. You'll need to use a Unicode-based encoding, such as UTF-8 or UTF-16, that's capable of representing all of these characters. PHP has two functions that can decode character references into Unicode characters: html_entity_decode() and mb_convert_encoding(). Both are easy to use.

Using html_entity_decode

The html_entity_decode() function converts HTML character references into characters:

$utf8_text = html_entity_decode( $text, ENT_QUOTES, "utf-8" ); 

The 1st argument is the text string to decode. The decoded version of the string is returned.

The 2nd argument tells the function how to treat quotes. Use ENT_QUOTES to convert HTML entity single and double quotes back to normal quote characters.

The 3rd argument selects the character set to decode into. The argument is optional, and in the PHP versions covered here it defaults to "ISO-8859-1" (Latin-1); PHP 5.4 and later default to UTF-8. With the Latin-1 default, html_entity_decode() only decodes character references with Latin-1 equivalents. As I noted above, there are only 191 printable Latin-1 characters. All other character references are left undecoded, which can confuse further text processing. To decode all HTML character references you must use a Unicode encoding like "utf-8" as the 3rd argument.
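
To see the difference the 3rd argument makes, here is a small sketch that decodes the same string into Latin-1 and into UTF-8. The sample text and variable names are made up for illustration. The euro sign has no Latin-1 equivalent, so I'd expect its entity to survive the Latin-1 decode untouched.

```php
<?php
/* Sample text: the euro sign is not in Latin-1, but e-acute is */
$text = 'Cost: &euro; and &eacute;';

/* Decode into Latin-1: only references with Latin-1 equivalents are converted */
$latin1_text = html_entity_decode( $text, ENT_QUOTES, 'ISO-8859-1' );

/* Decode into UTF-8: every reference is converted */
$utf8_text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );

/* Report whether an undecoded &euro; is still present in each result */
echo 'Latin-1 result still has &euro;: ',
    ( strpos( $latin1_text, '&euro;' ) !== false ? 'yes' : 'no' ), "\n";
echo 'UTF-8 result still has &euro;: ',
    ( strpos( $utf8_text, '&euro;' ) !== false ? 'yes' : 'no' ), "\n";
```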

Using mb_convert_encoding

The mb_convert_encoding() function is one of many multibyte character functions in PHP. It can be used to convert between about 60 different encodings. One such encoding is for HTML entities and character references:

$utf8_text = mb_convert_encoding( $text, "utf-8", "HTML-ENTITIES" );

The 1st argument is the text string to decode, and the decoded version is returned.

The 2nd argument is the encoding to return, such as "utf-8".

The 3rd argument selects the encoding to convert from. In this case, "HTML-ENTITIES" tells the function to convert HTML entities into UTF-8 characters.

This function works well for named HTML entities and decimal character references. Unfortunately, as of PHP 5.2, mb_convert_encoding() has a known bug that improperly converts hexadecimal character references into nonsense characters. Until this is fixed and widely available, use html_entity_decode().

Example

Read an HTML file, convert to UTF-8, strip out tags, and decode HTML entities:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );

/* Assumption: fall back to Latin-1 when the page declares no charset */
$encoding = isset( $matches[3] ) ? $matches[3] : "ISO-8859-1";

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );

Explanation

Every web page uses a "character encoding" that can be specified in a web server's response header or in a <meta> tag at the top of the HTML file. In the past, the old ASCII or Latin-1 encodings were widely used. Both of these represent characters with a single byte. A byte can hold 256 different values (ASCII uses only the first 128), and some of these are reserved for unprintable "control" characters. This leaves only 95 printable characters in ASCII and 191 in Latin-1. This is far too few to represent the many thousands of different letters, digits, and punctuation symbols in the world's languages. There are also character encodings for Chinese, Japanese, and others that use multiple bytes per character, but they too are limited to only those symbols used by the language.

Ideally, everybody will convert their web pages to use the generic UTF-8 character encoding, which can directly support over 100,000 characters in the international Unicode character set. This is slowly happening, but there are still millions of legacy pages using language-specific encodings. And many web authoring tools that support UTF-8 still default to some other encoding. Unless web authors know to change their settings to UTF-8, they continue to create non-UTF-8 pages. And as long as this is the case, HTML character references will be embedded in those pages whenever the author needs to use special characters.

If you're processing HTML text to build a keyword list for a search engine, or filter text for spam and bad language, you'll need to convert character references into normal characters first. If you don't, your text parser will see &copy; as six separate characters instead of the single © copyright symbol.

There are three parts to handling HTML character references:

  • Enable PHP's multibyte string extension.
  • Decode HTML entities into UTF-8.
  • Use multibyte string functions for further processing of the decoded string.

Getting multibyte character string support in PHP

The Multibyte String Extension is available for PHP 4.3 and later. It is essential for handling international text.

This extension must be enabled when compiling the PHP engine. Fortunately, most PHP distributions come with this extension already enabled. You can check your installation by running the phpinfo() function and looking in the Configure Command section at the top of its output for --enable-mbstring.
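
A script can also check for the extension at run time instead of reading the phpinfo() output. This is a minimal sketch; the $has_mbstring variable name and the messages are my own.

```php
<?php
/* True when the Multibyte String Extension is compiled in or loaded */
$has_mbstring = extension_loaded( 'mbstring' );

if ( $has_mbstring ) {
    /* Show the internal encoding the mb_* functions use by default */
    echo 'mbstring is available; internal encoding: ',
        mb_internal_encoding(), "\n";
} else {
    echo "mbstring is not available; mb_* functions cannot be used.\n";
}
```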

The extension has several configuration options that you can set in your php.ini file. One feature enables the extension to be set to automatically convert between encodings. However, this can be dangerous because it presumes that all text should be converted in the same way, and without intervention by your own code. It is usually best to leave the extension's automatic conversions disabled. This is the default.

Decoding HTML character references into multibyte character strings

PHP's html_entity_decode() function uses a translation table to convert HTML named entities and character references into returned characters. The 3rd argument to the function selects the translation table to use. If you use the default Latin-1 table, multibyte character entities will not be translated. To decode all HTML entities and generate multibyte characters, you have to decode into a richer character encoding, like "utf-8":

$utf8_text = html_entity_decode( $text, ENT_QUOTES, "utf-8" ); 

As I noted earlier, PHP's mb_convert_encoding() function can do the same thing. Unfortunately, as of PHP 5.2 it has a bug that prevents it from converting hexadecimal character references. Since these are quite common, it currently isn't safe to use mb_convert_encoding().

If the page includes HTML tags, strip them out before decoding HTML entities. Decoding some entities, such as &lt; and &gt;, generates characters that will confuse tag parsing.
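
The effect of the ordering can be sketched with a short string that contains both a real tag and an entity-encoded one. The sample HTML and the $good and $bad names are made up for illustration.

```php
<?php
/* A real <p> tag wrapped around text that mentions an encoded <em> tag */
$html = '<p>Use the &lt;em&gt; tag for emphasis &copy;</p>';

/* Recommended order: strip tags first, then decode entities */
$good = html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' );

/* Reversed order: the decoded <em> now looks like a real tag and is stripped */
$bad = strip_tags( html_entity_decode( $html, ENT_QUOTES, 'UTF-8' ) );

echo $good, "\n";   /* keeps the literal <em> in the text */
echo $bad, "\n";    /* the <em> has been removed along with the <p> */
```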

Handling multibyte character strings

The html_entity_decode() function's returned string will include multibyte characters if any HTML entities needed them. Many standard PHP functions are not multibyte character-aware. The strlen() function, for instance, actually counts bytes, not characters, and it will return a wrong answer when used with multibyte characters.

To process multibyte characters, you must use multibyte string functions, such as mb_strlen(). There are multibyte character equivalents for many of PHP's string functions. When using these multibyte string functions, remember to pass "utf-8" as the encoding type on every function call. There is nothing in the string itself to indicate the encoding it is using. If you aren't explicit, the functions will use the default PHP internal encoding, which may not be UTF-8. You could get garbled text.
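
Here is a small sketch of the difference, assuming the mbstring extension is enabled. The sample word is built from decimal character references so the example does not depend on how the script file itself is encoded; the variable names are mine.

```php
<?php
/* "Islamabad" with three a-macron characters (U+0101, decimal 257) */
$word = html_entity_decode( 'Isl&#257;m&#257;b&#257;d', ENT_QUOTES, 'UTF-8' );

/* strlen() counts bytes: each a-macron takes two bytes in UTF-8 */
$byte_count = strlen( $word );

/* mb_strlen() counts characters when told the encoding explicitly */
$char_count = mb_strlen( $word, 'UTF-8' );

/* mb_substr() cuts on character boundaries, never mid-character */
$first_four = mb_substr( $word, 0, 4, 'UTF-8' );

echo "bytes: $byte_count, characters: $char_count\n";   /* bytes: 12, characters: 9 */
```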

While there are no multibyte character equivalents for the preg_* functions, these functions can handle UTF-8 on their own if you add the /u pattern modifier.
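
As an illustration, the following sketch uses the /u modifier with the \p{L} "any letter" class to pull words out of UTF-8 text. The sample text and variable names are made up, and the pattern is just one plausible way to split words, not the only one.

```php
<?php
/* Build a UTF-8 string containing accented letters */
$text = html_entity_decode( 'Isl&#257;m&#257;b&#257;d caf&eacute;', ENT_QUOTES, 'UTF-8' );

/* With /u, the pattern and subject are treated as UTF-8,
   so \p{L}+ matches runs of letters including accented ones */
$count = preg_match_all( '/\p{L}+/u', $text, $matches );

/* $matches[0] holds the whole-pattern matches, one per word */
$words = $matches[0];

echo "found $count words\n";   /* found 2 words */
```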

Comments

how to convert Islāmābād to islamabad

Islāmābād contains special characters, and I am using it in a cURL call, which gives me an error.

URLs with non-ASCII characters

The short answer: The "Islāmābād" string uses illegal characters for a URL. Convert your URL to UTF-8 using iconv( ) then percent-encode using rawurlencode( ).

The URL specification RFC3986 requires that all URL characters be from a limited subset of the ASCII character set:

a-z A-Z 0-9 - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =

Spaces and some ASCII punctuation characters are not allowed, including:

{ } < > | \ ^ ` "

To include the above ASCII punctuation characters, or characters from the Latin-1 character set, you can use percent-encoding. This replaces each punctuation or Latin-1 single-byte character with a % sign and a 2-digit hex code for the character. For instance, a space character becomes %20.

You can percent-encode the parts of a URL using PHP's standard rawurlencode( ) function. Pass it each path segment or query value, not the whole URL, because it also encodes delimiters such as : and /. Don't use the similar urlencode( ) function for this. It encodes spaces as + signs, which is the form-encoding convention rather than the percent-encoding described in the URL specification.

If you have characters beyond ASCII and Latin-1, the RFC 2718 specification recommends that characters be encoded first as Unicode's UTF-8. This represents each non-ASCII character with two or more bytes. Then percent-encode that UTF-8 string to create the final URL.

Putting this all together:

  1. Encode your URL as UTF-8, if it isn't already. See PHP's iconv( ) function.
  2. Encode your UTF-8 URL with percent-encoding. See PHP's rawurlencode( ) function.
  3. Pass that URL to CURL.
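
The steps above can be sketched as follows. This is an illustration under some assumptions: the example.com URL is hypothetical, the Latin-1 input is made up, and only the non-ASCII path segment is passed to rawurlencode( ) so that the : and / delimiters stay intact.

```php
<?php
/* Step 1: convert to UTF-8 if needed.
   Hypothetical Latin-1 input: "cafe" with e-acute as the single byte 0xE9 */
$latin1_word = "caf\xE9";
$utf8_word = iconv( 'ISO-8859-1', 'UTF-8', $latin1_word );

/* Step 2: percent-encode the UTF-8 bytes */
$encoded_word = rawurlencode( $utf8_word );

/* The same two steps for text that is already UTF-8 */
$city = html_entity_decode( 'Isl&#257;m&#257;b&#257;d', ENT_QUOTES, 'UTF-8' );
$url = 'http://example.com/search/' . rawurlencode( $city );

echo $encoded_word, "\n";   /* caf%C3%A9 */
echo $url, "\n";            /* http://example.com/search/Isl%C4%81m%C4%81b%C4%81d */

/* Step 3: pass $url to cURL, for example with curl_init( $url ) */
```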

Finally, beware your original idea of converting "Islāmābād" to "Islamabad". Converting a character to an acceptable equivalent requires language-specific knowledge. It isn't necessarily as simple as replacing "ā" with "a". If you do it wrong, you can create a completely different word. Go with the UTF-8 and percent-encoding solution first.

If you'd like to learn more about URL character sets and parsing, see my article PHP tip: How to parse and build URLs.

Nadeau software consulting