HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean, parsable text. PHP's standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.
Table of Contents
- Getting multibyte character string support in PHP
- Decoding HTML character references into multibyte character strings
- Handling multibyte character strings
- Further reading
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML tags, punctuation, symbol characters, and numbers, and break a page down into a keyword list.
HTML's character reference syntax enables a web page to use special characters that aren't supported by the page's normal character encoding. For example, if you use the ISO-8859-1 (Latin-1) encoding common for English, German, Spanish, and other European languages, then the encoding only supports 191 printable characters. That gets you basic punctuation, numbers, and letters with and without some diacritical marks. But there is no character for the Thai Baht currency ฿, the trademark symbol ™, a right arrow →, the Greek letter pi π, the planet Mercury ☿, or a check mark ✓. To add these or thousands of other symbols to a page, you either need to switch to a richer character encoding, like UTF-8, or use HTML character references.
There are three forms for an HTML character reference:
- Name. Enter an ampersand, the name of a character, and a semicolon. Example: &copy; gives ©
- Decimal. Enter an ampersand, a #, the decimal number of a Unicode character, and a semicolon. Example: &#169; gives ©
- Hexadecimal. Enter an ampersand, a #, an x, the hexadecimal number of a Unicode character, and a semicolon. Example: &#xA9; gives ©
The named form of a character reference is called an HTML entity. There are a few hundred of these entities defined in HTML 4. But there are over 100,000 characters in Unicode available through decimal or hexadecimal character references. However, font support for all of these characters is still incomplete on today's Windows, Mac, and Linux systems. Before you use Unicode characters on a web page, do some testing on different browsers for different operating systems.
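As a quick check of the three forms, the sketch below decodes the named, decimal, and hexadecimal references for the copyright symbol and compares each result with the UTF-8 bytes for that character. The variable names are illustrative only.

```php
<?php
// The three reference forms for the copyright symbol (U+00A9).
$forms = array(
    'named'       => '&copy;',
    'decimal'     => '&#169;',
    'hexadecimal' => '&#xA9;',
);

// U+00A9 is the two-byte sequence C2 A9 in UTF-8.
$expected = "\xC2\xA9";

$decoded = array();
foreach ( $forms as $form => $reference ) {
    // Decode each reference into a UTF-8 string.
    $decoded[$form] = html_entity_decode( $reference, ENT_QUOTES, 'UTF-8' );
    echo $form, ': ', ( $decoded[$form] === $expected ? 'same character' : 'different' ), "\n";
}
```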
To do text processing on a web page, you need to convert HTML entities and numeric character references into normal characters. You'll need to use a Unicode-based encoding, such as UTF-8 or UTF-16, that's capable of representing all of these characters. PHP has two functions that can decode character references into Unicode characters: html_entity_decode() and mb_convert_encoding(). Both are easy to use.
The html_entity_decode() function converts HTML character references into characters:
$utf8_text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );
The 1st argument is the text string to decode. The decoded version of the string is returned.
The 2nd argument tells the function how to treat quotes. Use ENT_QUOTES to convert HTML entity single and double quotes back to normal quote characters.
The 3rd argument selects the character set to decode into. The argument is optional. In PHP versions before 5.4 it defaults to "ISO-8859-1" (Latin-1); from PHP 5.4 on, the default is UTF-8. When decoding into Latin-1, html_entity_decode() only decodes character references with Latin-1 equivalents. As I noted above, there are only 191 printable Latin-1 characters. All other character references are left undecoded, which can confuse further text processing. To decode all HTML character references regardless of PHP version, pass a Unicode encoding like "utf-8" as the 3rd argument.
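To see the effect of the 3rd argument, here is a small sketch that decodes the same string into Latin-1 and into UTF-8. The sample text is made up for illustration: &copy; has a Latin-1 equivalent, while &euro; does not.

```php
<?php
// One reference with a Latin-1 equivalent (&copy;) and one without (&euro;).
$text = 'Price: &euro;10 &copy; Acme';

// Decoding into Latin-1 converts &copy; but leaves &euro; untouched.
$latin1 = html_entity_decode( $text, ENT_QUOTES, 'ISO-8859-1' );

// Decoding into UTF-8 converts both references.
$utf8 = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );

// Report whether any undecoded reference is left in each result.
echo 'Latin-1 result still has &euro;: ', ( strpos( $latin1, '&euro;' ) !== false ? 'yes' : 'no' ), "\n";
echo 'UTF-8 result still has an ampersand: ', ( strpos( $utf8, '&' ) !== false ? 'yes' : 'no' ), "\n";
```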
The mb_convert_encoding() function is one of many multibyte character functions in PHP. It can be used to convert between about 60 different encodings. One such encoding is for HTML entities and character references:
$utf8_text = mb_convert_encoding( $text, "utf-8", "HTML-ENTITIES" );
The 1st argument is the text string to decode, and the decoded version is returned.
The 2nd argument is the encoding to return, such as "utf-8".
The 3rd argument selects the encoding to convert from. In this case, "HTML-ENTITIES" tells the function to convert HTML entities into UTF-8 characters.
This function works well for named HTML entities and decimal character references. Unfortunately, as of PHP 5.2, mb_convert_encoding() has a known bug that improperly converts hexadecimal character references into nonsense characters. Until this is fixed and widely available, use html_entity_decode() instead.
Read an HTML file, convert to UTF-8, strip out tags, and decode HTML entities:
/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );

/* The 3rd parenthesized group in the pattern holds the charset name */
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
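In the sequence above, tags are stripped before entities are decoded. The sketch below, using a made-up HTML fragment, suggests why that order matters: a decoded &lt;b&gt; looks like a real tag, so stripping afterwards removes text the author meant to show.

```php
<?php
// A fragment whose visible text mentions a tag by using entities.
$html = '<p>Use the &lt;b&gt; tag for bold text.</p>';

// Order used in this article: strip real tags first, then decode.
$strip_then_decode = html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' );

// Reversed order: the decoded "<b>" is mistaken for a tag and removed.
$decode_then_strip = strip_tags( html_entity_decode( $html, ENT_QUOTES, 'UTF-8' ) );

echo $strip_then_decode, "\n";
echo $decode_then_strip, "\n";
```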
Every web page uses a "character encoding" that can be specified in a web server's response header or in a <meta> tag at the top of the HTML file. In the past, the old ASCII or Latin-1 encodings were widely used. Both of these represent characters with a single byte. A byte can hold at most 256 different values, and some of these are reserved for unprintable "control" characters. ASCII, which uses only 7 of the 8 bits, has just 95 printable characters, and Latin-1 has only 191. This is far too few to represent the many thousands of different letters, digits, and punctuation symbols in the world's languages. There are also character encodings for Chinese, Japanese, and others that use multiple bytes per character, but they too are limited to only those symbols used by the language.
Ideally, everybody will convert their web pages to use the generic UTF-8 character encoding, which can directly support over 100,000 characters in the international Unicode character set. This is slowly happening, but there are still millions of legacy pages using language-specific encodings. And many web authoring tools that support UTF-8 still default to some other encoding. Unless web authors know to change their settings to UTF-8, they continue to create non-UTF-8 pages. And as long as this is the case, HTML character references will be embedded in those pages whenever the author needs to use special characters.
If you're processing HTML text to build a keyword list for a search engine, or filter text for spam and bad language, you'll need to convert character references into normal characters first. If you don't, your text parser will see &copy; as six separate characters instead of the single © copyright symbol.
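A minimal sketch of the problem, assuming a very simple keyword extractor that collects runs of letters. The letter_runs() helper is hypothetical and only for illustration. Without decoding, the entity name "copy" turns up as a keyword.

```php
<?php
// Hypothetical helper: list the runs of letters in a UTF-8 string.
function letter_runs( $utf8_text ) {
    preg_match_all( '/\p{L}+/u', $utf8_text, $matches );
    return $matches[0];
}

$text = 'Copyright &copy; Acme';

// Undecoded, the entity name "copy" looks like a word.
$before = letter_runs( $text );

// Decoded, the copyright symbol is not a letter and drops out.
$after = letter_runs( html_entity_decode( $text, ENT_QUOTES, 'UTF-8' ) );

echo implode( ', ', $before ), "\n";
echo implode( ', ', $after ), "\n";
```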
There are three parts to handling HTML character references:
- Enable PHP's multibyte string extension.
- Decode HTML entities into UTF-8.
- Use multibyte string functions for further processing of the decoded string.
Getting multibyte character string support in PHP
The Multibyte String Extension is available for PHP 4.3 and later. It is essential for handling international text.
This extension must be enabled when compiling the PHP engine. Fortunately, most PHP distributions come with this extension already enabled. You can check your installation by running the phpinfo() function and looking in the Configure Command section at the top of its output for --enable-mbstring.
The extension has several configuration options that you can set in your php.ini file. One feature enables the extension to be set to automatically convert between encodings. However, this can be dangerous because it presumes that all text should be converted in the same way, and without intervention by your own code. It is usually best to leave the extension's automatic conversions disabled. This is the default.
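A script can also check for the extension at run time instead of reading phpinfo() output. This is a minimal sketch; the mbstring_available() helper name is hypothetical.

```php
<?php
// Hypothetical helper: report whether the mbstring extension is usable.
function mbstring_available() {
    return extension_loaded( 'mbstring' ) && function_exists( 'mb_strlen' );
}

if ( mbstring_available() ) {
    // Show the encoding the mb_* functions use when none is passed.
    echo 'mbstring is enabled; internal encoding is ', mb_internal_encoding(), "\n";
} else {
    echo "mbstring is not available in this PHP installation\n";
}
```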
Decoding HTML character references into multibyte character strings
The html_entity_decode() function uses a translation table to convert HTML named entities and character references into returned characters. The 3rd argument to the function selects the translation table to use. If you use the Latin-1 table (the default before PHP 5.4), multibyte character entities will not be translated. To decode all HTML entities and generate multibyte characters, you have to decode into a richer character encoding, like UTF-8:
$utf8_text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );
As I noted earlier, PHP's mb_convert_encoding() function can do the same thing. Unfortunately, as of PHP 5.2 it has a bug that prevents it from converting hexadecimal character references. Since these are quite common, it currently isn't safe to rely on it for decoding.
If the page includes HTML tags, strip them out before decoding HTML entities. Decoding some entities, such as &gt;, generates characters that will confuse tag parsing.
Handling multibyte character strings
The html_entity_decode() function's returned string will include multibyte characters if any HTML entities needed them. Many standard PHP functions are not multibyte character-aware. The strlen() function, for instance, actually counts bytes, not characters, and it will return a wrong answer when used with multibyte characters.
To process multibyte characters, you must use multibyte string functions, such as mb_strlen(). There are multibyte character equivalents for many of PHP's string functions. When using these multibyte string functions, remember to pass "utf-8" as the encoding type on every function call. There is nothing in the string itself to indicate the encoding it is using. If you aren't explicit, the functions will use the default PHP internal encoding, which may not be UTF-8. You could get garbled text.
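The difference between byte counts and character counts is easy to see with a short word. In this sketch the accented letter is written as explicit bytes so the result does not depend on how the source file is saved. The encoding is passed explicitly on each mb_* call.

```php
<?php
// "café" in UTF-8; the é is the two-byte sequence C3 A9.
$word = "caf\xC3\xA9";

$byte_count = strlen( $word );              // counts bytes
$char_count = mb_strlen( $word, 'UTF-8' );  // counts characters

// Other mb_* functions take the encoding the same way.
$upper       = mb_strtoupper( $word, 'UTF-8' );
$first_three = mb_substr( $word, 0, 3, 'UTF-8' );

echo $byte_count, ' bytes, ', $char_count, " characters\n";
```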
While there are no multibyte character equivalents for the preg_* functions, these functions can handle UTF-8 on their own if you add the /u pattern modifier.
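As a rough illustration of what the /u modifier changes, the sketch below counts matches of a single dot with and without it, then splits a UTF-8 phrase into words using the Unicode letter property. The sample strings are made up and written with explicit bytes.

```php
<?php
// "café" in UTF-8.
$word = "caf\xC3\xA9";

// Without /u the dot matches single bytes; with /u it matches whole characters.
$bytes_matched = preg_match_all( '/./', $word, $byte_matches );
$chars_matched = preg_match_all( '/./u', $word, $char_matches );

// With /u, \P{L} (anything that is not a letter) works as a word separator.
$phrase = "na\xC3\xAFve caf\xC3\xA9, d\xC3\xA9j\xC3\xA0 vu";
$words  = preg_split( '/\P{L}+/u', $phrase, -1, PREG_SPLIT_NO_EMPTY );

echo $bytes_matched, ' matches without /u, ', $chars_matched, " with /u\n";
echo count( $words ), " words found\n";
```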