PHP tip: How to strip symbol characters from a web page

Technologies: PHP 4.3+, UTF-8

Most symbol characters, like + = © ™ ← → ☺ ♣ ♠, need to be stripped out of web page text before processing it in a search engine or text analysis tool. International text has thousands of symbol characters, and some should be removed in one context but kept in another. This tip shows how.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, and numbers, and break a page down into a keyword list.

Introduction

For many text analysis tasks, the first steps remove unneeded elements from a web page. This includes stripping HTML tags, scripts, and styles, stripping punctuation characters (such as ! ? & ( ) # @), and sometimes stripping numbers. Stripping symbols is another useful step:

  • Search indexing: People don't search on mathematics, dentistry, or astrology symbols. Instead of the astrology symbol ♋ they'll type the word "cancer". Instead of the clubs card suit symbol ♣ they'll type "clubs". This is probably an artifact of the difficulties in typing these special symbols. But, if people don't search on symbols, there is no need to add them to a search index. Remove them before indexing the page.
  • Keyword extraction: The most important and frequently used words on a page give a rough idea of the page's topic (often used in tag clouds). Smiley faces, arrows, and line drawing characters are fun, but the page's words convey more meaning. Remove the symbols before keyword extraction.
  • Page statistics: Copyright, trademark, and paragraph symbols don't count when calculating the length of a document, its writing complexity, or the grade level of its vocabulary. Remove them first.

Removing symbols simplifies text processing, but it also degrades the results. Removing currency symbols makes prices ambiguous. Removing plus and minus symbols makes balance sheets harder to figure out. Removing hearts, clubs, diamonds, and spades symbols makes card game instructions confusing. And removing mathematics symbols from an algebra lesson leaves a meaningless list of variables.

For tasks that need to remove symbols, the regular expressions in this article do a reasonable job. They keep currency symbols and other number modifiers, but they remove most of the rest. If your task involves specialized content, such as card game instructions, then you'll need to modify this code to exclude more symbols.

Code

The following function uses preg_replace() and Unicode (encoded with UTF-8) to strip symbol characters from international text while handling special cases. The regular expressions are explained later in this article. This function's only argument is the UTF-8 text to strip. The stripped text is returned.

Download: strip_symbols.zip.

/**
 * Strip symbols from text.
 */
function strip_symbols( $text )
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';
 
    $units  = '\\x{00B0}\\x{2103}\\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';
 
    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';
 
    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

Example

Read an HTML file, convert to UTF-8, remove HTML tags, decode HTML entities into UTF-8, and strip out symbols:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );
 
/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+' .
    'content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_NOQUOTES, "UTF-8" );

/* Remove symbols */
$utf8_text = strip_symbols( $utf8_text );

On this input (some older browsers will not display all of these characters properly):

Keep currency symbols $ ¢ £ € ¥.
Remove modifier symbols ` ^ ˚ ˘ ¯.
Remove common mathematics symbols |a + b| = c < d > e ÷ f × g.
Remove more mathematics symbols ∀ E  ∈ R ≠ ∅ ⊄ I.
Keep minus in -$123 $-123 and plus in +$123 $+123 and +123.4e-5.67.
Keep fraction slash in 1⁄2.
Keep URLs http://example.com/~dave/a+b=c and ~user/directories.
Remove other symbols © ® ™ ℀ ⅊ ¶ ℞ ♥ ♣ ♬.
Keep degrees in 0℃ = 32℉ and units 100㎏ and 20,000㎐
Keep CJK radials and strokes ⺁ ⺃ ⺄ ⺅.

Generates this output:

Keep currency symbols $ ¢ £ € ¥.
Remove modifier symbols .
Remove common mathematics symbols a b c d e f g.
Remove more mathematics symbols E R I.
Keep minus in -$123 $-123 and plus in +$123 $+123 and +123.4e-5.67.
Keep fraction slash in 1⁄2.
Keep URLs http://example.com/~dave/a+b=c and ~user/directories.
Remove other symbols .
Keep degrees in 0℃ 32℉ and units 100㎏ and 20,000㎐
Keep CJK radials and strokes ⺁ ⺃ ⺄ ⺅.

Explanation

The old ASCII character encoding supported 128 standard characters and only a few symbols, such as + = $ < >. This was sufficient for basic English text, but lacked the symbols needed for other languages and mathematics, science, and other fields. Unicode addresses this by supporting over 100,000 characters spanning all of the world's languages. Along with international letters and punctuation, Unicode adds thousands of symbols for general use (such as © ® ™ § ¶), currencies (such as $ ¢ £ ¥ €), mathematics (such as ± ÷ ⅀ ∅ ∀ ∞), astrology (such as ☿ ♈ ♋ ♏), dentistry (such as ⏀ ⏄ ⏆), braille (such as ⠔ ⠕ ⠖ ⠗), box drawings (such as ┏ ┣ ┫ ═ ║ ╡), games (such as ♣ ♠ ♡ ♢ ♚ ♞), horizontal and vertical bars (such as ▎ ▌ ▊ █ ▆ ▄ ▂), and other uses (such as ← → ↑ ↓ ↩ ↪ ☂ ☎ ☛ ☯ ☺).

Every character in Unicode is categorized as a letter, number, mark, punctuation, symbol, separator, or other character. To remove all symbols on a web page, a regular expression could just delete all characters in the Unicode symbol category.
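
As a rough sketch of that naive approach, the single expression below deletes every character in the symbol category. The function name strip_all_symbols is made up for illustration, and the \u{...} string escapes assume PHP 7 or later. Note that the currency symbol disappears along with everything else.

```php
<?php
// Hypothetical helper: remove every character in the Unicode
// symbol category (Sc, Sk, Sm, and So), then collapse spaces.
function strip_all_symbols( $text )
{
    $text = preg_replace( '/\p{S}/u', ' ', $text );
    return preg_replace( '/ +/', ' ', $text );
}

// "\u{00A9}" is the copyright sign.
echo strip_all_symbols( "Price: \$12 \u{00A9} 2009" ), "\n";
// prints "Price: 12 2009"
```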

However, there are a few special cases to handle. For instance, to leave prices and other numbers intact, symbol removal needs to skip past currency symbols and the plus and minus signs. For East Asian languages, processing needs to skip past radical and stroke symbols used to assemble ideographs. There are also a few characters outside of the Unicode symbol category that need special handling.

Unicode character categories

PHP's standard preg_replace() function supports regular expressions and the /u pattern modifier to match characters based upon their Unicode category. To match a character in a category, start with \p{ followed by a category code and }. For instance, \p{Sm} matches any Unicode mathematics symbol and \p{Sc} matches any currency symbol.
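
To experiment with category matching, a small sketch like the following can report which symbol category a single character belongs to. The helper name symbol_category is hypothetical, and the \u{...} escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: return the two-letter symbol category code
// of a single UTF-8 character, or null if it is not a symbol.
function symbol_category( $char )
{
    foreach ( array( 'Sc', 'Sk', 'Sm', 'So' ) as $code )
    {
        if ( preg_match( '/^\p{' . $code . '}$/u', $char ) === 1 )
            return $code;
    }
    return null;
}

echo symbol_category( '$' ), "\n";         // Sc (currency)
echo symbol_category( '+' ), "\n";         // Sm (mathematics)
echo symbol_category( '^' ), "\n";         // Sk (modifier)
echo symbol_category( "\u{00A9}" ), "\n";  // So (other): copyright sign
var_dump( symbol_category( 'a' ) );        // NULL: a letter, not a symbol
```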

Below are all 30 Unicode categories, their codes for regular expressions, and a few examples. The symbol categories are the most relevant for symbol character removal. There are also a few characters in the punctuation and other categories that are of interest.

Unicode 'Letter' category
Code Name Examples
Ll Letter, lowercase a b ç ď ĕ ʑ ʘ π й
Lm Letter, modifier ˇ ˆ ๆ ゞ
Lo Letter, other א ก あ ア ꀀ 豈
Lt Letter, titlecase Dž ᾈ ᾨ
Lu Letter, uppercase Æ Δ Ω Ж Ç
Unicode 'Mark' category
Code Name Examples
Mc Mark, spacing combining ூ ௗ ཿ
Me Mark, enclosing ۞ ⃟ ⃞
Mn Mark, nonspacing ̺ ۖ ཹ
Unicode 'Number' category
Code Name Examples
Nd Number, decimal digits 0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Nl Number, letter Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ 〸 〹 〺
No Number, other ¼ ½ ¾ ¹ ³ ² ₄ ₅ ₆ ⑦ ❽ ⑼
Unicode 'Punctuation' category
Code Name Examples
Pc Punctuation, connector _ ‿ ⁀ ⁔ ﹎ ﹏
Pd Punctuation, dash - – — 〜 〰
Pe Punctuation, close ) ] } ⁆ ❱ ﴿ ︶ ︾ 」
Pf Punctuation, final quote » ’ ” ›
Pi Punctuation, initial quote « ‘ “ ‹
Po Punctuation, other ' " # % & ! . : , ? ¿
Ps Punctuation, open ( [ { ⁅ ❰ ﴾ ︿ ︽ 「
Unicode 'Symbol' category
Code Name Examples
Sc Symbol, currency $ ¢ £ € ¥
Sk Symbol, modifier ^ ` ´
Sm Symbol, mathematics + = < >
So Symbol, other § © ® ¶
Unicode 'Separator' category
Code Name Examples
Zl Separator, line  
Zp Separator, paragraph  
Zs Separator, space space, en space, em space
Unicode 'Other' category
Code Name Examples
Cc Other, control tab, linefeed, carriage return
Cf Other, format  
Cn Other, not assigned  
Co Other, private use Apple logo
Cs Other, surrogate  

Unicode.org has definitive information about Unicode, including Unicode code charts listing all of Unicode's characters. However, FileFormat.info has more user-friendly Unicode information, including Unicode Character Categories listing all 30 categories and links to lists of characters within them. Wikipedia also has many good articles on Unicode, including Mapping of Unicode characters and Unicode symbols.

Removing currency symbols — not

Currency symbols act as number modifiers, adding meaning to otherwise bare digits. Depending upon national conventions, a currency's symbol may come immediately before or after a number, such as $12 and 34¢. If text processing leaves numbers in the text, then currency symbols should be left in the text as well.

This article's code does not remove currency symbols. Instead, see the article on How to strip numbers from web page text, which provides code to strip out currency symbols along with the numbers they modify.
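
To illustrate how a currency symbol attaches to the digits next to it, the sketch below finds numbers with a currency symbol directly before or after them. The function name find_prices and its pattern are assumptions made for this example, not part of the article's code, and the \u{...} escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: find decimal digits with a currency symbol
// directly before or directly after them.
function find_prices( $text )
{
    preg_match_all( '/\p{Sc}\p{Nd}+|\p{Nd}+\p{Sc}/u', $text, $matches );
    return $matches[0];
}

// "\u{00A2}" is the cent sign.
print_r( find_prices( "\$12 and 34\u{00A2}" ) );
// finds two matches: "$12" and "34" followed by the cent sign
```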

Removing modifier symbols

Unicode has 99 different modifier symbols used to augment other symbols. For example, arrow head modifiers can be added to line drawing symbols to create arrows. Remove all of these modifiers by matching \p{Sk}.

// Remove modifier symbols
$text = preg_replace( '/\p{Sk}/u', ' ', $text );

Limitations: The "Grave accent" (`) is technically a modifier symbol, but it is sometimes misused as a left quote `like this` (such as in programming languages like Perl). In this use, the accent is a punctuation mark, not a symbol, and it is incorrectly removed by the above expression.
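
If backtick quoting matters for your content, one possible variation (not used in this article's function) is to skip the grave accent with a negative lookahead. The function name below is hypothetical.

```php
<?php
// Hypothetical variation: remove modifier symbols but leave the
// grave accent (`) alone so `quoted text` survives.
function strip_modifiers_keep_grave( $text )
{
    return preg_replace( '/(?!`)\p{Sk}/u', ' ', $text );
}

echo strip_modifiers_keep_grave( 'Use `ls` to list; 2^3 is 8' ), "\n";
// prints "Use `ls` to list; 2 3 is 8"
```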

Removing mathematical symbols

Unicode defines 914 mathematical symbols, including the familiar + = < > ± ÷ and a huge number of special mathematical symbols like ∞ ∑ ∅ ∀ ∴ ∪ ∩ ∫. All of them are matched by \p{Sm}. Wikipedia has a good reference table of mathematical symbols and what they mean, and a basic article on Unicode mathematical operators.

During symbol removal, all of these mathematical symbols can be removed with a few exceptions:

  • The plus and minus signs are number modifiers. To keep numbers intact, these characters should be removed if followed by a space, or when preceded by a space and not immediately followed by a number (\p{N}) or currency symbol (\p{Sc}), as in −$123.45 and $-123.45.
  • The fraction slash (solidus) (which looks like a slash, but isn't) is used to separate the numerator and denominator in a fraction, such as 1⁄2. To keep fractions in the text, fraction slashes should not be removed.
  • The plus, equals, and tilde symbols can be used in URLs. To avoid fragmenting URLs, these characters should be removed only when not embedded within text. Remove equals if preceded or followed by a space, and plus and tilde if followed by a space. For tilde, this will retain its use when referring to Linux and Mac OS X home directories, such as ~user/Desktop.

Unicode details to note:

  • There are five Unicode plus signs, three of various widths plus one each as a subscript and a superscript: + ﹢ ＋ ₊ ⁺.
  • There are three minus signs with one at a normal size and one each as a subscript and superscript: ‒ ₋ ⁻. The dash character found in ASCII and on standard keyboards is widely misused as a minus sign, but typographically it is punctuation. The dash character is not matched by \p{Sm} below, while the three minus symbols are.
$plus  = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus = '\x{2012}\x{208B}\x{207B}';

// Remove mathematics symbols except + - = ~ and fraction slash
$text = preg_replace(
    '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u', ' ', $text );

// Remove + - if space before and not number or currency after
$text = preg_replace(
    '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u', ' ', $text );

// Remove = if space before
$text = preg_replace( '/((?<= )|^)=+/u', ' ', $text );

// Remove + - = ~ if space after
$text = preg_replace( '/[' . $plus . $minus . '=~]+((?= )|$)/u', ' ', $text );

Limitations: This code removes < and > characters. Be sure to strip out HTML and XML tags before stripping out symbols or it will be impossible to remove the tags afterwards. This code also removes + = ~ at the end of URLs, which is a legal use, but very rare.
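
To check the rules in this section against a few inputs, the sketch below wraps the four mathematics expressions, plus the space cleanup described later, in one function. The function name strip_math_symbols is made up for this example.

```php
<?php
// Sketch only: the four mathematics expressions applied in order,
// followed by a cleanup of consecutive spaces.
function strip_math_symbols( $text )
{
    $plus  = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus = '\x{2012}\x{208B}\x{207B}';

    $patterns = array(
        // Remove mathematics symbols except + - = ~ and fraction slash
        '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
        '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
        '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
        '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove consecutive spaces
        '/ +/',
    );
    return preg_replace( $patterns, ' ', $text );
}

echo strip_math_symbols( 'a + b = c < d' ), "\n";
// prints "a b c d"
echo strip_math_symbols( 'Total: +123 and $+5' ), "\n";
// prints "Total: +123 and $+5"
echo strip_math_symbols( 'http://example.com/~dave/a+b=c' ), "\n";
// prints "http://example.com/~dave/a+b=c"
```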

Removing other symbols

There are 2,958 more symbol characters in Unicode's other symbol category matched by \p{So}. These include symbols like the copyright and trademark signs, as well as arrows, drawing characters, smiley faces, astrology symbols, and many many more. With a few exceptions, all of these can be removed.

Special handling is needed to keep symbols that are number modifiers:

  • Symbols for temperatures in Celsius and Fahrenheit.
  • Symbols for angular degrees.
  • Symbols for units of measure, such as ㎧ ㎨ ㎡ ㏈ ㏕ ㎷.

The other symbols category also contains a large number of special radical and stroke characters used for East Asian languages. While Unicode strives to include specific characters for every ideograph in Chinese, Japanese, Korean, and others, the radical and stroke symbols enable missing characters to be built up one stroke at a time. Since these characters build words, they should not be removed during symbol removal.

There are a lot of symbols to handle specially. Instead of listing them individually, a regular expression can refer to them as blocks of consecutive characters:

  • Units of measure:
    • \x{00B0} for the degree symbol.
    • \x{2103} for the degree Celsius symbol.
    • \x{2109} for the degree Fahrenheit symbol.
    • \x{23CD} for the square foot symbol.
    • \x{32CC} to \x{32CE} for units symbols.
    • \x{3300} to \x{3357} for ideographic units symbols.
    • \x{3371} to \x{33DF} for more units symbols.
    • \x{33FF} for the gallon symbol.
  • Ideograph radicals, strokes, symbols, and descriptors:
    • \x{2E80} to \x{2EF3} for CJK radicals.
    • \x{2F00} to \x{2FD5} for Kangxi radicals.
    • \x{2FF0} to \x{2FFB} for ideographic descriptors.
    • \x{3037} to \x{303F} for miscellaneous ideographic indicators.
    • \x{3190} to \x{319F} for ideographic annotation marks.
    • \x{31C0} to \x{31CF} for CJK strokes.
    • \x{32C0} to \x{32CB} for ideograph month symbols.
    • \x{3358} to \x{3370} for ideograph time symbols.
    • \x{33E0} to \x{33FE} for ideograph day symbols.
    • \x{A490} to \x{A4C6} for Yi radicals.
// Remove other symbols except units of measure and ideograph parts
$units  = '\x{00B0}\x{2103}\x{2109}\x{23CD}';
$units .= '\x{32CC}-\x{32CE}';
$units .= '\x{3300}-\x{3357}';
$units .= '\x{3371}-\x{33DF}';
$units .= '\x{33FF}';
 
$ideo  = '\x{2E80}-\x{2EF3}';
$ideo .= '\x{2F00}-\x{2FD5}';
$ideo .= '\x{2FF0}-\x{2FFB}';
$ideo .= '\x{3037}-\x{303F}';
$ideo .= '\x{3190}-\x{319F}';
$ideo .= '\x{31C0}-\x{31CF}';
$ideo .= '\x{32C0}-\x{32CB}';
$ideo .= '\x{3358}-\x{3370}';
$ideo .= '\x{33E0}-\x{33FE}';
$ideo .= '\x{A490}-\x{A4C6}';
 
$text = preg_replace( '/\p{So}(?<![' . $units . $ideo . '])/u',
    ' ', $text );
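
As a quick check of which characters survive, the sketch below wraps this expression in a function, using the ranges from the bullet list above. The function name strip_other_symbols is made up for this example, and the \u{...} escapes assume PHP 7 or later.

```php
<?php
// Sketch only: remove "other" symbols except units of measure and
// ideograph parts, then tidy up the spaces left behind.
function strip_other_symbols( $text )
{
    $units = '\x{00B0}\x{2103}\x{2109}\x{23CD}'
           . '\x{32CC}-\x{32CE}\x{3300}-\x{3357}'
           . '\x{3371}-\x{33DF}\x{33FF}';

    $ideo  = '\x{2E80}-\x{2EF3}\x{2F00}-\x{2FD5}'
           . '\x{2FF0}-\x{2FFB}\x{3037}-\x{303F}'
           . '\x{3190}-\x{319F}\x{31C0}-\x{31CF}'
           . '\x{32C0}-\x{32CB}\x{3358}-\x{3370}'
           . '\x{33E0}-\x{33FE}\x{A490}-\x{A4C6}';

    $text = preg_replace(
        '/\p{So}(?<![' . $units . $ideo . '])/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

// U+338F is the kilogram unit, U+00A9 the copyright sign,
// U+2663 the clubs suit, and U+00B0 the degree sign.
$sample = "100\u{338F} \u{00A9} 2009 \u{2663} 25\u{00B0}";
echo strip_other_symbols( $sample ), "\n";
// keeps the kilogram and degree signs; drops the copyright and clubs signs
```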

Limitations: Many symbol characters are abbreviations, such as ℅ for "Care of", or shorthands, such as © for "Copyright". Instead of simply removing them, some suggest that these should be replaced with the words they stand for. There are also compound symbols, like ㎧, that could be replaced by their component characters, like m, /, and s or even meters per second. Perhaps the astrology and planet symbols, like ♋ and ♁, could be replaced with their corresponding words, like Cancer and Earth, respectively. And what about replacing Braille dot patterns with words? Or replacing the East Asian language radicals and strokes with their component meanings?

But this has now crossed the line into language translation. Many of these symbols have meanings that are language-specific and that change based upon their context. Abbreviations and compound symbols may need to expand into different words in different languages.

For purposes of this article, and search indexing, it is sufficient to simply remove most of these symbols. They are usually in support of the core meaning of the words around them, so removing them and focusing upon those neighboring words should do well enough.
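
For an application that does want the replacement approach, a minimal English-only sketch might look like the following. The function name expand_symbols and the word list are purely illustrative assumptions, and the \u{...} escapes assume PHP 7 or later. A real table would have to vary by language and context, which is exactly the difficulty described above.

```php
<?php
// Hypothetical English-only mapping from a few symbols to words.
function expand_symbols( $text )
{
    $words = array(
        "\u{00A9}" => ' copyright ',  // copyright sign
        "\u{2105}" => ' care of ',    // care of sign
        "\u{33A7}" => ' m/s ',        // square m over s
        "\u{264B}" => ' Cancer ',     // astrology sign
    );

    // strtr() swaps each symbol for its words; then tidy the spaces.
    return trim( preg_replace( '/ +/', ' ', strtr( $text, $words ) ) );
}

echo expand_symbols( "\u{00A9} 2009 Example Co." ), "\n";
// prints "copyright 2009 Example Co."
```

Run this before symbol removal, since afterwards the symbols are no longer there to expand.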

Removing private use symbols

The private use characters in Unicode's other category are all vendor- or task-specific symbols, such as Apple's logo. Remove these by matching \p{Co}.

// Remove other private use characters
$text = preg_replace( '/\p{Co}/u', ' ', $text );

Removing consecutive spaces

Each of the above expressions replaces symbols with spaces. This avoids joining together words adjacent to the symbols, but it can leave multiple consecutive spaces. To clean up, collapse each run of spaces into a single space.

// Remove consecutive spaces
$text = preg_replace( '/ +/', ' ', $text);
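
The expression above matches only the ordinary space character. Web text often also contains tabs, newlines, and Unicode spaces such as the no-break space. As a possible variation (not part of this article's function; the name collapse_whitespace is made up here, and the \u{...} escapes assume PHP 7 or later), the sketch below collapses any run of white space or Unicode separators.

```php
<?php
// Hypothetical variation: collapse runs of any white space,
// including tabs, newlines, and Unicode separator characters.
function collapse_whitespace( $text )
{
    return preg_replace( '/[\s\p{Z}]+/u', ' ', $text );
}

// "\u{00A0}" is a no-break space and "\u{2003}" is an em space.
echo collapse_whitespace( "a \t\n b\u{00A0}\u{2003}c" ), "\n";
// prints "a b c"
```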

Other issues

For this to work reliably:

  • Before removing symbols, use the web page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in the UTF-8 character encoding. Running preg_replace() with the /u pattern modifier on non-UTF-8 text sometimes causes the function to abort and return an empty string or the original text unprocessed.
  • Strip HTML tags and decode HTML entities first. This gives you pure UTF-8 text with all HTML-specific symbols already removed.
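
To guard against malformed input, a defensive wrapper along these lines can help. This is a sketch under assumptions: the function name replace_utf8 is hypothetical, it targets PHP 7 or later (where preg_replace() signals a failure by returning null), and it relies on preg_match() returning false when a /u pattern is given malformed UTF-8.

```php
<?php
// Hypothetical wrapper: run a /u expression only on valid UTF-8 and
// report failures instead of silently losing the text.
function replace_utf8( $pattern, $replacement, $text )
{
    // An empty /u pattern matches any well-formed UTF-8 string and
    // makes preg_match() return false on a malformed one.
    if ( preg_match( '//u', $text ) !== 1 )
        return false;

    $result = preg_replace( $pattern, $replacement, $text );
    return ( $result === null ) ? false : $result;
}

// Valid UTF-8: the copyright sign is replaced by a space.
var_dump( replace_utf8( '/\p{So}/u', ' ', "a\u{00A9}b" ) );

// Malformed UTF-8 (a lead byte without its continuation byte).
var_dump( replace_utf8( '/\p{So}/u', ' ', "\xC3\x28" ) );  // bool(false)
```

A caller can then treat false as a cue to re-check the page's declared character set or to skip the page.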
