PHP tip: How to strip numbers from a web page

Technologies: PHP 4.3+, UTF-8

Numbers in prices, quantities, dates, times, phone numbers, and addresses may not be of interest when processing a web page for a PHP search engine or keyword analysis tool. In international text there are around 900 different types of digits, currency symbols, and units of measure marks that need to be removed. This tip shows how to remove numbers and number-related characters.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, and symbol characters, and break a page down into a keyword list.

Introduction

Text processing starts by removing the least important parts of a web page. This may include stripping HTML tags, scripts, and styles, stripping punctuation characters (such as ! ? & ( ) # @), and stripping symbol characters (such as smileys, arrows, and mathematics symbols). This only leaves words and numbers. Stripping out numbers is also useful:

  • Search indexing: Numbers make poor search terms. Literal searching on "12.30" won't find "12.3", despite identical meanings. If people don't use numbers in searches, then remove them from web page text before adding the remaining content to the search index. This speeds up page indexing and reduces the storage needed for the index.
  • Keyword extraction: The most important and frequently used words on a page give a rough idea of the page's topic (often used in tag clouds). On a financial report, the numbers add valuable detail, but in everyday text numbers are secondary and can be safely removed during keyword extraction.
  • Page statistics: Numbers aren't useful when computing the grade level of a page's vocabulary. They may add to the word count of a page, but they are probably not useful in estimating the page's writing complexity. Remove the numbers first.

Removing numbers simplifies text processing, but it can remove necessary detail. For a web page listing CDs for sale, removing its numbers strips away release dates, track lengths, and prices. What's left may not be very useful. Similarly, removing numbers from a financial report, an algebra lesson, or a list of IP addresses leaves little of value on the page.

For those tasks where number removal is desirable, the regular expressions in this article do a reasonable job. They remove digits, decimal and thousands separators, plus and minus signs, and currency symbols.

Code

The following function uses preg_replace() and Unicode (encoded with UTF-8) to strip numbers and number-related characters from international text. The regular expressions are explained later in this article. This function's only argument is the UTF-8 text to strip. The stripped text is returned.

Download: strip_numbers.zip.

/**
 * Strip numbers from text.
 */
function strip_numbers( $text )
{
    $urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
    $notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
    $predelim      = '((?<=[^' . $notdelim . '])|^)';
    $postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
    $fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
    $comma         = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep       = '\x{066B}\x{066C}';
    $numseparators = $fullstop . $comma . $arabsep;
    $plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
    $slash         = '[\/\x{2044}]';
    $colon         = ':\x{FE55}\x{FF1A}\x{2236}';
    $units         = '%\x{FF05}\x{FE6A}\x{2030}\x{2031}';
    $units        .= '\x{00B0}\x{2103}\x{2109}\x{23CD}';
    $units        .= '\x{32CC}-\x{32CE}';
    $units        .= '\x{3300}-\x{3357}';
    $units        .= '\x{3371}-\x{33DF}';
    $units        .= '\x{33FF}';
    $percents      = '%\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $ampm          = '([aApP][mM])';
 
    $digits        = '[\p{N}' . $numseparators . ']+';
    $sign          = '[' . $plus . $minus . ']?';
    $exponent      = '([eE]' . $sign . $digits . ')?';
    $prenum        = $sign . '[\p{Sc}#]?' . $sign;
    $postnum       = '([\p{Sc}' . $units . $percents . ']|' . $ampm . ')?';
    $number        = $prenum . $digits . $exponent . $postnum;
    $fraction      = $number . '(' . $slash . $number . ')?';
    $numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' .
        $fraction . ')*';
 
    return preg_replace(
        array(
        // Match delimited numbers
            '/' . $predelim . $numpair . $postdelim . '/u',
        // Match consecutive white space
            '/ +/u',
        ),
        ' ',
        $text );
}

Example

Read an HTML file, convert to UTF-8, remove HTML tags, decode HTML entities into UTF-8, and strip out numbers:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+' .
    'content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_NOQUOTES, "UTF-8" );

/* Remove numbers */
$utf8_text = strip_numbers( $utf8_text );

On this input:

Remove standalone 1 23.4 5,678.9 and malformed numbers 1...,2.3,,,,45..
Remove IP addresses 127.0.0.1.
Keep commas, and full stops in sentences.  And domain.com names.
Keep numbers in URLs http://site5.example.com:80/get?a=(123)&q=45.
Remove signs -12 +34 and exponents -1.2e-34.5 +6.78E90.
Keep hyphens and non-number dashes - like this - or -- this.
Remove fractions 1/2 -3/4 5/-6 and ranges 2006-2007 -1--2.
Keep slashes in /file/names.txt.
Remove ratios 1:2 -2:-3 1/2:3/4 and phone numbers 1-800-555-1234
Keep colons used like: that, or http://this.com:80.
Remove number signs #1 and times 12:00am 4:30PM-5:00pm
Keep # without a number and am or pm.
Remove percents 10% per-mille 10‰ and per-ten-thousand 10‱
Keep percents in URLs http://example.com/a%20space.txt.
Remove units symbols 12㎧ 32℉
Remove currency symbols $1 2¢ 3.4€
Keep stand-alone $ and € signs.

Generates this output:

Remove standalone and malformed numbers 
Remove IP addresses 
Keep commas, and full stops in sentences. And domain.com names.
Keep numbers in URLs http://site5.example.com:80/get?a=(123)&q=45.
Remove signs and exponents 
Keep hyphens and non-number dashes - like this - or -- this.
Remove fractions and ranges 
Keep slashes in /file/names.txt.
Remove ratios and phone numbers 
Keep colons used like: that, or http://this.com:80.
Remove number signs and times 
Keep # without a number and am or pm.
Remove percents per-mille and per-ten-thousand 
Keep percents in URLs http://example.com/a%20space.txt.
Remove units symbols 
Remove currency symbols 
Keep stand-alone $ and € signs.

Explanation

While most of the world's languages use the Arabic numerals 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9, there are additional digit symbols used in other languages, mathematics, and special cases. Unicode includes 290 such digit symbols. There are also 210 more numeric letter symbols, such as the Roman numerals Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ, and those in Old Persian, Cuneiform, and others. Add another 336 special symbols for fractions (such as ¼ ½ ¾), superscripts (such as ¹ ³ ²), subscripts (such as ₄ ₅ ₆), circled and parenthesized digits (such as ⑦ ❽ ⑼), and more.

Numbers are more than digits. They also include number modifiers that give context to a symbol, like currency symbols (such as $ ¢ £ ¥ €), units of measure abbreviations (such as ㎓ ㎖ ㎞ ㎲ ㎧), plus, minus, percent, and so on. Numbers also come in groups, such as hours and minutes in 12:00, and digit groups in a phone number, like 1-800-555-1234, or Internet address, like 127.0.0.1.

To properly remove numbers, code must consider context. A dash should be removed when used as a minus or in a numeric range, like 2006-2007, but not when used as a hyphen. A colon should be removed in a ratio, like 4:3, but not when used as a phrase delimiter in a sentence. A full stop (period) should be removed as a decimal separator in 123.45, but left alone at the end of a sentence or between words in a domain name, like example.com.
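As a minimal sketch of why context matters, the snippet below removes a dash only when it sits between two runs of digits. It uses a deliberately simplified pattern, not the full expression developed later in this article, and the function name strip_simple_ranges() is hypothetical.

```php
<?php
// Hypothetical helper: remove digit-dash-digit ranges such as 2006-2007,
// but leave hyphens between letters alone.
function strip_simple_ranges( $text )
{
    // Digits, one dash (\p{Pd}), digits. The lookbehind and lookahead
    // reject matches that touch a letter or another number character,
    // so "well-known" and "SP1-beta" are left intact.
    $range = '/(?<![\p{L}\p{N}])\p{Nd}+\p{Pd}\p{Nd}+(?![\p{L}\p{N}])/u';
    return preg_replace( $range, ' ', $text );
}

echo strip_simple_ranges( 'In 2006-2007 the well-known SP1-beta shipped.' ), "\n";
```

The range is replaced by a space, while the hyphens in "well-known" and "SP1-beta" survive because they are not surrounded by stand-alone digit runs.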

Unicode categories

PHP's standard preg_replace() function supports Perl-compatible regular expressions. With the /u pattern modifier, the pattern and text are treated as UTF-8, and characters can be matched by their Unicode category. To match a character in a category, write \p{ followed by a category code and }. For instance, \p{Nd} matches any Unicode decimal digit and \p{Sc} matches any currency symbol.
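As a small illustration of category matching, the sketch below counts the characters of one category in a UTF-8 string. The helper name count_category() is an assumption made for this example and is not part of the article's code.

```php
<?php
// Hypothetical helper: count the characters in $text that belong to the
// given Unicode category code (for example 'Nd' or 'Sc').
function count_category( $text, $category )
{
    // The /u modifier makes PCRE treat the pattern and the text as UTF-8.
    return preg_match_all( '/\p{' . $category . '}/u', $text, $matches );
}

$sample = 'Price: $12 or €9';
echo count_category( $sample, 'Nd' ), "\n";   // decimal digits: 1, 2, 9
echo count_category( $sample, 'Sc' ), "\n";   // currency symbols: $ and €
```

The category code is passed in as a plain string, so the same helper can be used to experiment with any of the codes listed below.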

Below are all 30 Unicode categories, their codes for regular expressions, and a few examples. The number categories are the most relevant for number removal, of course, but there are also a few characters in the symbol and punctuation categories that are of interest.

Unicode 'Letter' category
Code Name Examples
Ll Letter, lowercase a b ç ď ĕ ʑ ʘ π й
Lm Letter, modifier ˇ ˆ ๆ ゞ
Lo Letter, other א ก あ ア ꀀ 豈
Lt Letter, titlecase Dž ᾈ ᾨ
Lu Letter, uppercase Æ Δ Ω Ж Ç
Unicode 'Mark' category
Code Name Examples
Mc Mark, spacing combining ூ ௗ ཿ
Me Mark, enclosing ۞ ⃟ ⃞
Mn Mark, nonspacing ̺ ۖ ཹ
Unicode 'Number' category
Code Name Examples
Nd Number, decimal digits 0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Nl Number, letter Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ 〸 〹 〺
No Number, other ¼ ½ ¾ ¹ ³ ² ₄ ₅ ₆ ⑦ ❽ ⑼
Unicode 'Punctuation' category
Code Name Examples
Pc Punctuation, connector _ ‿ ⁀ ⁔ ﹎ ﹏
Pd Punctuation, dash - – — 〜 〰
Pe Punctuation, close ) ] } ⁆ ❱ ﴿ ︶ ︾ 」
Pf Punctuation, final quote » ’ ” ›
Pi Punctuation, initial quote « ‘ “ ‹
Po Punctuation, other ' " # % & ! . : , ? ¿
Ps Punctuation, open ( [ { ⁅ ❰ ﴾ ︿ ︽ 「
Unicode 'Symbol' category
Code Name Examples
Sc Symbol, currency $ ¢ £ € ¥
Sk Symbol, modifier ^ ` ´
Sm Symbol, mathematics + = < >
So Symbol, other § © ® ¶
Unicode 'Separator' category
Code Name Examples
Zl Separator, line  
Zp Separator, paragraph  
Zs Separator, space space, en space, em space
Unicode 'Other' category
Code Name Examples
Cc Other, control tab, linefeed, carriage return
Cf Other, format  
Cn Other, not assigned  
Co Other, private use Apple logo
Cs Other, surrogate  

Unicode.org has definitive information about Unicode, including Unicode code charts listing all of Unicode's characters. However, FileFormat.info has more user-friendly Unicode information, including Unicode Character Categories listing all 30 categories and links to lists of characters within them. Wikipedia also has many good articles on Unicode, including Mapping of Unicode characters and Punctuation.

Removing number digits

Unicode decimal digits are matched by \p{Nd}, number letters by \p{Nl}, and other number characters by \p{No}. Characters in all three categories are matched by \p{N}.

All of these number characters can be removed for stand-alone numbers, like 123, but not when they're embedded within words, such as:

  • Names and technical terms like Lucky-7, SP1, and v1.0.
  • Domain names and email addresses like user123@site5.example.com.
  • URLs like http://b4.com:80/v1.htm.
  • File names like /usr/local/bin/php5 or C:\Program Files\Acrobat8
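To see why embedded digits need protecting, here is a minimal sketch of the naive approach, which deletes every number character regardless of context. The function name strip_all_number_chars() is made up for this illustration.

```php
<?php
// Naive approach: delete every Unicode number character, wherever it is.
function strip_all_number_chars( $text )
{
    return preg_replace( '/\p{N}+/u', '', $text );
}

echo strip_all_number_chars( 'Install SP1 from http://b4.com:80/v1.htm' ), "\n";
// prints "Install SP from http://b.com:/v.htm"
```

The product name and the URL are both damaged, which is the problem the delimiter rules in the rest of this section are meant to avoid.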

A removable number is a sequence of digits, full stops (periods), and commas, delimited from the surrounding text. Valid delimiters are white space (separators and control characters), and most punctuation and symbols. For the regular expression below, it is more convenient to list characters that are not number delimiters. These include letters, marks (such as accents), numbers themselves, connectors (such as underscores), dashes, and web characters like @ : / \ (see the RFC 3986 Uniform Resource Identifier (URI) specification for a full list of web characters).

Unicode has several variants for decimal and thousands separator characters embedded within a number. All of these may occur in formatted text:

                            Normal        Small         Fullwidth
Full stop (period)          \x{002E} = .  \x{FE52} = ﹒  \x{FF0E} = .
Comma                       \x{002C} = ,  \x{FE50} = ﹐  \x{FF0C} = ,
Arabic decimal separator    \x{066B} = ٫
Arabic thousands separator  \x{066C} = ٬

The expression below removes numbers that are delimited in this way:

// Remove delimited numbers
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
 
$digits        = '[\p{N}' . $numseparators . ']+';
 
$text = preg_replace( '/' . $predelim . $digits . $postdelim . '/u',
    ' ', $text );

Limitations: Several punctuation characters in URLs also occur often in normal text, such as the full stop (period), comma, colon, and parentheses. To keep URLs intact, these characters are not number delimiters. While this keeps numbers in a URL like http://example.com/123:4,5(6).txt, it also prevents their removal when they are immediately preceded or followed by these punctuation characters. For instance, a number in parentheses, like (123), or one followed by a question mark, like 123?, is not removed. A fix for this is to remove punctuation used in a non-URL context first, but that is beyond the scope of this article. Instead, see the article on stripping punctuation characters and use its code first, before removing numbers.

Also, hex numbers are not removed. This would require extending the digit list to include A-F, but that would also remove non-hex words that happen to use the same letters, such as "Bee", "Cab", "Feed", etc. It is probably possible to extend this expression to support specific forms of hex numbers, such as #ABC or %xAB or &#xABCD.
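As a rough sketch of that idea, the function below removes only two explicit hex notations and leaves ordinary words alone. The function name strip_hex_forms() and the choice of notations (HTML hexadecimal character references and CSS-style color codes) are assumptions made for this illustration. It is not part of strip_numbers().

```php
<?php
// Hypothetical helper: remove a few explicit hex notations only.
function strip_hex_forms( $text )
{
    return preg_replace(
        array(
            // HTML hexadecimal character references such as &#xABCD;
            '/&#x[0-9A-Fa-f]+;?/',
            // Color-style codes such as #ABC or #AABBCC that are not
            // part of a longer word
            '/(?<![\w#])#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?!\w)/',
        ),
        ' ',
        $text );
}

echo strip_hex_forms( 'A Bee on #FFF, a Cab on #a0b1c2.' ), "\n";
```

Because a leading # is required, words like "Bee" and "Cab" are kept. A tag such as "#Bee" would still be removed, so this remains a heuristic.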

 

Removing signs and exponents

A plus or minus sign may precede a number. An exponent in scientific notation may follow a number, as in -1.2e-34. The number and exponent each may have a plus or minus sign.

Unicode includes five different plus signs: normal, small, fullwidth, subscript, and superscript. All of them should be removed when followed by a number.

Unicode also includes three minus signs: normal, subscript, and superscript. Technically, a dash is not a minus sign, but a dash is easier to type on a standard keyboard. All 18 dash symbols, including long and short dashes, can be matched as a category with \p{Pd}. These dashes, and the minus signs, should be removed when followed by a number.

       Normal        Small         Fullwidth     Subscript     Superscript
Plus   \x{002B} = +  \x{FE62} = ﹢  \x{FF0B} = +  \x{208A} = ₊  \x{207A} = ⁺
Minus  \x{2212} = −                              \x{208B} = ₋  \x{207B} = ⁻

The expression below extends the previous one to now include numbers with optional signs and exponents:

// Remove delimited numbers with signs and exponents
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
  
$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;
 
$text = preg_replace( '/' . $predelim . $number . $postdelim . '/u',
    ' ', $text );
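To check which characters the sign lists above accept, a small test harness like the following can help. The $plus and $minus strings are copied from the expression above, while the helper name has_leading_sign() is hypothetical.

```php
<?php
// Sign character lists copied from the expression above.
$plus  = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus = '\x{2212}\x{208B}\x{207B}\p{Pd}';

// Hypothetical helper: true if $text starts with a sign followed by a
// number character.
function has_leading_sign( $text, $plus, $minus )
{
    return preg_match( '/^[' . $plus . $minus . ']\p{N}/u', $text ) === 1;
}

// Keyboard dash, Unicode minus, plus, superscript plus, unsigned, no digit.
$samples = array( '-12', '−12', '+34', '⁺5', '12', '-x' );
foreach ( $samples as $sample )
{
    $label = has_leading_sign( $sample, $plus, $minus ) ? 'signed' : 'not signed';
    echo $sample, ': ', $label, "\n";
}
```

The first four samples are reported as signed. The last two are not, since one has no sign and the other has no digit after the dash.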

Removing fractions

A fraction is represented as a number, a slash, and another number. Either or both numbers may have a sign or exponent. While a simple slash character is often used, Unicode also defines a special "fraction slash" (\x{2044}) that looks the same but clarifies the semantics.

The expression below extends the previous one to include single numbers and a pair of numbers in a fraction:

// Remove delimited numbers with signs, exponents, and fractions
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';
 
$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;
$fraction      = $number . '(' . $slash . $number . ')?';
 
$text = preg_replace( '/' . $predelim . $fraction . $postdelim . '/u',
    ' ', $text );

Limitations: When this code encounters a malformed fraction with an embedded space, like 1 / 2, it treats the space as a number delimiter and removes the 1 and 2, while leaving the slash.
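One possible workaround, sketched here under the assumption that it runs as a separate pass before number removal, is to close up spaces around a slash that sits between two digits. The function name join_spaced_fractions() is hypothetical.

```php
<?php
// Hypothetical pre-pass: turn "1 / 2" into "1/2" so that the fraction
// expression can match it as a single unit.
function join_spaced_fractions( $text )
{
    // A slash or fraction slash with spaces on both sides, but only when
    // a decimal digit comes directly before and after those spaces.
    return preg_replace(
        '/(?<=\p{Nd}) +([\/\x{2044}]) +(?=\p{Nd})/u', '$1', $text );
}

echo join_spaced_fractions( 'Mix 1 / 2 cup, and / or more.' ), "\n";
// prints "Mix 1/2 cup, and / or more."
```

A slash between words, as in "and / or", is left untouched because no digits surround it.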

Removing number ranges, ratios, and telephone numbers

A range is a number, a dash, and another number, such as 2006-2008. Either number could be a fraction, signed, or have an exponent. Any of the various dashes and minus signs could be used between the numbers.

A telephone number is a sequence of digit groups, separated by a space, dash, or full stop, such as 800-555-1234. International numbers may add a country code, with or without a plus, such as +39.055.555.123.

A ratio is a pair of numbers separated by a colon, such as 1:5. Unicode has several different sizes of colon, plus a special "ratio" character that looks the same but clarifies semantics.

       Normal        Small         Fullwidth
Colon  \x{003A} = :  \x{FE55} = ﹕  \x{FF1A} = :
Ratio  \x{2236} = ∶

The expression below extends the previous one to include ranges and ratios:

// Remove delimited numbers with signs, exponents, fractions
// ranges, and ratios
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';
$colon         = ':\x{FE55}\x{FF1A}\x{2236}';
 
$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;
$fraction      = $number . '(' . $slash . $number . ')?';
$numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' . $fraction . ')*';
 
$text = preg_replace( '/' . $predelim . $numpair . $postdelim . '/u',
    ' ', $text );

Removing number modifiers

Several characters found next to or embedded within a number should be removed along with the number, including:

  • A currency symbol before or after a number.
  • A number sign (#) before a number
  • A percent, per thousand, or per ten thousand sign after a number.
  • A units of measure symbol after a number.
  • An AM or PM after a number.

All currency symbols can be matched using the Unicode currency symbol category \p{Sc}. The remaining characters need to be listed explicitly:

  • Units of measure in the Unicode other symbols category:
    • \x{00B0} for the degree symbol.
    • \x{2103} for the degree celsius symbol.
    • \x{2109} for the degree fahrenheit symbol.
    • \x{23CD} for the square foot symbol.
    • \x{32CC} to \x{32CE} for units symbols.
    • \x{3300} to \x{3357} for ideographic units symbols.
    • \x{3371} to \x{33DF} for more units symbols.
    • \x{33FF} for the gallon symbol
  • Punctuation in the Unicode other punctuation category:
    • \x{0023} for the number sign.
    • \x{0025}, \x{FE6A}, and \x{FF05} for the percent signs.
    • \x{2030} for the per mille sign.
    • \x{2031} for the per ten thousand sign.

Any of these symbols could be used within a number range, such as $30-$40. A sign could come before or after these symbols, such as -$20 and $-20.

The expression below extends the previous one to include currency, units of measure, AM/PM, and percents:

// Remove delimited numbers with signs, exponents, fractions,
// ranges, ratios, currency, am/pm, percent, and units
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^'  . $notdelim . '])|$)';
 
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';
$colon         = ':\x{FE55}\x{FF1A}\x{2236}';
$units         = '%\x{FF05}\x{FE6A}\x{2030}\x{2031}';
$units        .= '\x{00B0}\x{2103}\x{2109}\x{23CD}';
$units        .= '\x{32CC}-\x{32CE}';
$units        .= '\x{3300}-\x{3357}';
$units        .= '\x{3371}-\x{33DF}';
$units        .= '\x{33FF}';
$percents      = '%\x{FE6A}\x{FF05}\x{2030}\x{2031}';
$ampm          = '([aApP][mM])';
  
$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$prenum        = $sign . '[\p{Sc}#]?' . $sign;
$postnum       = '([\p{Sc}' . $units . $percents . ']|' . $ampm . ')?';
$number        = $prenum . $digits . $exponent . $postnum;
$fraction      = $number . '(' . $slash . $number . ')?';
$numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' . $fraction . ')*';
 
$text = preg_replace( '/' . $predelim . $numpair . $postdelim . '/u',
    ' ', $text );

Limitations: Some malformed numbers are matched, such as -$-20 or #1,2.3:4-5$.
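The sign and currency prefix can be tried on its own with a short sketch like the one below. It uses a simplified $sign (ASCII plus, Unicode minus, and dashes only), and the helper name is_prefixed_number() is an assumption for this illustration.

```php
<?php
// Simplified pieces of the expression above.
$sign   = '[+\x{2212}\p{Pd}]?';
$prenum = $sign . '[\p{Sc}#]?' . $sign;

// Hypothetical helper: does the whole string look like a number with an
// optional sign, currency symbol or number sign, and second sign?
function is_prefixed_number( $text, $prenum )
{
    return preg_match( '/^' . $prenum . '\p{N}+$/u', $text ) === 1;
}

$samples = array( '-$20', '$-20', '€20', '#1', '-$-20', '$$20' );
foreach ( $samples as $sample )
{
    echo $sample, ': ', ( is_prefixed_number( $sample, $prenum ) ? 'match' : 'no match' ), "\n";
}
```

All samples match except $$20. The malformed -$-20 is accepted because a sign is allowed on both sides of the currency symbol.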

Numbers followed by units of measure that don't use special Unicode characters are not removed, such as 20ft or 30m/s. Removing these would require that letters be treated as number delimiters. This would then remove the "30" in 30m/s, but it would also remove "30" in user30m@example.com or http://example.com/30m/s, which is undesirable. It is conceivable that the regular expressions could be extended to watch for specific units of measure names, such as "ft", "km", and "ml". Still, these would match user30ft@example.com too. Without a much more complex text parser, it isn't possible to distinguish good from bad matches.

Shorthands like "x10" or "10x" for "10 times" are not matched. Neither are "1st", "2nd", "3rd", etc. This is the same letter matching problem as above.
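For the units-of-measure case, one conservative compromise is to match a short list of unit names only when the whole token is surrounded by white space. This is a sketch under that assumption. The function name strip_named_units() and the unit list are hypothetical and are not part of strip_numbers().

```php
<?php
// Hypothetical helper: remove numbers followed by a few named units,
// but only when the token has white space (or the start or end of the
// text) on both sides.
function strip_named_units( $text )
{
    $names = 'ft|km|ml';   // Assumed unit list, for illustration only
    return preg_replace(
        '/(?<!\S)\p{N}+(?:' . $names . ')(?!\S)/u', ' ', $text );
}

echo strip_named_units( 'A 20ft pole, mail user30ft@example.com' ), "\n";
```

This leaves user30ft@example.com alone, but it also misses a unit followed by punctuation, such as "20ft." at the end of a sentence, so the trade-off described above still applies.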

Removing consecutive spaces

The above expression replaces numbers with spaces. This avoids joining words adjacent to the number, but it can leave multiple consecutive spaces. To clean up, remove them.

// Remove consecutive spaces
$text = preg_replace( '/ +/', ' ', $text);

Other issues

For this to work reliably:

  • Before removing numbers, use the web page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in the UTF-8 character encoding. Running preg_replace() with the /u pattern modifier on non-UTF-8 text sometimes causes the function to abort and return an empty string or the original text unprocessed.
  • Strip HTML tags and decode HTML entities first. This gives you pure UTF-8 text with all HTML-specific symbols already removed.
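As a defensive sketch for the first point, the code below checks that the text is well-formed UTF-8 before running a /u expression, and reports failure explicitly instead of passing along an empty or unprocessed result. The function names is_valid_utf8() and replace_utf8() are hypothetical.

```php
<?php
// Hypothetical helper: true if $text is well-formed UTF-8.
function is_valid_utf8( $text )
{
    // An empty pattern with /u matches any well-formed UTF-8 string.
    // On malformed input preg_match() does not return 1.
    return preg_match( '//u', $text ) === 1;
}

// Hypothetical wrapper: run a /u replacement, or return false on failure.
function replace_utf8( $pattern, $replacement, $text )
{
    if ( !is_valid_utf8( $text ) )
        return false;
    $result = preg_replace( $pattern, $replacement, $text );
    return is_string( $result ) ? $result : false;
}

// "\xE9" is an ISO-8859-1 "é", which is not valid UTF-8 on its own.
var_dump( replace_utf8( '/\p{N}+/u', ' ', "caf\xE9 12" ) );       // bool(false)
var_dump( replace_utf8( '/\p{N}+/u', ' ', "caf\xC3\xA9 12" ) );   // a string
```

A caller can then decide whether to convert the text again with a different source encoding or to skip the page.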
