Numbers in prices, quantities, dates, times, phone numbers, and addresses may not be of interest when processing a web page for a PHP search engine or keyword analysis tool. In international text there are around 900 different types of digits, currency symbols, and units of measure marks that need to be removed. This tip shows how to remove numbers and number-related characters.

This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, punctuation, and symbol characters, and break a page down into a keyword list.

## Introduction

Text processing starts by removing the least important parts of a web page. This may include stripping HTML tags, scripts, and styles, stripping punctuation characters (such as ! ? & ( ) # @), and stripping symbol characters (such as smileys, arrows, and mathematics symbols). This only leaves words and numbers. Stripping out numbers is also useful:

- **Search indexing:** Numbers make poor search terms. Literal searching on "12.30" won't find "12.3", despite identical meanings. If people don't use numbers in searches, then remove them from web page text before adding the remaining content to the search index. This speeds up page indexing and reduces the storage needed for the index.
- **Keyword extraction:** The most important and frequently used words on a page give a rough idea of the page's topic (often used in tag clouds). On a financial report, the numbers add valuable detail, but in everyday text numbers are secondary and can be safely removed during keyword extraction.
- **Page statistics:** Numbers aren't useful when computing the grade level of a page's vocabulary. They may add to the word count of a page, but they are probably not useful in estimating the page's writing complexity. Remove the numbers first.

Removing numbers simplifies text processing, but it can remove necessary detail. For a web page listing CDs for sale, removing its numbers strips away release dates, track lengths, and prices. What's left may not be very useful. Similarly, removing numbers from a financial report, an algebra lesson, or a list of IP addresses rather damages the page.

For those tasks where number removal is desirable, the regular expressions in this article do a reasonable job. They remove digits, decimal and thousands separators, plus and minus signs, and currency symbols.

## Code

The following function uses `preg_replace()` and Unicode (encoded with UTF-8) to strip numbers and number-related characters from international text. The regular expressions are explained later in this article. This function's only argument is the UTF-8 text to strip. The stripped text is returned.

Download: strip_numbers.zip.

```php
/**
 * Strip numbers from text.
 */
function strip_numbers( $text )
{
    $urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
    $notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
    $predelim      = '((?<=[^' . $notdelim . '])|^)';
    $postdelim     = '((?=[^' . $notdelim . '])|$)';

    $fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
    $comma         = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep       = '\x{066B}\x{066C}';
    $numseparators = $fullstop . $comma . $arabsep;
    $plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
    $slash         = '[\/\x{2044}]';
    $colon         = ':\x{FE55}\x{FF1A}\x{2236}';

    // The small percent sign is U+FE6A (U+FE64 is the small less-than sign)
    $units         = '%\x{FF05}\x{FE6A}\x{2030}\x{2031}';
    $units        .= '\x{00B0}\x{2103}\x{2109}\x{23CD}';
    $units        .= '\x{32CC}-\x{32CE}';
    $units        .= '\x{3300}-\x{3357}';
    $units        .= '\x{3371}-\x{33DF}';
    $units        .= '\x{33FF}';
    $percents      = '%\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $ampm          = '([aApP][mM])';

    $digits        = '[\p{N}' . $numseparators . ']+';
    $sign          = '[' . $plus . $minus . ']?';
    $exponent      = '([eE]' . $sign . $digits . ')?';
    $prenum        = $sign . '[\p{Sc}#]?' . $sign;
    $postnum       = '([\p{Sc}' . $units . $percents . ']|' . $ampm . ')?';
    $number        = $prenum . $digits . $exponent . $postnum;
    $fraction      = $number . '(' . $slash . $number . ')?';
    $numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' .
        $fraction . ')*';

    return preg_replace(
        array(
            // Match delimited numbers
            '/' . $predelim . $numpair . $postdelim . '/u',
            // Match consecutive white space
            '/ +/u',
        ),
        ' ',
        $text );
}
```

## Example

Read an HTML file, convert to UTF-8, remove HTML tags, decode HTML entities into UTF-8, and strip out numbers:

```php
/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+' .
    'content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );

/* Assumption: fall back to UTF-8 when the page declares no charset */
$encoding = isset( $matches[3] ) ? $matches[3] : 'utf-8';

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_NOQUOTES, "UTF-8" );

/* Remove numbers */
$utf8_text = strip_numbers( $utf8_text );
```

On this input:

Remove standalone 1 23.4 5,678.9 and malformed numbers 1...,2.3,,,,45.. Remove IP addresses 127.0.0.1. Keep commas, and full stops in sentences. And domain.com names. Keep numbers in URLs http://site5.example.com:80/get?a=(123)&q=45. Remove signs -12 +34 and exponents -1.2e-34.5 +6.78E90. Keep hyphens and non-number dashes - like this - or -- this. Remove fractions 1/2 -3/4 5/-6 and ranges 2006-2007 -1--2. Keep slashes in /file/names.txt. Remove ratios 1:2 -2:-3 1/2:3/4 and phone numbers 1-800-555-1234 Keep colons used like: that, or http://this.com:80. Remove number signs #1 and times 12:00am 4:30PM-5:00pm Keep # without a number and am or pm. Remove percents 10% per-mille 10‰ and per-ten-thousand 10％ Keep percents in URLs http://example.com/a%20space.txt. Remove units symbols 12㎧ 32℉ Remove currency symbols $1 2¢ 3.4€ Keep stand-alone $ and € signs.

Generates this output:

Remove standalone and malformed numbers Remove IP addresses Keep commas, and full stops in sentences. And domain.com names. Keep numbers in URLs http://site5.example.com:80/get?a=(123)&q=45. Remove signs and exponents Keep hyphens and non-number dashes - like this - or -- this. Remove fractions and ranges Keep slashes in /file/names.txt. Remove ratios and phone numbers Keep colons used like: that, or http://this.com:80. Remove number signs and times Keep # without a number and am or pm. Remove percents per-mille and per-ten-thousand Keep percents in URLs http://example.com/a%20space.txt. Remove units symbols Remove currency symbols Keep stand-alone $ and € signs.

## Explanation

While most of the world's languages use the Arabic numerals 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9, there are additional digit symbols used in other languages, mathematics, and special cases. Unicode includes 290 such digit symbols. There are also 210 more numeric letter symbols, such as the Roman numerals Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ, and those in Old Persian, Cuneiform, and others. Add another 336 special symbols for fractions (such as ¼ ½ ¾), superscripts (such as ¹ ³ ²), subscripts (such as ₄ ₅ ₆), circled and parenthesized digits (such as ⑦ ❽ ⑼), and more.

Numbers are more than digits. They also include number modifiers that give context to a symbol, like currency symbols (such as $ ¢ £ ¥ €), units of measure abbreviations (such as ㎓ ㎖ ㎞ ㎲ ㎧), plus, minus, percent, and so on. Numbers also come in groups, such as hours and minutes in *12:00*, and digit groups in a phone number, like *1-800-555-1234*, or Internet address, like *127.0.0.1*.

To properly remove numbers, code must consider context. A dash should be removed when used as a minus or in a numeric range, like *2006-2007*, but not when used as a hyphen. A colon should be removed in a ratio, like *4:3*, but not when used as a phrase delimiter in a sentence. A full stop (period) should be removed as a decimal separator in *123.45*, but left alone at the end of a sentence or between words in a domain name, like *example.com*.

### Unicode categories

PHP's standard `preg_replace()` function supports regular expressions and the `/u` pattern modifier to match characters based upon their Unicode category. To match a character in a category, start with `\p{` followed by a category code and `}`. For instance, `\p{Nd}` matches any Unicode *number digit* and `\p{Sc}` matches *currency symbols*.
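As a minimal sketch of category matching, the snippet below pulls the decimal digits and currency symbols out of a short string. The sample text and variable names are made up for illustration.

```php
<?php
// Sample UTF-8 text mixing ASCII digits, Arabic-Indic digits, and currency symbols.
$text = 'Price: $12 or €٣٤';

// \p{Nd} matches decimal digits in any script. The /u modifier makes
// preg_match_all() treat the pattern and the text as UTF-8.
preg_match_all( '/\p{Nd}/u', $text, $digits );

// \p{Sc} matches currency symbols.
preg_match_all( '/\p{Sc}/u', $text, $currency );

echo implode( ' ', $digits[0] ), "\n";    // 1 2 ٣ ٤
echo implode( ' ', $currency[0] ), "\n";  // $ €
```

Without the `/u` modifier the text is processed byte by byte, so multibyte characters such as € are not matched as single characters.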

Below are all 30 Unicode categories, their codes for regular expressions, and a few examples. The *number* categories are the most relevant for number removal, of course, but there are also a few characters in the *symbol* and *punctuation* categories that are of interest.

Unicode 'Letter' category

| Code | Name | Examples |
| --- | --- | --- |
| `Ll` | Letter, lowercase | a b ç ď ĕ ʑ ʘ π й |
| `Lm` | Letter, modifier | ˇ ˆ ๆ ゞ |
| `Lo` | Letter, other | א ก あ ア ꀀ 豈 |
| `Lt` | Letter, titlecase | ǅ ᾈ ᾨ |
| `Lu` | Letter, uppercase | Æ Δ Ω Ж Ç |

Unicode 'Mark' category

| Code | Name | Examples |
| --- | --- | --- |
| `Mc` | Mark, spacing combining | ூ ௗ ཿ |
| `Me` | Mark, enclosing | ۞ ⃟ ⃞ |
| `Mn` | Mark, nonspacing | ̺ ۖ ཹ |

Unicode 'Number' category

| Code | Name | Examples |
| --- | --- | --- |
| `Nd` | Number, decimal digits | 0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ |
| `Nl` | Number, letter | Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ 〸 〹 〺 |
| `No` | Number, other | ¼ ½ ¾ ¹ ³ ² ₄ ₅ ₆ ⑦ ❽ ⑼ |

Unicode 'Punctuation' category

| Code | Name | Examples |
| --- | --- | --- |
| `Pc` | Punctuation, connector | _ ‿ ⁀ ⁔ ﹎ ﹏ |
| `Pd` | Punctuation, dash | - – — 〜 〰 |
| `Pe` | Punctuation, close | ) ] } ⁆ ❱ ﴿ ︶ ︾ ｣ |
| `Pf` | Punctuation, final quote | » ’ ” › |
| `Pi` | Punctuation, initial quote | « ‘ “ ‹ |
| `Po` | Punctuation, other | ' " # % & ! . : , ? ¿ |
| `Ps` | Punctuation, open | ( [ { ⁅ ❰ ﴾ ︿ ︽ ｢ |

Unicode 'Symbol' category

| Code | Name | Examples |
| --- | --- | --- |
| `Sc` | Symbol, currency | $ ¢ £ € ¥ |
| `Sk` | Symbol, modifier | ^ \` ´ |
| `Sm` | Symbol, mathematics | + = < > |
| `So` | Symbol, other | § © ® ¶ |

Unicode 'Separator' category

| Code | Name | Examples |
| --- | --- | --- |
| `Zl` | Separator, line | |
| `Zp` | Separator, paragraph | |
| `Zs` | Separator, space | space, en space, em space |

Unicode 'Other' category

| Code | Name | Examples |
| --- | --- | --- |
| `Cc` | Other, control | tab, linefeed, carriage return |
| `Cf` | Other, format | |
| `Cn` | Other, not assigned | |
| `Co` | Other, private use | Apple logo |
| `Cs` | Other, surrogate | |

Unicode.org has definitive information about Unicode, including Unicode code charts listing all of Unicode's characters. However, FileFormat.info has more user-friendly Unicode information, including Unicode Character Categories listing all 30 categories and links to lists of characters within them. Wikipedia also has many good articles on Unicode, including Mapping of Unicode characters and Punctuation.

### Removing number digits

Unicode decimal digits are matched by `\p{Nd}`, number letters by `\p{Nl}`, and other number characters by `\p{No}`. Characters in all three categories are matched by `\p{N}`.
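To see how the three subcategories divide up number characters, here is a small sketch. The helper name `number_category()` is hypothetical and not part of the article's download.

```php
<?php
// Hypothetical helper: name the Unicode number subcategory of a single
// UTF-8 character, or 'none' when it is not a number character.
function number_category( $char )
{
    foreach ( array( 'Nd', 'Nl', 'No' ) as $code ) {
        if ( preg_match( '/^\p{' . $code . '}$/u', $char ) ) {
            return $code;
        }
    }
    return 'none';
}

foreach ( array( '7', '٧', 'Ⅳ', '½', '②', 'x' ) as $char ) {
    echo $char, ' => ', number_category( $char ), "\n";
}
// 7 => Nd
// ٧ => Nd
// Ⅳ => Nl
// ½ => No
// ② => No
// x => none
```

Every character reported as `Nd`, `Nl`, or `No` here is also matched by the combined `\p{N}`.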

All of these number characters can be removed for stand-alone numbers, like *123*, but not when they're embedded within words, such as:

- Names and technical terms like *Lucky-7*, *SP1*, and *v1.0*.
- Domain names and email addresses like *user123@site5.example.com*.
- URLs like *http://b4.com:80/v1.htm*.
- File names like */usr/local/bin/php5* or *C:\Program Files\Acrobat8*.

A removable number is a sequence of digits, full stops (periods), and commas, delimited from the surrounding text. Valid delimiters are white space (separators and control characters), and most punctuation and symbols. For the regular expression below, it is more convenient to list characters that are *not* number delimiters. These include letters, marks (such as accents), numbers themselves, connectors (such as underscores), dashes, and web characters like @ : / \ (see the RFC3986 Uniform Resource Identifier (URI) specification for a full list of web characters).

Unicode has several variants for decimal and thousands separator characters embedded within a number. All of these may occur in formatted text:

| | Normal | Small | Fullwidth |
| --- | --- | --- | --- |
| Full stop (period) | `\x{002E} = .` | `\x{FE52} = ﹒` | `\x{FF0E} = ．` |
| Comma | `\x{002C} = ,` | `\x{FE50} = ﹐` | `\x{FF0C} = ，` |
| Arabic decimal separator | `\x{066B} = ٫` | | |
| Arabic thousands separator | `\x{066C} = ٬` | | |

```php
// Remove delimited numbers
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^' . $notdelim . '])|$)';

$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;

$digits        = '[\p{N}' . $numseparators . ']+';

$text = preg_replace(
    '/' . $predelim . $digits . $postdelim . '/u',
    ' ',
    $text );
```

**Limitations:** Several punctuation characters in URLs also occur often in normal text, such as full stop (period), comma, colon, and parentheses. To keep URLs intact, these characters are *not* number delimiters. While this keeps numbers in a URL like *http://example.com/123:4,5(6).txt*, it also prevents their removal when they are immediately preceded or followed by these punctuation characters. For instance, a number in parentheses, like *(123)*, or one followed directly by a question mark or exclamation point at the end of a sentence, like *123!*, is not removed. A fix for this is to remove punctuation used in a non-URL context first, but that is beyond the scope of this article. Instead, see the article on stripping punctuation characters and use its code first, before removing numbers.
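The behavior described above can be checked with a short sketch. It wraps the delimited-number expression in a hypothetical function, `strip_plain_numbers()`, and adds the consecutive-space cleanup so the result is easier to read. The sample sentence is made up.

```php
<?php
// Hypothetical wrapper around the delimited-number expression,
// followed by a cleanup of consecutive spaces.
function strip_plain_numbers( $text )
{
    $urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
    $notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
    $predelim      = '((?<=[^' . $notdelim . '])|^)';
    $postdelim     = '((?=[^' . $notdelim . '])|$)';

    $fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
    $comma         = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep       = '\x{066B}\x{066C}';
    $numseparators = $fullstop . $comma . $arabsep;
    $digits        = '[\p{N}' . $numseparators . ']+';

    return preg_replace(
        array(
            '/' . $predelim . $digits . $postdelim . '/u',
            '/ +/u',
        ),
        ' ',
        $text );
}

echo strip_plain_numbers(
    'Call SP1 about 123 items (456) at v1.0, total 7,890.5 ok' ), "\n";
// Call SP1 about items (456) at v1.0, total ok
```

The stand-alone *123* and *7,890.5* are removed, the embedded digits in *SP1* and *v1.0* are kept, and *(456)* survives because parentheses are not number delimiters.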

Also, hex numbers are not removed. This would require extending the digit list to include A-F, but that would also remove non-hex words that happen to use the same letters, such as "Bee", "Cab", "Feed", etc. It is probably possible to extend this expression to support specific forms of hex numbers, such as *#ABC*, *%xAB*, or `&#xABCD;`.
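One possible way to handle only explicitly marked hex forms is sketched below. It is an untested-in-production assumption rather than part of the article's code: the function name `strip_hex_forms()` is hypothetical, and the three prefixes it accepts (`#`, `%x`, and the HTML-style `&#x…;`) are simply the forms mentioned above. It reuses the same delimiter lists as the other expressions.

```php
<?php
// Hypothetical helper: remove hex numbers only when they carry an
// explicit prefix, so ordinary words such as "Bee" or "Cab" are kept.
function strip_hex_forms( $text )
{
    // Same non-delimiter list as the article's other expressions.
    $urlchars  = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
    $notdelim  = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
    $predelim  = '((?<=[^' . $notdelim . '])|^)';
    $postdelim = '((?=[^' . $notdelim . '])|$)';

    $hexdigits = '[0-9A-Fa-f]+';
    $hexforms  = '(&#[xX]' . $hexdigits . ';'   // &#xABCD;
               . '|%[xX]' . $hexdigits          // %xAB
               . '|#' . $hexdigits . ')';       // #ABC

    return preg_replace(
        array(
            '/' . $predelim . $hexforms . $postdelim . '/u',
            '/ +/u',
        ),
        ' ',
        $text );
}

echo strip_hex_forms(
    'Color #FFF and entity &#xABCD; but keep Bee, Cab, and #hashtag' ), "\n";
// Color and entity but keep Bee, Cab, and #hashtag
```

A prefixed word made only of hex letters, such as *#Bee*, would still be removed, so this narrows the problem without eliminating it.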

### Removing signs and exponents

A plus or minus sign may precede a number. An exponent in scientific notation may follow a number, as in *-1.2e-34*. The number and exponent each may have a plus or minus sign.

Unicode includes five different plus signs: normal, small, fullwidth, subscript, and superscript. All of them should be removed when followed by a number.

Unicode also includes three minus signs: normal, subscript, and superscript. Technically, a dash is not a minus sign, but a dash is easier to type on a standard keyboard. All 18 dash symbols, including long and short dashes, can be matched as a category with `\p{Pd}`. These dashes, and the minus signs, should be removed when followed by a number.

| | Normal | Small | Fullwidth | Subscript | Superscript |
| --- | --- | --- | --- | --- | --- |
| Plus | `\x{002B} = +` | `\x{FE62} = ﹢` | `\x{FF0B} = ＋` | `\x{208A} = ₊` | `\x{207A} = ⁺` |
| Minus | `\x{2212} = −` | | | `\x{208B} = ₋` | `\x{207B} = ⁻` |

The expression below extends the previous one to now include numbers with optional signs and exponents:

```php
// Remove delimited numbers with signs and exponents
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^' . $notdelim . '])|$)';

$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';

$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;

$text = preg_replace(
    '/' . $predelim . $number . $postdelim . '/u',
    ' ',
    $text );
```

### Removing fractions

A fraction is represented as a number, a slash, and another number. Either or both numbers may have a sign or exponent. While a simple slash character is often used, Unicode also defines a special "fraction slash" (`\x{2044}`) that looks the same but clarifies the semantics.

The expression below extends the previous one to include single numbers and a pair of numbers in a fraction:

```php
// Remove delimited numbers with signs, exponents, and fractions
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^' . $notdelim . '])|$)';

$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';

$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;
$fraction      = $number . '(' . $slash . $number . ')?';

$text = preg_replace(
    '/' . $predelim . $fraction . $postdelim . '/u',
    ' ',
    $text );
```

**Limitations:** When this code encounters a malformed fraction with an embedded space, like *1 / 2*, it treats the space as a number delimiter and removes the *1* and *2*, while leaving the slash.
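If leftover slashes from spaced fractions are a problem, one option is a pre-pass that closes up the spaces before number removal runs. The sketch below is an assumption, not part of the article's code, and the name `tighten_fraction_slashes()` is hypothetical. It only touches a slash that sits between two number characters.

```php
<?php
// Hypothetical pre-pass: remove spaces around a slash (or fraction slash)
// when a number character sits on both sides, so "1 / 2" becomes "1/2".
function tighten_fraction_slashes( $text )
{
    return preg_replace(
        '/(?<=\p{N}) *([\/\x{2044}]) *(?=\p{N})/u',
        '$1',
        $text );
}

echo tighten_fraction_slashes( 'Mix 1 / 2 cup with 3/4 tsp, and/or a / b' ), "\n";
// Mix 1/2 cup with 3/4 tsp, and/or a / b
```

Slashes between words are left alone. The trade-off is that text like "page 1 / 2" is also joined and later removed as if it were a fraction.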

### Removing number ranges, ratios, and telephone numbers

A range is a number, a dash, and another number, such as *2006-2008*. Either number could be a fraction, signed, or have an exponent. Any of the various dashes and minus signs could be used between the numbers.

A telephone number is a sequence of digit groups, separated by a space, dash, or full stop, such as *800-555-1234*. International numbers may add a country code, with or without a plus, such as *+39.055.555.123*.

A ratio is a pair of numbers separated by a colon, such as *1:5*. Unicode has several different sizes of colon, plus a special "ratio" character that looks the same but clarifies semantics.

| | Normal | Small | Fullwidth |
| --- | --- | --- | --- |
| Colon | `\x{003A} = :` | `\x{FE55} = ﹕` | `\x{FF1A} = ：` |
| Ratio | `\x{2236} = ∶` | | |

The expression below extends the previous one to include ranges and ratios:

```php
// Remove delimited numbers with signs, exponents, fractions,
// ranges, and ratios
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^' . $notdelim . '])|$)';

$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';
$colon         = ':\x{FE55}\x{FF1A}\x{2236}';

$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';
$number        = $sign . $digits . $exponent;
$fraction      = $number . '(' . $slash . $number . ')?';
$numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' .
    $fraction . ')*';

$text = preg_replace(
    '/' . $predelim . $numpair . $postdelim . '/u',
    ' ',
    $text );
```
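As a quick check of how number groups are treated as a single unit, the sketch below wraps the range and ratio expression in a hypothetical function, `strip_number_groups()`, and adds the consecutive-space cleanup. The sample sentence is made up.

```php
<?php
// Hypothetical wrapper around the range/ratio expression,
// followed by a cleanup of consecutive spaces.
function strip_number_groups( $text )
{
    $urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
    $notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
    $predelim      = '((?<=[^' . $notdelim . '])|^)';
    $postdelim     = '((?=[^' . $notdelim . '])|$)';

    $fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
    $comma         = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep       = '\x{066B}\x{066C}';
    $numseparators = $fullstop . $comma . $arabsep;
    $plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
    $slash         = '[\/\x{2044}]';
    $colon         = ':\x{FE55}\x{FF1A}\x{2236}';

    $digits        = '[\p{N}' . $numseparators . ']+';
    $sign          = '[' . $plus . $minus . ']?';
    $exponent      = '([eE]' . $sign . $digits . ')?';
    $number        = $sign . $digits . $exponent;
    $fraction      = $number . '(' . $slash . $number . ')?';
    $numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' .
        $fraction . ')*';

    return preg_replace(
        array(
            '/' . $predelim . $numpair . $postdelim . '/u',
            '/ +/u',
        ),
        ' ',
        $text );
}

echo strip_number_groups(
    'Open 2006-2008 at ratio 4:3 and call 800-555-1234 or +39.055.555.123 today' ), "\n";
// Open at ratio and call or today
```

Each range, ratio, and phone number is removed whole, so no stray dashes or colons are left behind.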

### Removing number modifiers

Several characters found next to or embedded within a number should be removed along with the number, including:

- A currency symbol before or after a number.
- A number sign (#) before a number.
- A percent, per thousand, or per ten thousand sign after a number.
- A units of measure symbol after a number.
- An AM or PM after a number.

All currency symbols can be matched using the Unicode *currency symbol* category `\p{Sc}`. The remaining characters need to be listed explicitly:

- Units of measure in the Unicode *other symbols* category:
  - `\x{00B0}` for the degree symbol.
  - `\x{2103}` for the degree Celsius symbol.
  - `\x{2109}` for the degree Fahrenheit symbol.
  - `\x{23CD}` for the square foot symbol.
  - `\x{32CC}` to `\x{32CE}` for units symbols.
  - `\x{3300}` to `\x{3357}` for ideographic units symbols.
  - `\x{3371}` to `\x{33DF}` for more units symbols.
  - `\x{33FF}` for the gallon symbol.
- Punctuation in the Unicode *other punctuation* category:
  - `\x{0023}` for the number sign.
  - `\x{0025}`, `\x{FE6A}`, and `\x{FF05}` for the percent signs (the small percent sign is U+FE6A; U+FE64 is the small less-than sign).
  - `\x{2030}` for the per mille sign.
  - `\x{2031}` for the per ten thousand sign.

Any of these symbols could be used within a number range, such as *$30-$40*. A sign could come before or after these symbols, such as *-$20* and *$-20*.

The expression below extends the previous one to include currency, units of measure, AM/PM, and percents:

```php
// Remove delimited numbers with signs, exponents, fractions,
// ranges, ratios, currency, am/pm, percent, and units
$urlchars      = '\.,:;\'=+\-_\*%@&\/\\\\?!#~\[\]\(\)';
$notdelim      = '\p{L}\p{M}\p{N}\p{Pc}\p{Pd}' . $urlchars;
$predelim      = '((?<=[^' . $notdelim . '])|^)';
$postdelim     = '((?=[^' . $notdelim . '])|$)';

$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$plus          = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
$minus         = '\x{2212}\x{208B}\x{207B}\p{Pd}';
$slash         = '[\/\x{2044}]';
$colon         = ':\x{FE55}\x{FF1A}\x{2236}';

$digits        = '[\p{N}' . $numseparators . ']+';
$sign          = '[' . $plus . $minus . ']?';
$exponent      = '([eE]' . $sign . $digits . ')?';

// The small percent sign is U+FE6A (U+FE64 is the small less-than sign)
$units         = '%\x{FF05}\x{FE6A}\x{2030}\x{2031}';
$units        .= '\x{00B0}\x{2103}\x{2109}\x{23CD}';
$units        .= '\x{32CC}-\x{32CE}';
$units        .= '\x{3300}-\x{3357}';
$units        .= '\x{3371}-\x{33DF}';
$units        .= '\x{33FF}';
$percents      = '%\x{FE6A}\x{FF05}\x{2030}\x{2031}';
$ampm          = '([aApP][mM])';

$prenum        = $sign . '[\p{Sc}#]?' . $sign;
$postnum       = '([\p{Sc}' . $units . $percents . ']|' . $ampm . ')?';
$number        = $prenum . $digits . $exponent . $postnum;
$fraction      = $number . '(' . $slash . $number . ')?';
// Full stop and comma stay in $urlchars and $numpair, matching the
// earlier expressions and the complete strip_numbers() function
$numpair       = $fraction . '([' . $minus . $colon . $fullstop . ']' .
    $fraction . ')*';

$text = preg_replace(
    '/' . $predelim . $numpair . $postdelim . '/u',
    ' ',
    $text );
```

**Limitations:** Some malformed numbers are matched, such as *-$-20* or *#1,2.3:4-5$*.
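The reason a malformed prefix gets through is that the prefix is built as "optional sign, optional symbol, optional sign". The reduced sketch below shows that structure on its own. It is a simplification for illustration, not the article's expression: it only handles ASCII plus and minus, plain decimal digits, and a dash range, and it uses whitespace lookarounds as delimiters.

```php
<?php
// Simplified prefix: optional sign, optional currency symbol, optional sign.
$prenum  = '[+\-]?\p{Sc}?[+\-]?';
$number  = $prenum . '\p{Nd}+';

// Assumption: whitespace (or the string edge) is the only delimiter here.
$pattern = '/(?<!\S)' . $number . '(?:-' . $number . ')?(?!\S)/u';

$samples = array( '-$20', '$-20', '$30-$40', '-$-20', '$', 'A-20' );
foreach ( $samples as $sample ) {
    $verdict = preg_match( $pattern, $sample ) ? 'matched' : 'not matched';
    echo $sample, ': ', $verdict, "\n";
}
// -$20: matched
// $-20: matched
// $30-$40: matched
// -$-20: matched
// $: not matched
// A-20: not matched
```

Accepting a sign on either side of the symbol is what allows both *-$20* and *$-20*, and it is also why *-$-20* is accepted. Rejecting it would require spelling out the two valid orders as separate alternatives.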

Numbers followed by units of measure that don't use special Unicode characters are not removed, such as *20ft* or *30m/s*. Removing these would require that letters be treated as number delimiters. This would then remove the "30" in *30m/s*, but it would also remove "30" in *user30m@example.com* or *http://example.com/30m/s*, which is undesirable. It is conceivable that the regular expressions could be extended to watch for specific units of measure names, such as "ft", "km", and "ml". Still, these would match *user30ft@example.com* too. Without a much more complex text parser, it isn't possible to distinguish good from bad matches.

Shorthands like "x10" or "10x" for "10 times" are not matched. Neither are "1st", "2nd", "3rd", etc. This is the same letter matching problem as above.

### Removing consecutive spaces

The above expression replaces numbers with spaces. This avoids joining words adjacent to the number, but it can leave multiple consecutive spaces. To clean up, remove them.

```php
// Remove consecutive spaces
$text = preg_replace( '/ +/', ' ', $text );
```

### Other issues

For this to work reliably:

- Before removing numbers, use the web page's content type to get its character set, then convert to UTF-8 using the `iconv()` function. This ensures that the text is in the UTF-8 character encoding. Running `preg_replace()` with the `/u` pattern modifier on non-UTF-8 text sometimes causes the function to abort and return an empty string or the original text unprocessed.
- Strip HTML tags and decode HTML entities first. This gives you pure UTF-8 text with all HTML-specific symbols already removed.
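A defensive sketch of the first point follows. The helper name `replace_utf8()` is hypothetical. It relies on two documented behaviors: `preg_match()` with an empty `/u` pattern returns `false` for malformed UTF-8, and `preg_replace()` returns `NULL` when it fails, which prints as an empty string. The ISO-8859-1 source encoding is assumed here only for the demonstration.

```php
<?php
// Hypothetical helper: run a /u replacement and fail loudly on bad input
// instead of silently returning nothing.
function replace_utf8( $pattern, $replacement, $text )
{
    // An empty /u pattern is a cheap validity check for UTF-8 text.
    if ( preg_match( '//u', $text ) === false ) {
        throw new InvalidArgumentException(
            'Text is not valid UTF-8; convert it with iconv() first.' );
    }
    $result = preg_replace( $pattern, $replacement, $text );
    if ( $result === null ) {
        throw new RuntimeException(
            'preg_replace() failed, error code ' . preg_last_error() );
    }
    return $result;
}

// "café 123" encoded as ISO-8859-1: the byte 0xE9 is not valid UTF-8.
$bad = "caf\xE9 123";
var_dump( preg_replace( '/\s*\p{N}+/u', '', $bad ) );   // NULL

// After conversion the same pattern works as intended.
$good = iconv( 'ISO-8859-1', 'UTF-8', $bad );
echo replace_utf8( '/\s*\p{N}+/u', '', $good ), "\n";   // café
```

Checking the return value this way makes an encoding problem visible at the point of failure rather than later, when an empty page reaches the keyword list.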

## Downloads

- strip_numbers.zip
  - Includes `strip_numbers.php`. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.

## Further reading

### Related articles at NadeauSoftware.com

- PHP tip: How to strip punctuation characters from a web page. Strip away punctuation, formatting, control, and separator characters while handling special cases for periods, commas, single quotes, and others.
- PHP tip: How to strip symbols characters from a web page. Strip away symbols used for mathematics, science, diagrams, and miscellaneous use while handling special cases for units of measure and ideographic radicals and strokes.
- PHP tip: How to get a web page using the fopen wrappers. Use PHP's file reading functions to get a web page, handling web server redirects and user-agent strings.
- PHP tip: How to get a web page using CURL. Use PHP's CURL (Client URL) functions to get a web page, handling redirects, compressed content, cookies, and user-agent strings.
- PHP tip: How to get a web page content type. Get the MIME type and character encoding from the HTTP header or from the web page content.
- PHP tip: How to strip HTML tags, scripts, and styles from a web page. Remove invisible content between tag pairs, such as styles and scripts. Add line breaks around block-level tags to prevent word joining, and then remove all remaining tags.
- PHP tip: How to decode HTML entities on a web page. Convert all HTML character references and entities into UTF-8 multibyte characters.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.

### Web articles and specifications

- Unicode code charts. Unicode.org has definitive information about Unicode, including tables listing all of Unicode's characters.
- Unicode Character Categories. FileFormat.info has an excellent collection of pages on Unicode, including pages that list the Unicode categories and the letters within them. The site's Unicode pages also include information on browser and font support.
