When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal indexable text. In international text there are hundreds of these characters, and some should be removed in one context, but not in another. This tip shows how.
Table of Contents
- Introduction
- Code
- Example
- Explanation
- Unicode character categories
- Removing line, paragraph, and word separators
- Removing control, formatting, and surrogate characters
- Removing web characters
- Removing brackets
- Removing quotes
- Removing non-quote characters used as quotes
- Removing dashes
- Removing connectors
- Removing number separators
- Removing other punctuation
- Removing consecutive spaces
- Other issues
- Downloads
- Further reading
This article is both an independent article and part of an article series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, symbol characters, and numbers, and break a page down into a keyword list.
Introduction
Text processing begins by removing unneeded elements from a web page. This includes stripping HTML tags, scripts, and styles, stripping symbol characters (such as smileys, arrows, and mathematics symbols), and sometimes stripping numbers. Stripping punctuation is particularly useful:
- Search indexing: People search on words, not punctuation. Save time indexing a page, and storage in the search index, by removing the punctuation first.
- Keyword extraction: The most important and frequently used words on a page give a rough idea of the page's topic (often used in tag clouds). Punctuation adds tone and pacing to text, but rarely significant meaning, so it is usually ignored when extracting page keywords.
- Page statistics: Punctuation doesn't count when calculating the length of a document or the grade level of its vocabulary. Remove the punctuation first.
Natural language processing is needed to do text processing well, but its complexity is beyond the needs of many tasks. Instead, the regular expressions in this article do a reasonable job of removing punctuation for most languages. This leaves word and number tokens that are ready for further processing.
Code
The following function uses preg_replace() and Unicode (encoded with UTF-8) to strip punctuation characters from international text while handling special cases. The regular expressions are explained later in this article. This function's only argument is the UTF-8 text to strip. The stripped text is returned.
Download: strip_punctuation.zip.
/**
 * Strip punctuation from text.
 */
function strip_punctuation( $text )
{
    $urlbrackets    = '\[\]\(\)';
    $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
    $urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
    $urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;

    $specialquotes  = '\'"\*<>';

    $fullstop       = '\x{002E}\x{FE52}\x{FF0E}';
    $comma          = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep        = '\x{066B}\x{066C}';
    $numseparators  = $fullstop . $comma . $arabsep;

    $numbersign     = '\x{0023}\x{FE5F}\x{FF03}';
    $percent        = '\x{066A}\x{0025}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $prime          = '\x{2032}\x{2033}\x{2034}\x{2057}';
    $nummodifiers   = $numbersign . $percent . $prime;

    return preg_replace(
        array(
            // Remove separator, control, formatting, surrogate,
            // open/close quotes.
            '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u',
            // Remove other punctuation except special cases
            '/\p{Po}(?<![' . $specialquotes .
                $numseparators . $urlall . $nummodifiers . '])/u',
            // Remove open/close brackets, except URL brackets.
            '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
            // Remove special quotes, dashes, connectors, number
            // separators, and URL characters followed by a space
            '/[' . $specialquotes . $numseparators . $urlspaceafter .
                '\p{Pd}\p{Pc}]+((?= )|$)/u',
            // Remove special quotes, connectors, and URL characters
            // preceded by a space
            '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u',
            // Remove dashes preceded by a space, but not followed by a number
            '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
            // Remove consecutive spaces
            '/ +/',
        ),
        ' ',
        $text );
}
Example
Read an HTML file, convert to UTF-8, remove HTML tags, decode HTML entities into UTF-8, and strip out punctuation:
/* Read an HTML file */
$raw_text = file_get_contents( $filename );
/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+' .
    'content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];
/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );
/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );
/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_NOQUOTES, "UTF-8" );
/* Remove punctuation */
$utf8_text = strip_punctuation( $utf8_text );
On this input:
Remove extra spaces non-breaking spaces and tabs.
Remove (parens), [brackets], {curlies}, and smileys :-) (*_*)
Remove 'single' "double" “smart” «angled» *asterisk*, and _underbar_ quotes.
Remove misused <less/greater-than> <double> quotes.
Remove quotes with punctuation "inside," and "outside".
Remove "'multiple'" quotes in “"*various*"' _combinations_.”
Keep possesive's and contract'ns.
- Remove bullet and parenthetical — dashes – and rows ----- _____.
Keep hy-phens and http://url-dashes.com and http://file.com/under_bars.
Keep minuses: -123 $-123 -$123; ranges: 123-456; & groups: 1-800-555-1234.
¿Remove sentence punctuation: commas, semi-colons; colons: etc.?
Remove multiple!! uses?? and overuses????!!!!
Keep numbers: 1,234.56 1.234,56, IPs 127.0.0.1, and odd uses: 1...2,34,5.6,,,7.
Keep times 12:00, ratios 4:3, fractions 1/2, and word/pairs.
Remove stand-alone slashes / asterisks * ats @ and... ellipsis.
• Remove misc punctuation: † ‡ ⁂ ⁓ ﹌.
Keep URLs http://site5.example.com:80/a%20b_.php?b=(c,)&d=[#e]-f!/g;*.h
Keep email@addresses.com.
Keep file paths /usr/bin and C:\Program Files\Games.
Keep number modifiers: 10% #1 1,000‰ and 10,000‱
Generates this output (line breaks added back in):
Remove extra spaces non-breaking spaces and tabs
Remove parens brackets curlies and smileys
Remove single double smart angled asterisk and underbar quotes
Remove misused less/greater-than> double> quotes
Remove quotes with punctuation inside and outside
Remove multiple quotes in various combinations
Keep possesive's and contract'ns
Remove bullet and parenthetical dashes and rows
Keep hy-phens and http://url-dashes.com and http://file.com/under_bars
Keep minuses -123 $-123 -$123 ranges 123-456 groups 1-800-555-1234
Remove sentence punctuation commas semi-colons colons etc
Remove multiple uses and overuses
Keep numbers 1,234.56 1.234,56 IPs 127.0.0.1 and odd uses 1...2,34,5.6,,,7
Keep times 12:00 ratios 4:3 fractions 1/2 and word/pairs
Remove stand-alone slashes asterisks ats and ellipsis
Remove misc punctuation
Keep URLs http://site5.example.com:80/a%20b_.php?b=(c,)&d=[#e]-f!/g;*.h
Keep email@addresses.com
Keep file paths /usr/bin and C:\Program Files\Games
Keep number modifiers 10% #1 1,000‰ and 10,000‱
Explanation
Unicode supports over 100,000 characters spanning the world's languages. Every character is categorized as a letter, number, mark, punctuation, symbol, separator, or other character. To remove all punctuation on a web page, a regular expression could just delete all characters in the Unicode punctuation category.
However, punctuation removal is complicated by characters that change their meaning in different contexts. For example:
- A full stop (period) is removable punctuation at the end of a sentence, but not as a decimal separator in 123.45 or as a word separator in a domain name like example.com, or a file name like index.htm.
- An apostrophe is removable punctuation when used as a quote in 'this' but not in a contraction like can't, or a possessive like Dave's.
- A dash is removable punctuation when used parenthetically — like this — but not as a hyphen in up-to-date or as a minus in -123.
- A slash is removable punctuation when used in paired words like and/or and his/hers, or in an abbreviation like r/w or i/o, but not as a fraction mark in 1/2 or a URL file path like http://example.com/index.htm.
- A colon is removable punctuation when separating clauses in a sentence, but not when separating hours and minutes in 12:00, two values in a ratio like 4:3, or http in a URL like http://example.com/.
- And # % ‰ and ‱ are technically punctuation, but they should be left alone so that numbers maintain their meaning, as in 10% and #1.
So, punctuation removal needs to consider context.
Unicode character categories
With over 100,000 characters in Unicode, it isn't practical to create regular expressions that list them one-by-one for removal. Instead, an expression can match an entire category of characters in preg_replace() by using the /u Unicode pattern modifier and \p{XX}, where XX is the category code. For instance, \p{Ps} matches any of Unicode's 66 open brackets, such as ( [ and {. A \p{Pe} matches any of Unicode's 65 closing brackets, such as ) ] and }.
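As a rough illustration of category matching, the sketch below lists the characters of a string that fall into a given category. The helper name chars_in_category() is made up for this example and is not part of the download, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of strip_punctuation.php): return the
// characters in $text that belong to one Unicode category, such as 'Ps'.
function chars_in_category( $text, $category )
{
    preg_match_all( '/\p{' . $category . '}/u', $text, $matches );
    return $matches[0];
}

$text  = 'f(x) = [a, b] 「引用」';
$open  = chars_in_category( $text, 'Ps' );   // opening brackets
$close = chars_in_category( $text, 'Pe' );   // closing brackets

echo implode( ' ', $open ), "\n";    // ( [ 「
echo implode( ' ', $close ), "\n";   // ) ] 」
```

The equals sign and comma are not reported because they belong to the mathematics symbol and other punctuation categories.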
Below are all 30 Unicode categories, their codes for regular expressions, and a few examples. The punctuation, separator, and other categories are the most relevant for punctuation removal. There are also a few characters in the symbol category that are of interest.
Unicode 'Letter' category
| Code | Name | Examples |
|---|---|---|
| Ll | Letter, lowercase | a b ç ď ĕ ʑ ʘ π й |
| Lm | Letter, modifier | ˇ ˆ ๆ ゞ |
| Lo | Letter, other | א ก あ ア ꀀ 豈 |
| Lt | Letter, titlecase | Dž ᾈ ᾨ |
| Lu | Letter, uppercase | Æ Δ Ω Ж Ç |

Unicode 'Mark' category
| Code | Name | Examples |
|---|---|---|
| Mc | Mark, spacing combining | ூ ௗ ཿ |
| Me | Mark, enclosing | ۞ ⃟ ⃞ |
| Mn | Mark, nonspacing | ̺ ۖ ཹ |

Unicode 'Number' category
| Code | Name | Examples |
|---|---|---|
| Nd | Number, decimal digits | 0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ |
| Nl | Number, letter | Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ 〸 〹 〺 |
| No | Number, other | ¼ ½ ¾ ¹ ³ ² ₄ ₅ ₆ ⑦ ❽ ⑼ |

Unicode 'Punctuation' category
| Code | Name | Examples |
|---|---|---|
| Pc | Punctuation, connector | _ ‿ ⁀ ⁔ ﹎ ﹏ |
| Pd | Punctuation, dash | - – — 〜 〰 |
| Pe | Punctuation, close | ) ] } ⁆ ❱ ﴿ ︶ ︾ 」 |
| Pf | Punctuation, final quote | » ’ ” › |
| Pi | Punctuation, initial quote | « ‘ “ ‹ |
| Po | Punctuation, other | ' " # % & ! . : , ? ¿ |
| Ps | Punctuation, open | ( [ { ⁅ ❰ ﴾ ︿ ︽ 「 |

Unicode 'Symbol' category
| Code | Name | Examples |
|---|---|---|
| Sc | Symbol, currency | $ ¢ £ € ¥ |
| Sk | Symbol, modifier | ^ ` ´ |
| Sm | Symbol, mathematics | + = < > |
| So | Symbol, other | § © ® ¶ |

Unicode 'Separator' category
| Code | Name | Examples |
|---|---|---|
| Zl | Separator, line | |
| Zp | Separator, paragraph | |
| Zs | Separator, space | space, en space, em space |

Unicode 'Other' category
| Code | Name | Examples |
|---|---|---|
| Cc | Other, control | tab, linefeed, carriage return |
| Cf | Other, format | |
| Cn | Other, not assigned | |
| Co | Other, private use | Apple logo |
| Cs | Other, surrogate | |
Unicode.org has definitive information about Unicode, including Unicode code charts listing all of Unicode's characters. However, FileFormat.info has more user-friendly Unicode information, including Unicode Character Categories listing all 30 categories and links to lists of characters within them. Wikipedia also has many good articles on Unicode, including Mapping of Unicode characters and Punctuation.
Removing line, paragraph, and word separators
Separator characters delimit lines, paragraphs, and words. The most common separator is a space character, but Unicode defines 18 different spaces, such as en- and em-sized spaces, and a non-breaking space. Replace all of these with a generic space to simplify content analysis and further regular expressions.
// Remove separator characters
$text = preg_replace( '/\p{Z}/u', ' ', $text );
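A small sketch of the effect, assuming PHP 7.0 or later for the "\u{...}" string escapes. The sample string is invented for illustration.

```php
<?php
// Assumes PHP 7.0 or later for the "\u{...}" string escapes.
// U+00A0 is a no-break space, U+2003 an em space, U+2028 a line separator.
$text  = "one\u{00A0}two\u{2003}three\u{2028}four";
$clean = preg_replace( '/\p{Z}/u', ' ', $text );

echo $clean, "\n";   // one two three four
```

Note that a tab is not matched here. It is a control character and is handled by the next expression.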
Removing control, formatting, and surrogate characters
The "Other, control" category includes the old ASCII control characters for tab, line feed, form feed, and carriage return. The "Other, format" category includes invisible formatting characters, and the "Other, surrogate" category holds the reserved markers used in UTF-16, but not UTF-8. Replace all of these with a space.
// Remove control, formatting, and surrogate characters
$text = preg_replace( '/[\p{Cc}\p{Cf}\p{Cs}]/u', ' ', $text );
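The following sketch shows the expression on an invented string, again assuming PHP 7.0 or later for the "\u{...}" escapes. The final cleanup line is an addition for readability of the output.

```php
<?php
// Assumes PHP 7.0 or later for the "\u{...}" string escapes.
// U+FEFF (byte order mark) and U+200B (zero width space) are format
// characters (Cf); tab, carriage return, and line feed are control (Cc).
$text = "\u{FEFF}name\tvalue\u{200B}\r\nnext";
$text = preg_replace( '/[\p{Cc}\p{Cf}\p{Cs}]/u', ' ', $text );

// Collapse the runs of spaces left behind, then trim the ends.
$clean = trim( preg_replace( '/ +/', ' ', $text ) );

echo $clean, "\n";   // name value next
```

Because every match becomes a space, a format character sitting inside a word (a soft hyphen, for example) would split that word in two.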
Removing web characters
URLs, file paths, and email addresses are common in web page text. To keep these items whole, it is important not to remove embedded punctuation characters, such as the @ in person@example.com or the slashes in http://example.com/index.htm. The URL specification reserves or allows 23 punctuation and symbol characters: . , : ; ' - _ * % @ & / ? ! # [ ] ( ) + = ~ $. File paths in Linux and Mac OS X use the same slash character found in URLs, such as /usr/bin, while Windows adds a colon and backslash, as in C:\Program Files. All of these characters need special handling.
In normal text, these characters are removable. This includes the question mark, exclamation mark, and full stop (period) at the end of a sentence, and the comma, colon, and semicolon separating phrases in a sentence. @ and & are also removable in non-web use, such as meet @ store or Dave & Rochell. To distinguish between removable and non-removable uses, let these characters be removed only when preceded or followed by a space. For instance, this will remove colons used as a phrase delimiter, but not within a URL like http://example.com:80. It will remove stand-alone slashes, but not those in a file path.
This space-before-or-after rule also preserves use of these characters in non-web contexts, such as:
- A colon used in ratios, such as 4:3, and to separate hours and minutes, such as 12:00.
- A slash used in fractions, like 1/2, and to separate months, days, and years in a date, such as 1/5/2007.
- A slash used in paired words, like and/or and his/hers, or in an abbreviation, like r/w, i/o, or s/he.
- A slash used in units of measure, such as miles/hour or meters/second.
Two refinements narrow the rule for specific characters:
- Remove number sign, dash, slash, backslash, full stop (period) and comma when followed by a space, but not when preceded by one. This keeps / and \ at the beginning of file paths, full stop and comma at the start of file names, dash used as a minus in -123, and number signs in #1.
- Remove percent when preceded by a space, but not when followed by one. This preserves percent as a number modifier in 10%.
Since this article is about punctuation removal, skip removing + = ~ $ in any context. Technically these are Unicode symbols, not punctuation (but see How to strip symbol characters from a web page).
// Remove web characters preceded or followed by a space
$urlbrackets    = '\[\]\(\)';
$urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
$urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
$urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
$text = preg_replace( '/((?<= )|^)[' . $urlspacebefore . ']+/u', ' ', $text );
$text = preg_replace( '/[' . $urlspaceafter . ']+((?= )|$)/u', ' ', $text );
Limitations: This expression incorrectly removes slashes and closing brackets at the end of URLs. While brackets in URLs are rare, they are common in disambiguating URLs at Wikipedia, such as this URL for an article on file paths: http://en.wikipedia.org/wiki/Path_(computing). Apostrophe, asterisk, and underscore are also incorrectly removed if they are at the start or end of a file name or URL (very rare).
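To make the space rule and its limitation concrete, here is a minimal sketch that wraps the two expressions in a function. The name strip_web_punctuation() and the sample sentences are invented for this example, and the trailing space cleanup is added only so the output is easy to read.

```php
<?php
// Hypothetical wrapper around the two web-character expressions,
// followed by a cleanup of the spaces they leave behind.
function strip_web_punctuation( $text )
{
    $urlbrackets    = '\[\]\(\)';
    $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
    $urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
    $text = preg_replace( '/((?<= )|^)[' . $urlspacebefore . ']+/u', ' ', $text );
    $text = preg_replace( '/[' . $urlspaceafter . ']+((?= )|$)/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

// Sentence punctuation is removed, the URL survives.
$kept = strip_web_punctuation(
    'See http://example.com:80/a.htm: meet @ store, Dave & Rochell!' );
echo $kept, "\n";
// See http://example.com:80/a.htm meet store Dave Rochell

// The limitation: a trailing ")" or "/" is stripped from a URL.
$damaged = strip_web_punctuation(
    'http://en.wikipedia.org/wiki/Path_(computing) and http://example.com/dir/ end' );
echo $damaged, "\n";
// http://en.wikipedia.org/wiki/Path_(computing and http://example.com/dir end
```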
Removing brackets
Unicode defines 66 different opening brackets, matched by \p{Ps}, and 65 different closing brackets, matched by \p{Pe}. These include the common ( [ { } ] ) characters, as well as corner brackets 「 」 『 』 ﹁ ﹂ ﹃ ﹄ used as quotation marks in some East Asian languages.
Remove all brackets, except those in URLs, regardless of context:
// Remove all non-URL brackets
$text = preg_replace( '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
' ', $text );
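A short sketch of what this step does and does not touch, using an invented sample string saved as UTF-8. Curly and corner brackets are replaced, while round and square brackets are left for the space rule described under web characters.

```php
<?php
// Assumes this source file is saved as UTF-8.
$urlbrackets = '\[\]\(\)';
$text  = 'array{key} 「quote」 f(x) a[0]';
$text  = preg_replace( '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
    ' ', $text );

// Collapse the spaces left behind so the result is easy to compare.
$clean = preg_replace( '/ +/', ' ', $text );

echo $clean, "\n";   // array key quote f(x) a[0]
```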
Removing quotes
Unicode includes 11 opening quotes, matched by \p{Pi}, and 9 closing quotes, matched by \p{Pf}. These include the familiar tilted (“smart”) single and double quotes, ‘ “ ” ’, and single and double angle bracket quotes, ‹ « » ›. These do not include the apostrophe and straight double quotes, ' " in the other punctuation category (because they are not strictly used as quotes). Wikipedia has articles on English and non-English use of quotation marks.
When used as quotes, all of these characters may be safely removed.
// Remove opening and closing quotes
$text = preg_replace( '/[\p{Pi}\p{Pf}]/u', ' ', $text );
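The sketch below runs the expression on an invented sample, assuming the source file is saved as UTF-8. The last word is included on purpose to show a side effect.

```php
<?php
// Assumes this source file is saved as UTF-8.
$text  = '“smart” «angled» ‹single› don’t';
$text  = preg_replace( '/[\p{Pi}\p{Pf}]/u', ' ', $text );

// Collapse and trim the spaces left behind.
$clean = trim( preg_replace( '/ +/', ' ', $text ) );

echo $clean, "\n";   // smart angled single don t
```

The typographic apostrophe ’ (U+2019) is in the closing quote category, so a contraction typed with it is split in two. Whether that matters depends on the input. Converting U+2019 to a straight apostrophe beforehand is one possible workaround, though it is not something the function in this article does.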
Removing non-quote characters used as quotes
There are several characters sometimes used in a quote-like way.
- An apostrophe, as in 'text'. (other punctuation)
- A double quote, as in "text". (other punctuation)
- An asterisk, as in *text*. (other punctuation)
- An underscore, as in _text_. (connector punctuation)
- Less-than and greater-than signs, as in <text>. (mathematics symbol)
For these cases, the character is removable as an opening quote if it is preceded by a space, and as a closing quote if it is followed by a space. This also preserves apostrophes, asterisks, and underscores embedded within URLs.
// Remove quote-like characters preceded or followed by a space
$specialquotes = '\'"\*_<>';
$text = preg_replace( '/((?<= )|^)[' . $specialquotes . ']+/u', ' ', $text );
$text = preg_replace( '/[' . $specialquotes . ']+((?= )|$)/u', ' ', $text );
Limitations: This also removes a few non-quote uses of these characters. Leading and trailing apostrophes are removed in abbreviations like '70s and maître d' and when used to denote a glottal stop, such as the Hawaiian 'okina.
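Here is a minimal sketch of the rule and the limitation together. The function name strip_special_quotes() and the sample sentence are invented for this example.

```php
<?php
// Hypothetical wrapper around the two quote-like expressions,
// followed by a cleanup of the spaces they leave behind.
function strip_special_quotes( $text )
{
    $specialquotes = '\'"\*_<>';
    $text = preg_replace( '/((?<= )|^)[' . $specialquotes . ']+/u', ' ', $text );
    $text = preg_replace( '/[' . $specialquotes . ']+((?= )|$)/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

$clean = strip_special_quotes(
    'Say "hello" to *bold* _text_ and don\'t forget the \'70s' );

echo $clean, "\n";
// Say hello to bold text and don't forget the 70s
```

The embedded apostrophe in don't is kept, while the leading apostrophe of '70s is lost, as the limitation describes.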
Removing dashes
Unicode defines 18 different dash characters in the dash category matched by \p{Pd}. There are several special cases to remove:
- Dashes used parenthetically — like this — with one or more dashes.
- A dash used like a colon — like this.
- A leading dash used as a bullet or to introduce a line of dialog.
- A row of dashes used as a horizontal rule.
while retaining:
- A dash used to hyphenate a word, like up‐to‐date, and in compound adjectives, like non–Windows.
- A dash used in technical terms, such as CSS identifiers, domain names, and URLs.
- A dash used as a minus sign in -123, in a numeric range like 2006–2007, or to separate digit groups in a telephone number, like 555‒1234, or an ISBN book number, like 0-471-1-16507-7.
- A dash used as a date separator, such as 1-5-2007.
In properly formatted text, many of these cases use different types of dash characters. An en or em dash is used parenthetically, a hyphen character is used for hyphenation, a figure dash for numeric ranges, and a minus for negatives. But in real-world use, a simple dash character may be used for all of these, complicating punctuation removal.
Remove all dashes if followed by a space. This removes parenthetical, colon-like, and horizontal rule uses without removing hyphens, minuses, or dashes in URLs and file names. Also remove all dashes if preceded by a space, and not followed by a number (\p{N}) or currency symbol (\p{Sc}). This removes dash bullets, but not minuses.
// Remove dashes followed by a space
$text = preg_replace( '/\p{Pd}+((?= )|$)/u', ' ', $text );
// Remove dashes preceded by a space and not followed by a number or currency
$text = preg_replace( '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u', ' ', $text );
Limitations: Parenthetical dashes are sometimes used without a space before or after the dash. This use is hard to distinguish from hyphenation and is not removed by the above code. This code removes trailing dashes in file paths and URLs (rare).
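A minimal sketch of both dash rules, assuming the source file is saved as UTF-8. The function name strip_dashes() and the sample strings are invented for this example.

```php
<?php
// Hypothetical wrapper around the two dash expressions,
// followed by a cleanup of the spaces they leave behind.
function strip_dashes( $text )
{
    // Dashes followed by a space or the end of the text
    $text = preg_replace( '/\p{Pd}+((?= )|$)/u', ' ', $text );
    // Dashes preceded by a space, not followed by a number or currency
    $text = preg_replace( '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

$clean = strip_dashes(
    '- item one — up-to-date -123 -$5 2006–2007 -note -----' );
echo $clean, "\n";
// item one up-to-date -123 -$5 2006–2007 note

// The limitation: an unspaced parenthetical dash is left alone.
$unspaced = strip_dashes( 'wait—what' );
echo $unspaced, "\n";
// wait—what
```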
Removing connectors
Unicode defines 10 different connector characters, such as the underscores, matched by \p{Pc}. Remove all connectors if not embedded within a word or URL.
// Remove connectors preceded or followed by a space
$text = preg_replace( '/((?<= )|^)\p{Pc}+/u', ' ', $text );
$text = preg_replace( '/\p{Pc}+((?= )|$)/u', ' ', $text );
Limitations: An old C/C++ programming language convention creates identifiers with a leading underscore to mark internal functions. This use is hard to distinguish from a leading underscore used in emphasis, and will be incorrectly removed by the above code. This code also removes leading and trailing underscores in file paths and URLs (rare).
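The sketch below shows the connector rule on an invented string, including the leading-underscore identifier case from the limitation. The function name strip_connectors() is made up for this example.

```php
<?php
// Hypothetical wrapper around the two connector expressions,
// followed by a cleanup of the spaces they leave behind.
function strip_connectors( $text )
{
    $text = preg_replace( '/((?<= )|^)\p{Pc}+/u', ' ', $text );
    $text = preg_replace( '/\p{Pc}+((?= )|$)/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

$clean = strip_connectors( '_init calls snake_case and __private__ names' );

echo $clean, "\n";
// init calls snake_case and private names
```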
Removing number separators
Full stops (periods) and commas are used in many contexts. For punctuation use, remove:
- A full stop at the end of a sentence.
- A comma at the end of a phrase.
- Two or three full stops used as an ellipsis, like ...
- A row of full stops used as a horizontal rule.
while retaining:
- A full stop, comma, or Arabic decimal separator used as a decimal separator in a number, such as 123.45 or 123,45.
- A full stop, comma, or Arabic thousands separator used as a thousands separator in a number, such as 1,234 or 1.234.
- A full stop used as a digit group separator in a telephone number, such as +39.055.555.123, or an Internet address, such as 127.0.0.1.
- A full stop used as a date separator, such as 1.5.2007.
- A full stop in a domain name like example.com, or a URL or file name like index.htm.
Unicode has several variants of these characters for normal, small, and fullwidth use, any of which may occur in formatted text:
| Character | Normal | Small | Fullwidth |
|---|---|---|---|
| Full stop (period) | \x{002E} = . | \x{FE52} = ﹒ | \x{FF0E} = ． |
| Comma | \x{002C} = , | \x{FE50} = ﹐ | \x{FF0C} = ， |
| Arabic decimal separator | \x{066B} = ٫ | | |
| Arabic thousands separator | \x{066C} = ٬ | | |
For all of these cases, remove one or more full stops or commas if followed by a space. This preserves these characters used within or at the start of a file name or URL.
// Remove number separators followed by a space
$fullstop = '\x{002E}\x{FE52}\x{FF0E}';
$comma = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
$text = preg_replace( '/[' . $numseparators . ']+((?= )|$)/u', ' ', $text );
Limitations: This expression will incorrectly remove trailing full stops in abbreviations, such as Dr. Bob, Jr. and in file names and URLs that end in a full stop or comma (rare).
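A minimal sketch of this rule, with the abbreviation limitation visible in the result. The function name strip_number_separators() and the sample sentence are invented for this example.

```php
<?php
// Hypothetical wrapper around the number separator expression,
// followed by a cleanup of the spaces it leaves behind.
function strip_number_separators( $text )
{
    $fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
    $comma         = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep       = '\x{066B}\x{066C}';
    $numseparators = $fullstop . $comma . $arabsep;
    $text = preg_replace( '/[' . $numseparators . ']+((?= )|$)/u', ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

$clean = strip_number_separators(
    'Pay 1,234.56, ping 127.0.0.1... Dr. Bob, Jr.' );

echo $clean, "\n";
// Pay 1,234.56 ping 127.0.0.1 Dr Bob Jr
```

Separators inside the number and the IP address are kept because none of them is followed by a space.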
Removing other punctuation
The above expressions have already removed all punctuation in the categories for dashes, connectors, quotes, and brackets. The other punctuation category remains and includes the question mark, exclamation mark, semicolon, and dozens of special characters for Greek, Hebrew, Arabic, and others. It also includes several of the characters allowed in URLs, email addresses, and file paths.
Remove all other punctuation except:
- Web characters, such as @ / & : ?.
- Decimal and thousands separators, such as full stop and comma.
- Special quote characters, such as *text*.
- Number signs used as a number indicator, such as #1.
- Percent signs, per thousand signs, and per ten thousand signs, % ‰ ‱, used as number modifiers.
- Single, double, triple, and quadruple primes used to indicate number units, such as feet and inches in 5′2″, or angular arcminutes and arcseconds in 3°8′30″.
In Unicode, some of these characters have normal, small, and fullwidth forms that may be used in formatted text.
| Character | Normal | Small | Fullwidth |
|---|---|---|---|
| Arabic percent sign | \x{066A} = ٪ | | |
| Number sign | \x{0023} = # | \x{FE5F} = ﹟ | \x{FF03} = ＃ |
| Percent sign | \x{0025} = % | \x{FE6A} = ﹪ | \x{FF05} = ％ |
| Per thousand sign | \x{2030} = ‰ | | |
| Per ten thousand sign | \x{2031} = ‱ | | |
| Prime | \x{2032} = ′ | | |
| Double prime | \x{2033} = ″ | | |
| Triple prime | \x{2034} = ‴ | | |
| Quadruple prime | \x{2057} = ⁗ | | |
Remove other punctuation except the above:
// Remove other punctuation, except special cases.
$numbersign = '\x{0023}\x{FE5F}\x{FF03}';
$percent = '\x{066A}\x{0025}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
$prime = '\x{2032}\x{2033}\x{2034}\x{2057}';
$nummodifiers = $numbersign . $percent . $prime;
$text = preg_replace( '/\p{Po}(?<![' .
$urlall . $numseparators . $specialquotes . $nummodifiers. '])/u',
' ', $text );
Limitations: Language-specific punctuation special cases are not handled. For instance, this code will remove punctuation used as tone marks in the Fraser alphabet used by Lisu language in parts of China, Myanmar, India, and Thailand. These expressions will also remove punctuation used to indicate pronunciation shown in dictionaries using the International Phonetic Alphabet (IPA).
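The sketch below isolates this step, assuming the source file is saved as UTF-8. The function name strip_other_punctuation() and the sample string are invented for this example, and the exception lists are copied from the full function near the top of this article.

```php
<?php
// Hypothetical wrapper around the other-punctuation expression,
// followed by a cleanup of the spaces it leaves behind.
function strip_other_punctuation( $text )
{
    $urlbrackets   = '\[\]\(\)';
    $urlall        = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
    $specialquotes = '\'"\*<>';
    $numseparators = '\x{002E}\x{FE52}\x{FF0E}' .
                     '\x{002C}\x{FE50}\x{FF0C}' .
                     '\x{066B}\x{066C}';
    $nummodifiers  = '\x{0023}\x{FE5F}\x{FF03}' .
                     '\x{066A}\x{0025}\x{FE6A}\x{FF05}\x{2030}\x{2031}' .
                     '\x{2032}\x{2033}\x{2034}\x{2057}';
    $text = preg_replace( '/\p{Po}(?<![' .
        $urlall . $numseparators . $specialquotes . $nummodifiers . '])/u',
        ' ', $text );
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

$clean = strip_other_punctuation( '¿Qué? • item † note… 10% #1 5′2″' );

echo $clean, "\n";
// Qué? item note 10% #1 5′2″
```

The ordinary question mark survives this step because it is a web character. It is left for the space-before-or-after rule to remove.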
Removing consecutive spaces
Each of the above expressions replaces punctuation with spaces. This avoids joining together words adjacent to the punctuation, but it can leave multiple consecutive spaces. To clean up, remove them.
// Remove consecutive spaces
$text = preg_replace( '/ +/', ' ', $text );
Other issues
For this to work reliably:
- Before removing punctuation, use the web page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in the UTF-8 character encoding. Running preg_replace() with the /u pattern modifier on non-UTF-8 text sometimes causes the function to abort and return NULL, which prints as an empty string.
- Strip HTML tags and decode HTML entities first. This gives you pure UTF-8 text with all HTML-specific punctuation already removed.
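One way to catch the encoding problem early is to check the result of preg_replace(), which is NULL when the function fails, and to ask preg_last_error() why. The sketch below is only an illustration: the function name replace_separators_checked() is invented, and the sample bytes assume the page was really ISO-8859-1.

```php
<?php
// Hypothetical wrapper: returns false instead of NULL when the input
// is not valid UTF-8, so the caller can convert it and try again.
function replace_separators_checked( $text )
{
    $result = preg_replace( '/\p{Z}/u', ' ', $text );
    if ( $result === null && preg_last_error() === PREG_BAD_UTF8_ERROR )
        return false;
    return $result;
}

$latin1 = "caf\xE9 au lait";   // ISO-8859-1 bytes, invalid as UTF-8
$utf8   = iconv( 'ISO-8859-1', 'UTF-8', $latin1 );

var_dump( replace_separators_checked( $latin1 ) );   // bool(false)
var_dump( replace_separators_checked( $utf8 ) );     // the converted text
```

The same check could be applied to the result of strip_punctuation() before the text is passed on to later steps.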
Downloads
- strip_punctuation.zip
  - Includes strip_punctuation.php. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
Further reading
Related articles at NadeauSoftware.com
- PHP tip: How to strip symbol characters from a web page. Strip away mathematics and miscellaneous symbols while handling special cases for numeric units of measure and for ideographic radicals and strokes used in East Asian languages.
- PHP tip: How to strip numbers from a web page. Strip away numbers and symbols for currency, plus, minus, and units of measure.
- PHP tip: How to get a web page using the fopen wrappers. Use PHP's file reading functions to get a web page, handling web server redirects and user-agent strings.
- PHP tip: How to get a web page using CURL. Use PHP's CURL (Client URL) functions to get a web page, handling redirects, compressed content, cookies, and user-agent strings.
- PHP tip: How to get a web page content type. Get the MIME type and character encoding from the HTTP header or from the web page content.
- PHP tip: How to strip HTML tags, scripts, and styles from a web page. Remove invisible content between tag pairs, such as styles and scripts. Add line breaks around block-level tags to prevent word joining, and then remove all remaining tags.
- PHP tip: How to decode HTML entities on a web page. Convert all HTML character references and entities into UTF-8 multibyte characters.
- PHP tip: How to extract keywords from a web page. Get a good list of keywords from a web page by getting the web page text, converting it to UTF-8, stripping away HTML tags, punctuation, symbols, and numbers, and breaking the text into words.
Web articles and specifications
- Unicode code charts. Unicode.org has definitive information about Unicode, including tables listing all of Unicode's characters.
- Unicode Character Categories. FileFormat.info has an excellent collection of pages on Unicode, including pages that list the Unicode categories and the letters within them. The site's Unicode pages also include information on browser and font support.
- Mapping of Unicode characters. Wikipedia has a good general article on Unicode character categories.

Comments
Thanks
Thank you very much for this function, it really helped me with doing a title -> friendly URL conversion for UTF-8 text.
Regards.
This is great. Thanks for posting this.
Thanks a lot
Thanks a lot for providing such a nice and useful function.
Saved the day ;-)
thanx
great, just amazing!
I don't get it.
This takes out loads of "normal" characters which results in unreadable text.
If I use a simple string with no utf8 characters or any punctuation in it instead of a whole file:-
$description="This is a test with some description and no punctuation";
$description=strip_punctuation($description);
Results in this string:-
Th a t t w th om r t on an no un tuat on
Have you actually tested this? or am I missing something here?
Re: I don't get it
The function is correct. When I run your sample text through it, I get the identical text back again — as I should, since the text has no punctuation.
Have you configured PHP to use a default text encoding that is not UTF-8? If it isn't UTF-8, your innocent text string will be encoded with some other character encoding. When the strip_punctuation function uses UTF-8-specific regular expressions to find UTF-8 characters, they'll match unexpected characters in your non-UTF-8 text and produce odd results.
Excellent post, thanks!
Thank you!
I appreciate this post, it is an excellent function. I'm pulling suggested keywords from content, this came in handy and didnt strip it all. Beer on me irl.
Extremely useful...
Wow, this is an excellent article with very useful information.
Thanks so much for posting this! It's saved me a lot of time!
Thanks!!!
Great article!!!
thanks!
thanks for this. all the encoding is a nightmare