PHP tip: How to strip punctuation characters from a web page

Technologies: PHP 4.3+, UTF-8

When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal indexable text. In international text there are hundreds of these characters, and some should be removed in one context, but not in another. This tip shows how.

This article stands on its own and is also part of a series on How to extract keywords from a web page. The rest of the series looks at how to get a web page from a web server, get the page's content type, convert to UTF-8, strip away HTML syntax, symbol characters, and numbers, and break a page down into a keyword list.

Introduction

Text processing begins by removing unneeded elements from a web page. This includes stripping HTML tags, scripts, and styles, stripping symbol characters (such as smileys, arrows, and mathematics symbols), and sometimes stripping numbers. Stripping punctuation is particularly useful:

  • Search indexing: People search on words, not punctuation. Save time indexing a page, and storage in the search index, by removing the punctuation first.
  • Keyword extraction: The most important and frequently used words on a page give a rough idea of the page's topic (often used in tag clouds). Punctuation adds tone and pacing to text, but rarely significant meaning, so it is usually ignored when extracting page keywords.
  • Page statistics: Punctuation doesn't count when calculating the length of a document or the grade level of its vocabulary. Remove the punctuation first.

Natural language processing is needed to do text processing well, but its complexity is beyond the needs of many tasks. Instead, the regular expressions in this article do a reasonable job of removing punctuation for most languages. This leaves word and number tokens that are ready for further processing.

Code

The following function uses preg_replace() and Unicode (encoded with UTF-8) to strip punctuation characters from international text while handling special cases. The regular expressions are explained later in this article. This function's only argument is the UTF-8 text to strip. The stripped text is returned.

Download: strip_punctuation.zip.

/**
 * Strip punctuation from text.
 */
function strip_punctuation( $text )
{
    $urlbrackets    = '\[\]\(\)';
    $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
    $urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
    $urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
 
    $specialquotes  = '\'"\*<>';
 
    $fullstop       = '\x{002E}\x{FE52}\x{FF0E}';
    $comma          = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep        = '\x{066B}\x{066C}';
    $numseparators  = $fullstop . $comma . $arabsep;
 
    $numbersign     = '\x{0023}\x{FE5F}\x{FF03}';
    $percent        = '\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $prime          = '\x{2032}\x{2033}\x{2034}\x{2057}';
    $nummodifiers   = $numbersign . $percent . $prime;
 
    return preg_replace(
        array(
        // Remove separator, control, formatting, surrogate,
        // open/close quotes.
            '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u',
        // Remove other punctuation except special cases
            '/\p{Po}(?<![' . $specialquotes .
                $numseparators . $urlall . $nummodifiers . '])/u',
        // Remove open/close brackets, except URL brackets.
            '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
        // Remove special quotes, dashes, connectors, number
        // separators, and URL characters followed by a space
            '/[' . $specialquotes . $numseparators . $urlspaceafter .
                '\p{Pd}\p{Pc}]+((?= )|$)/u',
        // Remove special quotes, connectors, and URL characters
        // preceded by a space
            '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u',
        // Remove dashes preceded by a space, but not followed by a number
            '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
        // Remove consecutive spaces
            '/ +/',
        ),
        ' ',
        $text );
}

Example

Read an HTML file, convert to UTF-8, remove HTML tags, decode HTML entities into UTF-8, and strip out punctuation:

/* Read an HTML file */
$raw_text = file_get_contents( $filename );

/* Get the file's character encoding from a <meta> tag */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+' .
    'content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $raw_text, $matches );
$encoding = $matches[3];

/* Convert to UTF-8 before doing anything else */
$utf8_text = iconv( $encoding, "utf-8", $raw_text );

/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );

/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_NOQUOTES, "UTF-8" );

/* Remove punctuation */
$utf8_text = strip_punctuation( $utf8_text );

On this input:

Remove extra   spaces non-breaking spaces and	tabs.
Remove (parens), [brackets], {curlies}, and smileys :-) (*_*)
Remove 'single' "double" “smart” «angled» *asterisk*, and _underbar_ quotes.
Remove misused <less/greater-than> <double> quotes.
Remove quotes with punctuation "inside," and "outside".
Remove "'multiple'" quotes in “"*various*"' _combinations_.”
Keep possesive's and contract'ns.
- Remove bullet and parenthetical — dashes – and rows ----- _____.
Keep hy-phens and http://url-dashes.com and http://file.com/under_bars.
Keep minuses: -123 $-123 -$123; ranges: 123-456; & groups: 1-800-555-1234.
¿Remove sentence punctuation: commas, semi-colons; colons: etc.?
Remove multiple!! uses?? and overuses????!!!!
Keep numbers: 1,234.56 1.234,56, IPs 127.0.0.1, and odd uses: 1...2,34,5.6,,,7.
Keep times 12:00, ratios 4:3, fractions 1/2, and word/pairs.
Remove stand-alone slashes / asterisks * ats @ and... ellipsis.
• Remove misc punctuation: † ‡ ⁂ ⁓ ﹌.
Keep URLs http://site5.example.com:80/a%20b_.php?b=(c,)&d=[#e]-f!/g;*.h
Keep email@addresses.com.
Keep file paths /usr/bin and C:\Program Files\Games.
Keep number modifiers: 10% #1 1,000‰ and 10,000‱

Generates this output (line breaks added back in):

Remove extra spaces non-breaking spaces and tabs
Remove parens brackets curlies and smileys
Remove single double smart angled asterisk and underbar quotes
Remove misused less/greater-than double quotes
Remove quotes with punctuation inside and outside
Remove multiple quotes in various combinations
Keep possesive's and contract'ns
Remove bullet and parenthetical dashes and rows
Keep hy-phens and http://url-dashes.com and http://file.com/under_bars
Keep minuses -123 $-123 -$123 ranges 123-456 groups 1-800-555-1234
Remove sentence punctuation commas semi-colons colons etc
Remove multiple uses and overuses
Keep numbers 1,234.56 1.234,56 IPs 127.0.0.1 and odd uses 1...2,34,5.6,,,7
Keep times 12:00 ratios 4:3 fractions 1/2 and word/pairs
Remove stand-alone slashes asterisks ats and ellipsis
Remove misc punctuation
Keep URLs http://site5.example.com:80/a%20b_.php?b=(c,)&d=[#e]-f!/g;*.h
Keep email@addresses.com
Keep file paths /usr/bin and C:\Program Files\Games
Keep number modifiers 10% #1 1,000‰ and 10,000‱ 

Explanation

Unicode supports over 100,000 characters spanning the world's languages. Every character is categorized as a letter, number, mark, punctuation, symbol, separator, or other character. To remove all punctuation on a web page, a regular expression could just delete all characters in the Unicode punctuation category.

However, punctuation removal is complicated by characters that change their meaning in different contexts:

  • A full stop (period) is removable punctuation at the end of a sentence, but not as a decimal separator in 123.45, as a word separator in a domain name like example.com, or in a file name like index.htm.
  • An apostrophe is removable punctuation when used as a quote in 'this', but not in a contraction like can't or a possessive like Dave's.
  • A dash is removable punctuation when used parenthetically — like this — but not as a hyphen in up-to-date or as a minus in -123.
  • A slash is removable punctuation when it stands alone between words, but not in paired words like and/or and his/hers, in an abbreviation like r/w or i/o, as a fraction mark in 1/2, or in a URL file path like http://example.com/index.htm.
  • A colon is removable punctuation when separating clauses in a sentence, but not when separating hours and minutes in 12:00, two values in a ratio like 4:3, or http in a URL like http://example.com/.
  • # % ‰ and ‱ are technically punctuation, but they should be left alone so that numbers maintain their meaning, as in 10% and #1.

So, punctuation removal needs to consider context.

Unicode character categories

With over 100,000 characters in Unicode, it isn't practical to create regular expressions that list them one-by-one for removal. Instead, an expression can match an entire category of characters in preg_replace() by using the /u Unicode pattern modifier and \p{XX}, where XX is the category code. For instance, \p{Ps} matches any of Unicode's 66 open brackets, such as ( [ and {. A \p{Pe} matches any of Unicode's 65 closing brackets, such as ) ] and }.
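
As a quick check of category matching, the sketch below counts the opening and closing brackets in a made-up sample string. The string and variable names are illustrative only, and the code assumes PHP's PCRE library was built with Unicode property support.

```php
// Illustrative sample; the corner brackets 「 」 belong to Ps and Pe too.
$sample = 'f(x) = [a, b] and 「quoted」';

// preg_match_all() returns the number of matches it found.
$open  = preg_match_all( '/\p{Ps}/u', $sample, $opens );
$close = preg_match_all( '/\p{Pe}/u', $sample, $closes );

echo "open: $open, close: $close\n";   // open: 3, close: 3
```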

Below are all 30 Unicode categories, their codes for regular expressions, and a few examples. The punctuation, separator, and other categories are the most relevant for punctuation removal. There are also a few characters in the symbol category that are of interest.

Unicode 'Letter' category

  Code  Name                      Examples
  Ll    Letter, lowercase         a b ç ď ĕ ʑ ʘ π й
  Lm    Letter, modifier          ˇ ˆ ๆ ゞ
  Lo    Letter, other             א ก あ ア ꀀ 豈
  Lt    Letter, titlecase         Dž ᾈ ᾨ
  Lu    Letter, uppercase         Æ Δ Ω Ж Ç

Unicode 'Mark' category

  Code  Name                      Examples
  Mc    Mark, spacing combining   ூ ௗ ཿ
  Me    Mark, enclosing           ۞ ⃟ ⃞
  Mn    Mark, nonspacing          ̺ ۖ ཹ

Unicode 'Number' category

  Code  Name                      Examples
  Nd    Number, decimal digits    0 1 2 3 4 5 6 7 8 9 ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
  Nl    Number, letter            Ⅰ Ⅱ Ⅲ Ⅳ Ⅼ Ⅿ 〸 〹 〺
  No    Number, other             ¼ ½ ¾ ¹ ³ ² ₄ ₅ ₆ ⑦ ❽ ⑼

Unicode 'Punctuation' category

  Code  Name                        Examples
  Pc    Punctuation, connector      _ ‿ ⁀ ⁔ ﹎ ﹏
  Pd    Punctuation, dash           - – — 〜 〰
  Pe    Punctuation, close          ) ] } ⁆ ❱ ﴿ ︶ ︾ 」
  Pf    Punctuation, final quote    » ’ ” ›
  Pi    Punctuation, initial quote  « ‘ “ ‹
  Po    Punctuation, other          ' " # % & ! . : , ? ¿
  Ps    Punctuation, open           ( [ { ⁅ ❰ ﴾ ︿ ︽ 「

Unicode 'Symbol' category

  Code  Name                      Examples
  Sc    Symbol, currency          $ ¢ £ € ¥
  Sk    Symbol, modifier          ^ ` ´
  Sm    Symbol, mathematics       + = < >
  So    Symbol, other             § © ® ¶

Unicode 'Separator' category

  Code  Name                      Examples
  Zl    Separator, line
  Zp    Separator, paragraph
  Zs    Separator, space          space, en space, em space

Unicode 'Other' category

  Code  Name                      Examples
  Cc    Other, control            tab, linefeed, carriage return
  Cf    Other, format
  Cn    Other, not assigned
  Co    Other, private use        Apple logo
  Cs    Other, surrogate

Unicode.org has definitive information about Unicode, including Unicode code charts listing all of Unicode's characters. However, FileFormat.info has more user-friendly Unicode information, including Unicode Character Categories listing all 30 categories and links to lists of characters within them. Wikipedia also has many good articles on Unicode, including Mapping of Unicode characters and Punctuation.

Removing line, paragraph, and word separators

Separator characters delimit lines, paragraphs, and words. The most common separator is a space character, but Unicode defines 18 different spaces, such as en and em spaces, and a non-breaking space. Replace all of these with a generic space to simplify content analysis and further regular expressions.

// Remove separator characters
$text = preg_replace( '/\p{Z}/u', ' ', $text );
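
For instance, a minimal sketch with a made-up string that contains a no-break space and an em space, written as UTF-8 byte escapes so the invisible characters are explicit:

```php
// "\xC2\xA0" is U+00A0 (no-break space); "\xE2\x80\x83" is U+2003 (em space).
$text = "one\xC2\xA0two\xE2\x80\x83three four";

// Every separator becomes a plain ASCII space.
$text = preg_replace( '/\p{Z}/u', ' ', $text );

echo $text, "\n";   // one two three four
```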

Removing control, formatting, and surrogate characters

The control category (Cc) includes the old ASCII control characters for tab, line feed, form feed, and carriage return. The format category (Cf) includes invisible formatting characters, and the surrogate category (Cs) covers reserved markers used in UTF-16, but not UTF-8. Replace all of these with a space.

// Remove control, formatting, and surrogate characters
$text = preg_replace( '/[\p{Cc}\p{Cf}\p{Cs}]/u', ' ', $text );
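
A small illustration, using a made-up string that starts with a byte order mark (a format character) and contains a tab and a line feed (control characters):

```php
// "\xEF\xBB\xBF" is U+FEFF (byte order mark), a format character;
// "\t" and "\n" are control characters.
$text = "\xEF\xBB\xBFfirst\tsecond\nthird";

$text = preg_replace( '/[\p{Cc}\p{Cf}\p{Cs}]/u', ' ', $text );

// Brackets make the leading space visible.
echo '[', $text, "]\n";   // [ first second third]
```

The leading space left by the byte order mark is harmless here, since the article's last step collapses runs of spaces.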

Removing web characters

URLs, file paths, and email addresses are common in web page text. To keep these items whole, it is important to not remove embedded punctuation characters, such as the @ in person@example.com or the slashes in http://example.com/index.htm. URLs use 23 punctuation and symbol characters, either as reserved delimiters or as ordinary unescaped characters: . , : ; ' - _ * % @ & / ? ! # [ ] ( ) + = ~ $. File paths in Linux and Mac OS X use the same slash character found in URLs, such as /usr/bin, while Windows adds a colon and backslash, as in C:\Program Files. All of these characters need special handling.

In normal text, these characters are removable. This includes the question mark, exclamation mark, and full stop (period) at the end of a sentence, and commas, colons, and semicolons separating phrases in a sentence. @ and & are also removable in non-web use, such as meet @ store or Dave & Rochell. To distinguish between removable and non-removable uses, let these characters be removed only when preceded or followed by a space. For instance, this will remove colons used as a phrase delimiter, but not within a URL like http://example.com:80. It will remove stand-alone slashes, but not those in a file path.

This space-before-or-after rule also preserves use of these characters in non-web contexts, such as:

  • A colon used in ratios, such as 4:3, and to separate hours and minutes, such as 12:00.
  • A slash used in fractions, like 1/2, and to separate months, days, and years in a date, such as 1/5/2007.
  • A slash used in paired words, like and/or and his/hers, or in an abbreviation, like r/w, i/o, or s/he.
  • A slash used in units of measure, such as miles/hour or meters/second.

There are a few exceptions:

  • Remove number sign, dash, slash, backslash, full stop (period) and comma when followed by a space, but not when preceded by one. This keeps / and \ at the beginning of file paths, full stop and comma at the start of file names, dash used as a minus in -123, and number signs in #1.
  • Remove percent when preceded by a space, but not when followed by one. This preserves percent as a number modifier in 10%.

Since this article is about punctuation removal, skip removing + = ~ $ in any context. Technically these are Unicode symbols, not punctuation (but see How to strip symbol characters from a web page).

// Remove web characters preceded or followed by a space
$urlbrackets    = '\[\]\(\)';
$urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
$urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
$urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
 
$text = preg_replace( '/((?<= )|^)[' . $urlspacebefore . ']+/u', ' ', $text );
$text = preg_replace( '/[' . $urlspaceafter . ']+((?= )|$)/u', ' ', $text );

Limitations: This expression incorrectly removes slashes and closing brackets at the end of URLs. While brackets in URLs are rare, they are common in disambiguating URLs at Wikipedia, such as this URL for an article on file paths: http://en.wikipedia.org/wiki/Path_(computing). Apostrophe, asterisk, and underscore are also incorrectly removed if they are at the start or end of a file name or URL (very rare).
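
To see the space-before-or-after rule at work, here is a hedged sketch that runs the two expressions above on a made-up sentence. The final collapse-and-trim line is added only to make the result easier to read.

```php
// Character sets copied from the article's code.
$urlbrackets    = '\[\]\(\)';
$urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
$urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;

// Made-up sample: a URL, a stand-alone colon, a percent, and a stand-alone @.
$text = 'See http://example.com:80/a_b.htm : it costs 10% @ most.';

$text = preg_replace( '/((?<= )|^)[' . $urlspacebefore . ']+/u', ' ', $text );
$text = preg_replace( '/[' . $urlspaceafter . ']+((?= )|$)/u', ' ', $text );

// Collapse leftover spaces and trim the ends, for display only.
$text = trim( preg_replace( '/ +/', ' ', $text ) );

echo $text, "\n";   // See http://example.com:80/a_b.htm it costs 10% most
```

The URL survives intact because none of its punctuation touches a space, while the stand-alone colon, the @, and the sentence-ending full stop are removed.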

Removing brackets

Unicode defines 66 different opening brackets, matched by \p{Ps}, and 65 different closing brackets, matched by \p{Pe}. These include the common ( [ { } ] ) characters, as well as corner brackets 「 」 『 』 ﹁ ﹂ ﹃ ﹄ used as quotation marks in some East Asian languages.

Remove all brackets, except those in URLs, regardless of context:

// Remove all non-URL brackets
$text = preg_replace( '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
    ' ', $text );
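
A short sketch on a made-up string shows which brackets this expression touches. Curly and corner brackets are replaced, while the round and square brackets are left for the URL-aware expressions described earlier.

```php
$urlbrackets = '\[\]\(\)';

// Made-up sample with curly, corner, round, and square brackets.
$text = 'a {b} 「c」 (d) [e]';

$text = preg_replace( '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
    ' ', $text );

// Collapse leftover spaces, for display only.
$text = preg_replace( '/ +/', ' ', $text );

echo $text, "\n";   // a b c (d) [e]
```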

Removing quotes

Unicode includes 11 opening quotes, matched by \p{Pi}, and 9 closing quotes, matched by \p{Pf}. These include the familiar tilted (“smart”) single and double quotes, ‘ “ ” ’, and single and double angle bracket quotes, ‹ « » ›. These do not include the apostrophe and straight double quotes, ' " in the other punctuation category (because they are not strictly used as quotes). Wikipedia has articles on English and non-English use of quotation marks.

When used as quotes, all of these characters may be safely removed.

// Remove opening and closing quotes
$text = preg_replace( '/[\p{Pi}\p{Pf}]/u', ' ', $text );
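
As an illustration on a made-up sentence, the expression removes smart and angled quotes but leaves straight double quotes, which belong to the other punctuation category:

```php
$text = '“Smart” and «angled» quotes go, but "straight" ones stay.';

$text = preg_replace( '/[\p{Pi}\p{Pf}]/u', ' ', $text );

// Collapse leftover spaces and trim the ends, for display only.
$text = trim( preg_replace( '/ +/', ' ', $text ) );

echo $text, "\n";   // Smart and angled quotes go, but "straight" ones stay.
```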

Removing non-quote characters used as quotes

There are several characters sometimes used in a quote-like way.

  • An apostrophe, as in 'text'. (other punctuation)
  • A double quote, as in "text". (other punctuation)
  • An asterisk, as in *text*. (other punctuation)
  • An underscore, as in _text_. (connector punctuation)
  • Less-than and greater-than signs, as in <text>. (mathematics symbol)

For these cases, the character is removable as an opening quote if it is preceded by a space, and as a closing quote if it is followed by a space. This also preserves apostrophes, asterisks, and underscores embedded within URLs.

// Remove quote-like characters preceded or followed by a space
$specialquotes  = '\'"\*_<>';
 
$text = preg_replace( '/((?<= )|^)[' . $specialquotes . ']+/u', ' ', $text);
$text = preg_replace( '/[' . $specialquotes . ']+((?= )|$)/u', ' ', $text );

Limitations: This also removes a few non-quote uses of these characters. Leading and trailing apostrophes are removed in abbreviations like '70s and maître d' and when used to denote a glottal stop, such as the Hawaiian 'okina.
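
The behavior and the limitation can both be seen in a small sketch. The wrapper function strip_special_quotes() is a hypothetical name used here for demonstration; it is not part of the article's download.

```php
// Hypothetical wrapper around the two expressions above.
function strip_special_quotes( $text )
{
    $specialquotes = '\'"\*_<>';

    $text = preg_replace( '/((?<= )|^)[' . $specialquotes . ']+/u', ' ', $text );
    $text = preg_replace( '/[' . $specialquotes . ']+((?= )|$)/u', ' ', $text );

    // Collapse leftover spaces and trim the ends, for display only.
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

// Quote-like uses are removed; embedded apostrophes are kept.
echo strip_special_quotes(
    "He said 'hello' and *really* meant it, but don't and can't stay." ), "\n";
// He said hello and really meant it, but don't and can't stay.

// Limitation: a leading apostrophe in an abbreviation is lost.
echo strip_special_quotes( "the '70s" ), "\n";
// the 70s
```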

Removing dashes

Unicode defines 18 different dash characters in the dash category matched by \p{Pd}. There are several special cases to remove:

  • Dashes used parenthetically — like this — with one or more dashes.
  • A dash used like a colon — like this.
  • A leading dash used as a bullet or to introduce a line of dialog.
  • A row of dashes used as a horizontal rule.

while retaining:

  • A dash used to hyphenate a word, like up‐to‐date, and in compound adjectives, like non–Windows.
  • A dash used in technical terms, such as CSS identifiers, domain names, and URLs.
  • A dash used as a minus sign in -123, in a numeric range like 2006–2007, or to separate digit groups in a telephone number, like 555‒1234, or an ISBN, like 0-471-16507-7.
  • A dash used as a date separator, such as 1-5-2007.

In properly formatted text, many of these cases use different types of dash characters. An en or em dash is used parenthetically, a hyphen character is used for hyphenation, a figure dash for numeric ranges, and a minus for negatives. But in real-world use, a simple dash character may be used for all of these, complicating punctuation removal.

Remove all dashes if followed by a space. This removes parenthetical, colon-like, and horizontal rule uses without removing hyphens, minuses, or dashes in URLs and file names. Also remove all dashes if preceded by a space, and not followed by a number (\p{N}) or currency symbol (\p{Sc}). This removes dash bullets, but not minuses.

// Remove dashes followed by a space
$text = preg_replace( '/\p{Pd}+((?= )|$)/u', ' ', $text );
 
// Remove dashes preceded by a space and not followed by a number or currency
$text = preg_replace( '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u', ' ', $text );

Limitations: Parenthetical dashes are sometimes used without a space before or after the dash. This use is hard to distinguish from hyphenation and is not removed by the above code. This code removes trailing dashes in file paths and URLs (rare).
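
The following sketch applies both dash expressions to a made-up line that mixes a bullet, parenthetical dashes, minus signs, and a numeric range. The function name strip_dashes() is hypothetical and used only for this demonstration.

```php
// Hypothetical wrapper around the two dash expressions above.
function strip_dashes( $text )
{
    // Dashes followed by a space (or at the end of the text)
    $text = preg_replace( '/\p{Pd}+((?= )|$)/u', ' ', $text );

    // Dashes preceded by a space, unless a number or currency symbol follows
    $text = preg_replace( '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u', ' ', $text );

    // Collapse leftover spaces and trim the ends, for display only.
    return trim( preg_replace( '/ +/', ' ', $text ) );
}

echo strip_dashes(
    '- first item -- an aside --like this-- costs -5 or $-5 in 2006-2007' ), "\n";
// first item an aside like this costs -5 or $-5 in 2006-2007
```

The bullet and the parenthetical dashes are removed, while the minus signs and the range keep their dashes because a digit follows each one.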

Removing connectors

Unicode defines 10 different connector characters, such as the underscores, matched by \p{Pc}. Remove all connectors if not embedded within a word or URL.

// Remove connectors preceded or followed by a space
$text = preg_replace( '/((?<= )|^)\p{Pc}+/u', ' ', $text );
$text = preg_replace( '/\p{Pc}+((?= )|$)/u', ' ', $text );

Limitations: An old C/C++ programming language convention creates identifiers with a leading underscore to mark internal functions. This use is hard to distinguish from a leading underscore used in emphasis, and will be incorrectly removed by the above code. This code also removes leading and trailing underscores in file paths and URLs (rare).
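
A brief sketch on a made-up string shows both the intended effect and the leading-underscore limitation:

```php
// Made-up sample: emphasis, an embedded underscore, and a C-style identifier.
$text = '_emphasis_ and snake_case and _internal here';

$text = preg_replace( '/((?<= )|^)\p{Pc}+/u', ' ', $text );
$text = preg_replace( '/\p{Pc}+((?= )|$)/u', ' ', $text );

// Collapse leftover spaces and trim the ends, for display only.
$text = trim( preg_replace( '/ +/', ' ', $text ) );

echo $text, "\n";   // emphasis and snake_case and internal here
```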

Removing number separators

Full stops (periods) and commas are used in many contexts. For punctuation use, remove:

  • A full stop at the end of a sentence.
  • A comma at the end of a phrase.
  • Two or three full stops used as an ellipsis, like ...
  • A row of full stops used as a horizontal rule.

while retaining:

  • A full stop, comma, or Arabic decimal separator used as a decimal separator in a number, such as 123.45 or 123,45.
  • A full stop, comma, or Arabic thousands separator used as a thousands separator in a number, such as 1,234 or 1.234.
  • A full stop used as a digit group separator in a telephone number, such as +39.055.555.123, or an Internet address, such as 127.0.0.1.
  • A full stop used as a date separator, such as 1.5.2007.
  • A full stop in a domain name like example.com, or a URL or file name like index.htm.

Unicode has several variants of these characters for normal, small, and fullwidth use, any of which may occur in formatted text:

                              Normal         Small          Fullwidth
  Full stop (period)          \x{002E} = .   \x{FE52} = ﹒   \x{FF0E} = ．
  Comma                       \x{002C} = ,   \x{FE50} = ﹐   \x{FF0C} = ，
  Arabic decimal separator    \x{066B} = ٫
  Arabic thousands separator  \x{066C} = ٬

For all of these cases, remove one or more full stops or commas if followed by a space. This preserves these characters used within or at the start of a file name or URL.

// Remove number separators followed by a space
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;
 
$text = preg_replace( '/[' . $numseparators . ']+((?= )|$)/u', ' ', $text );

Limitations: This expression will incorrectly remove trailing full stops in abbreviations, such as Dr. Bob, Jr. and in file names and URLs that end in a full stop or comma (rare).
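
As a hedged illustration, the sketch below runs the expression on a made-up sentence containing a decimal number, an ellipsis, an IP address, and a domain name. The collapse-and-trim line is added for display.

```php
// Character sets copied from the article's code.
$fullstop      = '\x{002E}\x{FE52}\x{FF0E}';
$comma         = '\x{002C}\x{FE50}\x{FF0C}';
$arabsep       = '\x{066B}\x{066C}';
$numseparators = $fullstop . $comma . $arabsep;

// Made-up sample sentence.
$text = 'Pi is 3.14, roughly... see 127.0.0.1 and example.com.';

$text = preg_replace( '/[' . $numseparators . ']+((?= )|$)/u', ' ', $text );

// Collapse leftover spaces and trim the ends, for display only.
$text = trim( preg_replace( '/ +/', ' ', $text ) );

echo $text, "\n";   // Pi is 3.14 roughly see 127.0.0.1 and example.com
```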

Removing other punctuation

The above expressions have already removed all punctuation in the categories for dashes, connectors, quotes, and brackets. The other punctuation category remains and includes the question mark, exclamation mark, semicolon, and dozens of special characters for Greek, Hebrew, Arabic, and others. It also includes several of the characters allowed in URLs, email addresses, and file paths.

Remove all other punctuation except:

  • Web characters, such as @ / & : ?.
  • Decimal and thousands separators, such as full stop and comma.
  • Special quote characters, such as *text*.
  • Number signs used as a number indicator, such as #1.
  • Percent signs, per thousand signs, and per ten thousand signs, % ‰ ‱, used as number modifiers.
  • Single, double, triple, and quadruple primes used to indicate number units, such as feet and inches in 5′2″, or angular arcminutes and arcseconds in 3°8′30″.

In Unicode, some of these characters have normal, small, and fullwidth forms that may be used in formatted text.

                         Normal         Small          Fullwidth
  Arabic percent sign    \x{066A} = ٪
  Number sign            \x{0023} = #   \x{FE5F} = ﹟   \x{FF03} = ＃
  Percent sign           \x{0025} = %   \x{FE6A} = ﹪   \x{FF05} = ％
  Per thousand sign      \x{2030} = ‰
  Per ten thousand sign  \x{2031} = ‱
  Prime                  \x{2032} = ′
  Double prime           \x{2033} = ″
  Triple prime           \x{2034} = ‴
  Quadruple prime        \x{2057} = ⁗

Remove other punctuation except the above:

// Remove other punctuation, except special cases.
$numbersign   = '\x{0023}\x{FE5F}\x{FF03}';
$percent      = '\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
$prime        = '\x{2032}\x{2033}\x{2034}\x{2057}';
$nummodifiers = $numbersign . $percent . $prime;
 
$text = preg_replace( '/\p{Po}(?<![' .
    $urlall . $numseparators . $specialquotes . $nummodifiers. '])/u',
    ' ', $text );

Limitations: Language-specific punctuation special cases are not handled. For instance, this code will remove punctuation used as tone marks in the Fraser alphabet used by Lisu language in parts of China, Myanmar, India, and Thailand. These expressions will also remove punctuation used to indicate pronunciation shown in dictionaries using the International Phonetic Alphabet (IPA).
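
The sketch below gathers the character sets from the earlier sections and applies this one expression to a made-up string. Note that the question mark survives this step because it is a URL character; it is removed by the web character expressions instead.

```php
// Character sets copied from the article's code.
$urlbrackets   = '\[\]\(\)';
$urlall        = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
$specialquotes = '\'"\*_<>';
$numseparators = '\x{002E}\x{FE52}\x{FF0E}' . '\x{002C}\x{FE50}\x{FF0C}' .
    '\x{066B}\x{066C}';
$numbersign    = '\x{0023}\x{FE5F}\x{FF03}';
$percent       = '\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
$prime         = '\x{2032}\x{2033}\x{2034}\x{2057}';
$nummodifiers  = $numbersign . $percent . $prime;

// Made-up sample: inverted question mark, dagger, bullet, and ellipsis
// are removed; ? % # and the primes are kept by this step.
$text = '¿Qué? 10% of #1 † • done… 5′2″';

$text = preg_replace( '/\p{Po}(?<![' .
    $urlall . $numseparators . $specialquotes . $nummodifiers . '])/u',
    ' ', $text );

// Collapse leftover spaces and trim the ends, for display only.
$text = trim( preg_replace( '/ +/', ' ', $text ) );

echo $text, "\n";   // Qué? 10% of #1 done 5′2″
```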

Removing consecutive spaces

Each of the above expressions replaces punctuation with spaces. This avoids joining together words adjacent to the punctuation, but it can leave multiple consecutive spaces. To clean up, remove them.

// Remove consecutive spaces
$text = preg_replace( '/ +/', ' ', $text);

Other issues

For this to work reliably:

  • Before removing punctuation, use the web page's content type to get its character set, then convert to UTF-8 using the iconv() function. This ensures that the text is in the UTF-8 character encoding. Running preg_replace() with the /u pattern modifier on text that is not valid UTF-8 can cause the function to fail and return NULL, which looks like an empty string when printed.
  • Strip HTML tags and decode HTML entities first. This gives you pure UTF-8 text with all HTML-specific punctuation already removed.
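
One way to guard against the encoding problem is to test the text before stripping it. The sketch below is a minimal example under stated assumptions: is_valid_utf8() is a hypothetical helper, not part of the article's download, and the ISO-8859-1 source encoding is assumed only for the demonstration.

```php
// Hypothetical helper: true if $text is valid UTF-8.
function is_valid_utf8( $text )
{
    // With the /u modifier, preg_match() returns false (not 0)
    // when the subject is not valid UTF-8.
    return preg_match( '//u', $text ) === 1;
}

$good = "caf\xC3\xA9";   // "café" encoded as UTF-8
$bad  = "caf\xE9";       // "café" encoded as ISO-8859-1

var_dump( is_valid_utf8( $good ) );   // bool(true)
var_dump( is_valid_utf8( $bad ) );    // bool(false)

// Convert only when needed; the source encoding is an assumption here.
if ( !is_valid_utf8( $bad ) )
    $bad = iconv( 'ISO-8859-1', 'UTF-8', $bad );

var_dump( $bad === $good );           // bool(true)
```

In real use the source encoding would come from the page's content type, as described above, rather than being hard-coded.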

Comments

Thanks

Thanks you very much for this function, it really helped me with doing a title -> friendly URL conversion for UTF-8 text

Regards.

This is great. Thanks for

This is great. Thanks for posting this.

Thanks a lot

Thanks a lot for providing such a nice and useful function.
Saved the day ;-)

thanx

great, just amazing!

I don't get it.

This takes out loads of "normal" characters which results in unreadable text.

If I use a simple string with no utf8 characters or any puctuation in it instead of a whole file:-

$description="This is a test with some description and no punctuation";
$description=strip_punctuation($description);

Results in this string:-
Th a t t w th om r t on an no un tuat on

Have you actually tested this? or am I missing something here?

Re: I don't get it

The function is correct. When I run your sample text through it, I get the identical text back again — as I should, since the text has no punctuation.

Have you configured PHP to use a default text encoding that is not UTF-8? If it isn't UTF-8, your innocent text string will be encoded with some other character encoding. When the strip_punctuation function uses UTF-8-specific regular expressions to find UTF-8 characters, they'll match unexpected characters in your non-UTF-8 text and produce odd results.

Excellent post, thanks!

Excellent post, thanks!

Thank you!

I appreciate this post, it is an excellent function. I'm pulling suggested keywords from content, this came in handy and didnt strip it all. Beer on me irl.

Extremely useful...

Wow, this is an excellent article with very useful information.

Thanks so much for posting this! It's saved me a lot of time!

Thanks!!!

Great article!!!

thanks!

thanks for this. all the encoding is a nightmare

Nadeau software consulting