Text processing

  • May 27, 2008

    Splitting apart and rebuilding URLs is essential for link checkers, phishing detectors, spiders, and so on. PHP's standard parse_url() function parses simple URLs well, but it has problems with complex and relative URLs. Once a URL is split apart, there is no standard PHP function to reassemble it properly. This article reviews the official syntax of URLs, discusses URL parsing complexities, and provides new PHP functions to split apart a URL and join its parts together again.

  • September 1, 2007

    The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags() function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how.

  • October 13, 2007

    Numbers in prices, quantities, dates, times, phone numbers, and addresses may not be of interest when processing a web page for a PHP search engine or keyword analysis tool. In international text there are around 900 different digits, currency symbols, and unit-of-measure marks that need to be removed. This tip shows how to remove numbers and number-related characters.

  • September 15, 2007

    When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal indexable text. In international text there are hundreds of these characters, and some should be removed in one context, but not in another. This tip shows how.

  • September 29, 2007

    Most symbol characters, like + = © ™ ← → ☺ ♣ ♠, need to be stripped out of web page text before processing it in a search engine or text analysis tool. In international text there are thousands of symbol characters, and some should be removed in one context but not in another. This tip shows how.
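
The tag, number, punctuation, and symbol stripping steps summarized above can be sketched in PHP roughly as follows. This is a minimal illustration under stated assumptions, not the code from the linked tips: the function names sketch_strip_html() and sketch_strip_non_words() are hypothetical, the input is assumed to be valid UTF-8, and the Unicode property classes used here remove every number, punctuation mark, and symbol without the context-sensitive exceptions the tips discuss.

```php
<?php
// Hypothetical helper names for illustration only.

/**
 * Remove non-text elements, then strip the remaining tags.
 */
function sketch_strip_html(string $html): string
{
    // Drop whole elements (tags plus content) that strip_tags() would
    // otherwise leave behind as raw CSS, JavaScript, or object markup.
    $html = preg_replace(
        '@<(script|style|object|embed|noscript)\b[^>]*>.*?</\1\s*>@isu',
        ' ',
        $html
    ) ?? '';

    // Put a space before every tag so that words in adjacent elements
    // do not fuse together once the tags are gone.
    $html = str_replace('<', ' <', $html);

    // Strip the remaining tags and decode entities such as &amp;.
    $text = strip_tags($html);
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

/**
 * Remove numbers, punctuation, symbols, and control characters.
 */
function sketch_strip_non_words(string $text): string
{
    // \p{N} = numbers, \p{P} = punctuation, \p{S} = symbols
    // (math, currency, and so on), \p{C} = control and format characters.
    $text = preg_replace('/[\p{N}\p{P}\p{S}\p{C}]+/u', ' ', $text) ?? '';

    // Collapse runs of whitespace and Unicode separators to one space.
    $text = preg_replace('/[\s\p{Z}]+/u', ' ', $text) ?? '';
    return trim($text);
}

$page = '<html><head><style>p { color: red; }</style>'
      . '<script>var x = 1;</script></head>'
      . '<body><p>Price: $25.00 © 2007</p><p>Hello, world!</p></body></html>';

echo sketch_strip_non_words(sketch_strip_html($page)), "\n";
// Price Hello world
```

Removing the script and style elements before calling strip_tags() matters because strip_tags() deletes only the tags themselves and would leave the CSS and JavaScript in the output as if it were page text.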

Nadeau software consulting