Most symbol characters, like + = © ™ ← → ☺ ♣ ♠, need to be stripped out of web page text before processing it in a search engine or text analysis tool. International text contains thousands of symbol characters, and some should be removed in one context but kept in another. This tip shows how.
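As a rough illustration of the idea (not the tip's own code), the sketch below uses PCRE's Unicode symbol class `\p{S}` to replace symbol characters with spaces. The function name `strip_symbols_sketch` is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original tip): replaces every Unicode
// symbol character with a space. Assumes $text is valid UTF-8.
function strip_symbols_sketch(string $text): string
{
    // \p{S} covers math, currency, modifier, and other symbols;
    // the /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $text = preg_replace('/\p{S}+/u', ' ', $text) ?? $text;

    // Collapse the runs of whitespace left behind by the removals.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;

    return trim($text);
}

echo strip_symbols_sketch("Acme™ © 2008 → 2 + 2 = 4"), "\n";
```

This removes every symbol unconditionally. A context-sensitive version would need exceptions, for example keeping "+" in "C++" or a currency sign in front of a price.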
When processing text for a search engine or analysis tool, code needs to strip out punctuation, formatting, spacing, and control characters to reveal indexable text. International text contains hundreds of these characters, and some should be removed in one context but kept in another. This tip shows how.
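A minimal sketch of this kind of cleanup, assuming UTF-8 input, uses the Unicode classes `\p{P}` (punctuation), `\p{C}` (control and formatting characters), and `\p{Z}` (separators). The function name `strip_punctuation_sketch` is made up for illustration and is not the tip's own code.

```php
<?php
// Hypothetical helper (not from the original tip): replaces punctuation and
// control/format characters with spaces, then normalizes spacing.
// Assumes $text is valid UTF-8.
function strip_punctuation_sketch(string $text): string
{
    // \p{P} = all punctuation, \p{C} = control, format, and other
    // non-printing characters (this includes tabs and newlines).
    $text = preg_replace('/[\p{P}\p{C}]+/u', ' ', $text) ?? $text;

    // \p{Z} = Unicode space, line, and paragraph separators.
    $text = preg_replace('/[\p{Z}\s]+/u', ' ', $text) ?? $text;

    return trim($text);
}

echo strip_punctuation_sketch("Hello, world! It's 3.14 (approx.)"), "\n";
```

Because this version strips everything, "It's" becomes "It s" and "3.14" becomes "3 14". That is the context problem the tip refers to: apostrophes inside words and periods inside numbers usually need to be kept.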
The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags() function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how.
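The two-step approach described above could look roughly like the following sketch. The function name `strip_html_sketch` and the list of removed elements are assumptions for illustration. A regular expression is also only an approximation of real HTML parsing.

```php
<?php
// Hypothetical helper (not from the original tip): removes elements whose
// content is code rather than page text, then strips the remaining tags.
function strip_html_sketch(string $html): string
{
    // Step 1: drop whole elements, content included. The element list is an
    // assumption; extend it as needed.
    $html = preg_replace(
        '@<(script|style|object|embed|applet|noscript)\b[^>]*>.*?</\1\s*>@is',
        ' ',
        $html
    ) ?? $html;

    // Step 2: put a space before each tag so words separated only by tags
    // (such as adjacent paragraphs) do not run together.
    $html = str_replace('<', ' <', $html);

    // Step 3: strip the remaining tags and normalize whitespace.
    $text = strip_tags($html);
    $text = preg_replace('/\s+/', ' ', $text) ?? $text;

    return trim($text);
}

$page = '<p>Hello<script>var x = "<b>";</script> <b>world</b></p>'
      . '<style>p { color: red }</style>';
echo strip_html_sketch($page), "\n";
```

Without step 1, strip_tags() would remove the `<script>` and `<style>` tags but leave their contents behind as if they were page text.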
Most Drupal web sites have a set of blocks that line the left or right sides of their web pages. Typical blocks are menus, lists of recent posts, and forms for logging in and searching. Every block adds to the work Drupal must do to assemble a page, but some blocks are particularly slow. To speed up your site, install the Block Cache module to create cached versions of your slowest blocks. This article benchmarks the impact of block caching for 29 common blocks.
Drupal blocks provide secondary content that often lines the left and right sides of Drupal web pages. Typical blocks are menus, lists of recent posts, and forms for logging in and searching. But every block on a page adds to the work Drupal must do to assemble that page, slowing down your web site. Speed it up by disabling the blocks that have the biggest performance impact. This article benchmarks 32 common blocks and concludes with a few guidelines on what to watch out for when selecting blocks for your site.
PHP’s fopen wrappers enable the standard file functions to read web pages from a web server. A few additional calls are needed to set parameters for a web server request and to get the server’s HTTP response header. This tip shows how.
HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean parsable text. PHP’s standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.
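As a small sketch of the decoding step, the wrapper below calls html_entity_decode() with an explicit UTF-8 target encoding. The wrapper name `decode_entities_sketch` is hypothetical, and the flag choice (`ENT_QUOTES | ENT_HTML5`, available from PHP 5.4) is an assumption rather than the tip's own setting.

```php
<?php
// Hypothetical wrapper (not from the original tip): decodes named and
// numeric HTML entities into UTF-8 characters.
function decode_entities_sketch(string $text): string
{
    // ENT_QUOTES also decodes &quot; and &#039;.
    // ENT_HTML5 selects the HTML5 named entity table.
    // 'UTF-8' lets entities such as &euro; decode to a real character.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_entities_sketch('&euro;5 &copy; 2008 &amp; &#8364;'), "\n";
```

The decoded "€" is a single character but three bytes in UTF-8, so later processing should use multibyte-aware functions (the `mb_*` family, or PCRE with the `/u` modifier) rather than byte-oriented ones.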
A web page’s content type tells you the page’s MIME type (such as “text/html” or “image/png”) and the character set used by page text. You’ll need the character set to interpret the page’s characters for text processing for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it can also be set in an HTML file’s <meta> tag, or an XML file’s <?xml> tag. This tip shows how to get the page’s content type and extract the MIME type and character set.
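The extraction step can be sketched as follows for a content type value that has already been obtained, such as the value of an HTTP Content-Type header. The function name `parse_content_type_sketch` and its return format are assumptions for illustration. Reading the value from a `<meta>` or `<?xml>` tag is not covered here.

```php
<?php
// Hypothetical helper (not from the original tip): splits a content type
// value such as "text/html; charset=UTF-8" into MIME type and character set.
function parse_content_type_sketch(string $value): array
{
    $mime = null;
    $charset = null;

    // The MIME type is the leading "type/subtype" token.
    if (preg_match('@^\s*([\w.+-]+/[\w.+-]+)@', $value, $m)) {
        $mime = strtolower($m[1]);
    }

    // The character set, if present, follows "charset=" and may be quoted.
    if (preg_match('@charset\s*=\s*"?([\w.:-]+)"?@i', $value, $m)) {
        $charset = strtoupper($m[1]);
    }

    return ['mime' => $mime, 'charset' => $charset];
}

print_r(parse_content_type_sketch('text/html; charset=utf-8'));
```

A missing character set is returned as null, so the caller can decide on a fallback instead of having one guessed here.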
The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s cURL (Client URL) functions. This tip shows how.
Publishing an email address on a web page invites more spam. Protect your address by masking it from the email harvesters (spambots) used by spammers. This article tests 50 masking methods against 23 harvesters to see which methods work to stop spammers, and which do not.