2007

  • July 14, 2007

    PHP’s fopen wrappers enable the standard file functions to read web pages from a web server. A few additional calls are needed to set parameters for a web server request and to get the server’s HTTP response header. This tip shows how.
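
    For example, a minimal sketch (the URL and user agent are placeholders) that sets request parameters with a stream context and reads the response header from PHP’s $http_response_header variable:

        <?php
        // Set HTTP request parameters with a stream context.
        $options = array( 'http' => array(
            'method'     => 'GET',
            'user_agent' => 'MySpider/1.0',    // hypothetical user agent
            'timeout'    => 10.0 ) );
        $context = stream_context_create( $options );

        // The http:// fopen wrapper lets file_get_contents() read the page.
        $text = file_get_contents( 'http://www.example.com/', false, $context );

        // PHP fills $http_response_header with the response header lines.
        if ( $text !== false )
            print_r( $http_response_header );
        ?>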

  • June 30, 2007

    HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean parsable text. PHP’s standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how.
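
    As a minimal sketch, decoding a page’s entities into UTF-8 text looks something like this:

        <?php
        // Text containing named HTML entities.
        $html = 'Copyright &copy; 2007, sale price &euro;99';

        // Decode all entities into UTF-8 characters. The decoded text
        // should then be handled with multibyte string functions.
        $text = html_entity_decode( $html, ENT_QUOTES, 'UTF-8' );

        echo $text;    // Copyright © 2007, sale price €99
        ?>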

  • June 16, 2007

    A web page’s content type tells you the page’s MIME type (such as “text/html” or “image/png”) and the character set used by the page’s text. You’ll need the character set to interpret the page’s characters when processing text for a search engine or keyword extractor. The content type should be in the web server’s HTTP header for the page, but it also can be set in an HTML file’s <meta> tag or an XML file’s <?xml> tag. This tip shows how to get a page’s content type and extract the MIME type and character set.
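
    As a sketch, splitting a Content-Type value (here hard-coded as an example) into a MIME type and character set might look like this:

        <?php
        // A typical Content-Type value from an HTTP header or <meta> tag.
        $contentType = 'text/html; charset=ISO-8859-1';

        // The MIME type comes first; parameters such as charset
        // follow after semicolons.
        $parts    = explode( ';', $contentType );
        $mimeType = strtolower( trim( $parts[0] ) );
        $charset  = '';
        foreach ( array_slice( $parts, 1 ) as $param )
        {
            $pair = explode( '=', $param, 2 );
            if ( count( $pair ) == 2 &&
                 strtolower( trim( $pair[0] ) ) == 'charset' )
                $charset = trim( $pair[1], " \t\"'" );
        }

        echo "MIME type: $mimeType, character set: $charset";
        ?>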

  • June 10, 2007

    The first step when building a PHP search engine, link checker, or keyword extractor is to get the web page from the web server. There are several ways to do this. From PHP 4 onwards, the most flexible way uses PHP’s CURL (Client URL) functions. This tip shows how.
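
    As a minimal sketch (the URL is a placeholder), fetching a page with the CURL functions looks like this:

        <?php
        // Initialize a CURL session for the page.
        $ch = curl_init( 'http://www.example.com/' );

        // Return the page as a string instead of printing it, and
        // follow any redirects the server sends.
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );

        $text = curl_exec( $ch );
        if ( $text === false )
            echo 'Error: ' . curl_error( $ch );
        curl_close( $ch );
        ?>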

  • May 15, 2007

    Publishing an email address on a web page invites more spam. Protect your address by masking it from the email harvesters (spambots) used by spammers. This article tests 50 masking methods against 23 harvesters to see which methods stop spammers and which do not.
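
    For example, one common masking method replaces each character of the address with its numeric HTML entity; a browser renders the entities normally, but a naive harvester scanning for plain-text addresses may miss them. A minimal sketch (the address is a placeholder):

        <?php
        // Encode every character of an address as a numeric HTML entity.
        function maskAddress( $address )
        {
            $masked = '';
            for ( $i = 0; $i < strlen( $address ); $i++ )
                $masked .= '&#' . ord( $address[$i] ) . ';';
            return $masked;
        }

        echo maskAddress( 'user@example.com' );
        ?>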

  • May 12, 2007

    Instead of publishing your email address to a web page, where it can be harvested by spammers, provide a contact form. Filling out the form sends you email without showing your email address to the site visitor (or spammer). To block automated programs from filling out the form, add a CAPTCHA challenge to detect human visitors. Site visitors will still be able to contact you, but spammers will be blocked.
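
    A minimal sketch of the idea (the recipient address is a placeholder, the challenge is simple arithmetic, and a real form needs far more validation):

        <?php
        // Contact form with an arithmetic CAPTCHA. The recipient
        // address stays server-side and never reaches the browser.
        session_start();

        if ( isset( $_POST['message'] ) )
        {
            if ( isset( $_POST['captcha'], $_SESSION['answer'] ) &&
                 trim( $_POST['captcha'] ) == $_SESSION['answer'] )
            {
                mail( 'you@example.com', 'Contact form message',
                      strip_tags( $_POST['message'] ) );
                echo 'Thanks, your message was sent.';
            }
            else
                echo 'Wrong answer; please try again.';
        }

        // Generate a new challenge for the form.
        $a = rand( 1, 9 );
        $b = rand( 1, 9 );
        $_SESSION['answer'] = $a + $b;
        ?>
        <form method="post">
            <textarea name="message"></textarea><br>
            What is <?php echo "$a + $b"; ?>? <input name="captcha"><br>
            <input type="submit" value="Send">
        </form>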

  • May 10, 2007

    Legitimate web site visitors are there to read your content, but spammers only visit to run email harvesters (spambots) that scan your web pages for email addresses. To protect your addresses, and avoid wasting network bandwidth talking to spammers, change your web server configuration to block spammer access. Blacklist spammer IP addresses, block access from known harvester spiders, or require visitors to log in. Some of the methods tested in this article were successful at blocking email harvesters.
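
    The same checks can be sketched in PHP at the top of each page (the blacklists below are illustrative placeholders; real lists are much longer, and a server configuration file does the job more efficiently):

        <?php
        // Deny access to blacklisted IP addresses and to requests
        // whose user agent matches a known harvester name.
        $badAddresses = array( '192.0.2.1', '192.0.2.2' );
        $badAgents    = array( 'EmailSiphon', 'EmailWolf', 'ExtractorPro' );

        $deny  = in_array( $_SERVER['REMOTE_ADDR'], $badAddresses );
        $agent = isset( $_SERVER['HTTP_USER_AGENT'] ) ?
                 $_SERVER['HTTP_USER_AGENT'] : '';
        foreach ( $badAgents as $bad )
            if ( stripos( $agent, $bad ) !== false )
                $deny = true;

        if ( $deny )
        {
            header( 'HTTP/1.0 403 Forbidden' );
            exit( 'Access denied.' );
        }
        ?>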

  • May 8, 2007

    A spammer’s email harvester is a web spider that crawls through the pages of your site looking for email addresses. To protect your addresses, hide the pages that contain them. Use a robots.txt file or <meta> tags to stop well-behaved harvesters (are there any?), and hidden links, redirects, forms, and frames to try to stop the rest. The email harvesters tested in this article were stopped by some of these tricks, but not by others.
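
    For example, a robots.txt file that asks well-behaved spiders to skip a directory of contact pages (the path is a placeholder):

        User-agent: *
        Disallow: /contact/

    The page-level equivalent is a <meta name="robots" content="noindex, nofollow"> tag in the page’s <head>, though a harvester is free to ignore both.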

  • May 7, 2007

    Spammers use email harvesters (spambots) to scan the text of your web pages looking for email addresses. Protect those addresses by replacing the text address with an image or Flash animation that draws the email address. None of the harvesters tested in this article could read addresses drawn with images or Flash.
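
    As a sketch, PHP’s GD functions can draw the address into a PNG (the address is a placeholder); save the script as, say, email_image.php and reference it with an <img> tag:

        <?php
        // Draw an email address into a PNG image with GD.
        $address = 'user@example.com';

        $image = imagecreate( 8 * strlen( $address ) + 10, 20 );
        $white = imagecolorallocate( $image, 255, 255, 255 );  // background
        $black = imagecolorallocate( $image, 0, 0, 0 );
        imagestring( $image, 4, 5, 2, $address, $black );

        header( 'Content-Type: image/png' );
        imagepng( $image );
        imagedestroy( $image );
        ?>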

  • May 4, 2007

    Email harvesters (spambots) scan your web pages for email addresses to add to spam mailing lists. Keep your address away from them by using JavaScript or CSS to insert your address after the web page has loaded into a visitor’s web browser. The harvester tests reported in this article show that harvesters do not run JavaScript or handle CSS styling, so they won’t find your address.
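
    As a sketch, a PHP page can emit a short script that assembles the address in the visitor’s browser (the address is a placeholder); a harvester that does not run JavaScript never sees it:

        <?php
        // Split the address so it never appears whole in the page
        // source, then emit JavaScript that reassembles it.
        $user = 'user';
        $host = 'example.com';

        echo "<script type=\"text/javascript\">\n",
             "var u = '$user', h = '$host';\n",
             "document.write( '<a href=\"mailto:' + u + '@' + h + '\">'",
             " + u + '@' + h + '</a>' );\n",
             "</script>\n";
        ?>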
