A spammer’s email harvester is a web spider that crawls through the pages of your site looking for email addresses. To protect your addresses, hide the pages that contain them. Use a robots.txt file or <meta> tags to stop well-behaved harvesters (are there any?), and hidden links, redirects, forms, and frames to try to stop the rest. The email harvesters tested in this article were stopped by some of these tricks, but not by others.
Table of Contents
- How to hide a web page from email harvesters
- Use “robots.txt” for the web site
- Use meta tags on pages
- Use “nofollow” on links
- Use Flash to link to a hidden page
- Use a form to link to a hidden page
- Embed a page within a frame
- Redirect to a new page
- Further reading
This article is part of a series, “Effective methods to protect email addresses from spammers”, that compares and tests 50 ways to protect email addresses published on a web site.
How to hide a web page from email harvesters
A web spider (“robot”, “bot”) is an automatic program that wanders through your web site looking at each of its pages. Search engines use web spiders to find pages to add to their search indexes. Spammers use spiders to harvest email addresses from your pages. Some of the same techniques used to block search engines will work to block harvesters as well.
To find the pages at your site, a spammer’s harvester (“spam robot” or “spambot”) first reads your home page, or a page found using a search engine. After extracting any email addresses that the page contains, it looks at each of the page’s links and reads those pages. Then it follows their links, and so on, page after page. To hide a page containing email addresses (or anything else), make all of the links to that page difficult for a harvester spider to follow. And if a harvester can’t find the page, it can’t add the addresses to its mailing lists and you’ll get less spam.
Be sure to hide all links to a hidden page. Watch particularly for automatically generated site maps, page listings, directory listings, and RSS feeds. If there is an unhidden link anywhere, a web spider will find it and get to the hidden page.
Below I discuss several methods to hide pages from a search engine’s spider or a spammer’s spambot. After this list, I report the results of running these methods past a collection of email harvesters to see which methods were effective at hiding web pages containing email addresses, and which were not.
Use “robots.txt” for the web site
A web site’s “robots.txt” file tells web spiders which parts of the site are available to spiders, and which are not. Since email harvesters are spiders, they should honor this file. That’s probably optimistic, but you can try it anyway.
If you are using a content management system, such as Drupal or WordPress, you probably already have this file in the root folder of your site. You can edit it with a text editor and add “Disallow” lines for each page you want to hide from a spider or harvester. For instance, if you want to protect your site’s contact page containing your email addresses, disallow spider access to that page.
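As a rough sketch of what such a “Disallow” line looks like, and how a spider that honors it would react, the Python below parses a small robots.txt with the standard library’s robot parser. The page name “/contact.html”, the host name, and the agent name “ExampleBot” are assumptions for illustration, not values from this article’s tests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. "/contact.html" is an assumed page name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path, agent="ExampleBot"):
    """Return True if a spider that honors robots.txt may request the path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(may_fetch("/contact.html"))  # the disallowed contact page
print(may_fetch("/index.html"))    # any other page
```

A spider only gets this answer if it bothers to ask. Nothing in robots.txt prevents a harvester from requesting the page anyway.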
Use meta tags on pages
An HTML <meta> tag adds special notes to a page. The “robots” note tells spiders whether they should index the page and follow its links. Email harvesters and other spiders are supposed to honor this.
The meta tag for robots belongs in the <head> part of a page. The tag’s “name” attribute is “robots” and its “content” is one or more words separated by commas. Include the “noindex” word and spiders should ignore the page’s content, excluding it from search indexes and (in principle) harvested email address lists. Add the “nofollow” word and spiders and harvesters should not follow the page’s links.
For example, add <meta name="robots" content="noindex"> to the <head> of pages that you do not want indexed by a spider or spambot, such as a page of email addresses.
Add <meta name="robots" content="nofollow"> to the <head> of any page containing links to a page that you don’t want a spider or spambot to find.
When using “nofollow”, all links on the page are affected. Using this tag on a home page or site map would hide most or all of your web site from harvesters and from search engine spiders. Unless you want to hide your site from search engines, use this tag only on selected internal pages and not on the home page. For finer control of which links to block, and which to allow, instead use the “nofollow” attribute on individual links, discussed next.
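To make the tag concrete, here is a minimal sketch of how a well-behaved spider might read the robots meta tag before deciding what to do with a page. The sample page is hypothetical, and the parser is only an illustration of the convention, not the code of any real spider or harvester.

```python
from html.parser import HTMLParser

# Hypothetical page that asks spiders not to index it or follow its links.
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Contact</title>
</head><body>...</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated words from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for word in content.split(","):
            if word.strip():
                self.directives.add(word.strip().lower())


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


directives = robots_directives(PAGE)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```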
Use “nofollow” on links
Every link on a page provides a path for a spider or spambot to follow and get to another page. You can mark links that should not be followed by adding “nofollow” to the link’s “rel” attribute. Well-behaved spiders, like those for search engines, will honor the attribute. Email harvesters may or may not honor it.
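The sketch below shows one way a polite spider could separate links it may follow from links marked “nofollow”. The two links and their paths are made up for the example, and the collector is an assumption about how such a spider might be written.

```python
from html.parser import HTMLParser

# Hypothetical links: the second one is marked rel="nofollow".
PAGE = """
<a href="/about.html">About us</a>
<a href="/contact.html" rel="nofollow">Contact us</a>
"""


class PoliteLinkCollector(HTMLParser):
    """Sorts <a href> targets by whether the link carries rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followable = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # "rel" may hold several space-separated words.
        rel_words = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_words:
            self.skipped.append(href)
        else:
            self.followable.append(href)


collector = PoliteLinkCollector()
collector.feed(PAGE)
print("follow:", collector.followable)
print("skip:", collector.skipped)
```

A harvester that leaves out the “nofollow” check simply follows both links, which is why this method depends entirely on the spider’s good manners.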
Use Flash to link to a hidden page
Adobe’s Flash plugin can show a Flash object that embeds a link to another page. Site visitors can click on the object to get the page, but web spiders and harvesters can’t follow the link.
The Flash plugin is free and may be downloaded from Adobe’s web site. It’s available for most operating systems and most web browsers. Site visitors must download and install the plugin or your link will not be visible or clickable, but most visitors already have the plugin.
Adobe’s Dreamweaver web authoring application can build “Flash Text” links like this with a few mouse clicks. In Dreamweaver’s menus, select Insert > Media > Flash Text and fill out the dialog box.
Use a form to link to a hidden page
Clicking on a form button advances the visitor to another web page, similar to following a page link. While the intent of forms is to collect information from a visitor (name, address, credit card number, etc.), you can create a valid form that collects nothing. The form button is just a fancy link.
Web spiders and spambots may ignore pages that are only reachable via a form button. Such hidden pages may be a safe place for protected email addresses. Site visitors can still reach the page by clicking on the form button.
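As an illustration, the hypothetical page below contains one ordinary link and one form that collects nothing and exists only to lead to a hidden page. The Python around it sketches why a spider that looks only at <a> links never sees the form’s target. The paths and button label are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical page: an ordinary link, plus a form with no input fields
# other than the button. Submitting it just requests the "action" page.
PAGE = """
<a href="/about.html">About us</a>
<form action="/hidden/contact.html" method="get">
  <input type="submit" value="Contact us">
</form>
"""


class TargetCollector(HTMLParser):
    """Records link targets and form targets separately."""

    def __init__(self):
        super().__init__()
        self.link_targets = []
        self.form_targets = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.link_targets.append(attr["href"])
        elif tag == "form" and attr.get("action"):
            self.form_targets.append(attr["action"])


collector = TargetCollector()
collector.feed(PAGE)
print("seen by a links-only spider:", collector.link_targets)
print("seen only if forms are read:", collector.form_targets)
```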
Embed a page within a frame
A “frame” incorporates a second web page into the body of a main page. There are two types of frames: <frame> and <iframe>.
A <frame> tag is used as part of a <frameset> tag, which replaces the <body> of a page with a collection of adjacent embedded pages. You’ve probably seen pages like this where a menu remains stationary on the side of the window while a center region contains a page you can scroll up and down. Each <frame> tag in a <frameset> gives the URL of an embedded page (such as one for the stationary menu and another for the center document). If a web spider or spammer harvester does not recognize <frame> tags, it will not find the embedded pages, making those pages a good place to hide email addresses. All current web browsers support frames, so site visitors will see the embedded page.
A <frame> and <frameset> take over the entire web page. By contrast, an <iframe> tag takes over only part of a page. Otherwise, it works about the same as a <frame>. All current web browsers support <iframe> tags.
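Both kinds of frame name the embedded page in a “src” attribute. The sketch below uses two hypothetical pages, one with a <frameset> and one with an <iframe>, and shows that finding the embedded pages takes nothing more than reading that attribute. All file names and sizes are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical frameset page: a menu on the left, the hidden page beside it.
FRAMESET_PAGE = """<html>
<frameset cols="20%,80%">
  <frame src="/menu.html">
  <frame src="/hidden/contact.html">
</frameset>
</html>"""

# Hypothetical ordinary page that embeds the hidden page in part of its body.
IFRAME_PAGE = """<html><body>
<p>Our contact details:</p>
<iframe src="/hidden/contact.html" width="400" height="200"></iframe>
</body></html>"""


class FrameCollector(HTMLParser):
    """Collects the "src" of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def embedded_pages(html):
    collector = FrameCollector()
    collector.feed(html)
    return collector.sources


print(embedded_pages(FRAMESET_PAGE))
print(embedded_pages(IFRAME_PAGE))
```

The page stays hidden only from spiders that skip these tags, so the protection rests on how little effort the harvester’s author put in.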
Redirect to a new page
To follow a web page link, a browser, spider, or spambot asks a web server for the linked-to page. Servers usually respond with the page’s text, but they can also respond with a “redirect” if the page has moved to a new location. For a redirect, the browser then asks for the redirected-to page. While browsers and web spiders follow redirects automatically, harvesters might not. The redirected-to page may be hidden from the harvester and a safe place to hide email addresses.
You can redirect to a “mailto” link. Browsers treat this the same as a “mailto” on the original page. Or you can redirect to a web page.
Either kind of redirect can come from a PHP script on the server that responds with a “Location” header naming the destination. You also can use Perl, ASP, ColdFusion, etc. to do the same thing. Examples are available from James Thornton in his article Redirect mailto: for Spam Prevention. You will need the appropriate scripting engine enabled on your web server.
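The article’s own examples are PHP scripts. As a stand-in, here is the same idea sketched in Python as a small WSGI application: the server answers a request for a public path with a “302” status and a “Location” header instead of a page. The paths and the address in the table are hypothetical, and this is not the script that was used in the tests.

```python
# Hypothetical mapping from the link a visitor clicks to the real destination.
REDIRECTS = {
    "/contact": "/hidden/contact.html",      # redirect to a web page
    "/email": "mailto:someone@example.com",  # redirect to a "mailto" link
}


def redirect_app(environ, start_response):
    """WSGI app that answers known paths with a redirect, others with 404."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    start_response("302 Found", [("Location", target), ("Content-Length", "0")])
    return [b""]


def respond(path):
    """Call the app directly, without a network, and return (status, headers)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    redirect_app({"PATH_INFO": path}, start_response)
    return captured["status"], captured["headers"]


print(respond("/contact"))
print(respond("/email"))
```

To try it in a browser, the app can be served with the standard library, for example wsgiref.simple_server.make_server("", 8000, redirect_app).serve_forever(), where port 8000 is an arbitrary choice.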
Instead of using a PHP script, you can use an Apache web server “Redirect” directive with the “mod_alias” module. The directive’s first argument is the name of the page to redirect away from, and the second argument is the destination of the redirect. The directive can go in the server’s main configuration file, or in a “.htaccess” file in your web site.
Instead of using an Apache directive or server-side scripting, you can add an HTML meta refresh tag to the top of a page to redirect that page to a new page. With this method, a link on a page leads to a blank intermediate page that contains this meta tag in its header. Browsers immediately recognize the tag and redirect to the next page. Harvesters may not follow the meta redirect.
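A sketch of such an intermediate page follows, along with the small amount of parsing needed to find where it leads. The page and its destination are hypothetical. The tag’s “content” holds a delay in seconds, a semicolon, and then “url=” followed by the destination.

```python
from html.parser import HTMLParser

# Hypothetical intermediate page: blank except for an immediate meta refresh.
PAGE = """<html><head>
<meta http-equiv="refresh" content="0; url=/hidden/contact.html">
</head><body></body></html>"""


class RefreshParser(HTMLParser):
    """Finds the destination of a <meta http-equiv="refresh"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # Split "0; url=/page.html" into the delay and the rest.
        _delay, _sep, rest = (attr.get("content") or "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.target = rest[4:].strip().strip("'\"")


def refresh_target(html):
    parser = RefreshParser()
    parser.feed(html)
    return parser.target


print(refresh_target(PAGE))
```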
I tested 23 widely-available email harvesters to see how well these methods work to protect email addresses on hidden pages. Each email harvester was aimed at an Apache web server with a set of test pages that used these methods to hide links or pages. The table below gives, for each method, the number of harvesters that found the hidden page with a protected email address.
All of the harvesters were tested on Windows XP SP2. The names of the email harvesters are intentionally left off so that this page does not attract searches from spammers looking for the “best” harvester to download.
|Method||Harvesters that found the hidden page (of 23)|
|Plain email address||23|
|Use “robots.txt” for the web site||23|
|Use meta tags on pages - nofollow||23|
|Use meta tags on pages - noindex||23|
|Use “nofollow” on links||23|
|Use Flash to link to a hidden page||0|
|Use a form to link to a hidden page||5|
|Embed a page within a frame||12|
|Embed a page within an iframe||9|
|Redirect to a “mailto” link - PHP||5|
|Redirect to a web page - PHP||18|
|Redirect to a web page - Apache||22|
|Redirect to a web page - Meta refresh tag||2|
Every spammer email harvester found the plain email address that was not protected.
Unsurprisingly, none of the email harvesters honored web spider directions in the site’s “robots.txt” file, in page <meta> tags, or in links with “nofollow” added. The email addresses in these supposedly hidden pages were all found.
Links in frames were followed by about half of the email harvesters (12 of 23 for <frame> and 9 of 23 for <iframe>), and links in forms by 5 of 23, leading them to the hidden pages of email addresses.
PHP and Apache redirects to a web page were followed by nearly all of the email harvesters. Two spambots followed redirects from a <meta> tag refresh, and five recognized a redirect to a “mailto” link.
Web spider directions in “robots.txt”, <meta> tags, and links are not an effective way to stop email harvesters and reduce spam. Spammers just ignore these conventions and harvest the pages anyway.
Redirect methods to stop spammers are not effective. Redirects are such a common feature of the web that most email harvesters and web spiders handle them. Pages hidden behind a redirect are not safe.
Hiding pages behind links in forms and frames is not effective. These are standard ways to link to additional web pages, and spammer harvesters and spiders follow them.
Flash links are effective at hiding pages from spammer email harvesters, but visitors must have the Flash plugin installed. Most visitors do, but some visitors block Flash animations as a way of reducing the number of web page ads that blink at them. These visitors won’t be able to follow the link to your hidden web page. Also, the screen readers used by the visually impaired cannot read the text in a Flash link. Flash links have poor usability and accessibility.
A weak spot for all of these methods is that you must protect all links to a hidden page. This may be hard to do when links are automatically created for site maps, RSS feeds, and article lists. If there is an unprotected link to a hidden page posted anywhere on your site, or at any other web site, an email harvester can get through.