Stop spammer email harvesters by hiding web pages from the harvesters

A spammer’s email harvester is a web spider that crawls through the pages of your site looking for email addresses. To protect your addresses, hide the pages that contain them. Use a robots.txt file or <meta> tags to stop well-behaved harvesters (are there any?), and hidden links, redirects, forms, and frames to try to stop the rest. The email harvesters tested in this article were stopped by some of these tricks, but not by others.

This article is part of a series on Effective methods to protect email addresses from spammers that compares and tests 50 ways to protect email addresses published on a web site.

How to hide a web page from email harvesters

A web spider (“robot”, “bot”) is an automatic program that wanders through your web site looking at each of its pages. Search engines use web spiders to find pages to add to their search indexes. Spammers use spiders to harvest email addresses from your pages. Some of the same techniques used to block search engines will work to block harvesters as well.

To find the pages at your site, a spammer’s harvester (“spam robot” or “spambot”) first reads your home page, or a page found using a search engine. After extracting any email addresses that the page contains, it looks at each of the page’s links and reads those pages. Then it follows their links, and so on, page after page. To hide a page containing email addresses (or anything else), make all of the links to that page difficult for a harvester spider to follow. And if a harvester can’t find the page, it can’t add the addresses to its mailing lists and you’ll get less spam.
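The crawl described above can be sketched in a few lines of JavaScript. This is only an illustration under simplifying assumptions: the `site` object is a hypothetical in-memory stand-in for a web server, and the regular expressions are far cruder than anything a real harvester would use.

```javascript
// Hypothetical in-memory "web site": URL -> HTML. Not a real server.
const site = {
  "/index.html": '<a href="/about.html">About</a>',
  "/about.html": 'Write to <a href="mailto:info@example.com">info@example.com</a>',
  "/contact_page.html": "sales@example.com", // no plain link points here
};

// Crude patterns, for illustration only.
const EMAIL = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g;
const HREF = /<a\s[^>]*href="([^"]+)"/gi;

// Breadth-first crawl: read a page, collect addresses, queue its links.
function harvest(pages, startUrl) {
  const found = new Set();
  const seen = new Set();
  const queue = [startUrl];
  while (queue.length > 0) {
    const url = queue.shift();
    if (seen.has(url) || !(url in pages)) continue; // visited, or not a page
    seen.add(url);
    const html = pages[url];
    for (const m of html.matchAll(EMAIL)) found.add(m[0]);
    for (const m of html.matchAll(HREF)) queue.push(m[1]);
  }
  return [...found];
}

const harvested = harvest(site, "/index.html");
console.log(harvested); // [ 'info@example.com' ]
```

The address on `/contact_page.html` is never harvested because no followable link leads to that page.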

Be sure to hide all links to a hidden page. Watch particularly for automatically generated site maps, page listings, directory listings, and RSS feeds. If there is an unhidden link anywhere, a web spider will find it and get to the hidden page.
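One way to check for stray links is to scan generated files for a plain reference to the hidden page. The sketch below assumes a hypothetical set of files held in memory (`generated`) and only looks for two simple patterns, an `href="..."` attribute and an RSS-style `<link>` element, so treat it as a starting point rather than a complete audit.

```javascript
// Hypothetical generated files that might link to the hidden page.
const generated = {
  "/sitemap.html": '<a href="/index.html">Home</a> <a href="/contact_page.html">Contact</a>',
  "/feed.xml": "<link>/news.html</link>",
  "/index.html": '<a href="javascript:window.location=\'contact_page.html\';">Contact</a>',
};

// Report every file that contains a plain href or <link> to the hidden page.
function findPlainLinks(files, hiddenPath) {
  // Escape regular expression metacharacters in the path.
  const escaped = hiddenPath.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const plain = new RegExp(`(?:href="|<link>)${escaped}["<]`, "i");
  return Object.keys(files).filter((name) => plain.test(files[name]));
}

const leaks = findPlainLinks(generated, "/contact_page.html");
console.log(leaks); // [ '/sitemap.html' ]
```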

Below I discuss several methods to hide pages from a search engine’s spider or a spammer’s spambot. After this list, I report the results of running these methods past a collection of email harvesters to see which methods were effective at hiding web pages containing email addresses, and which were not.

Use “robots.txt” for the web site

A web site’s “robots.txt” file tells web spiders which parts of the site are available to spiders, and which are not. Since email harvesters are spiders, they should honor this file. That’s probably optimistic, but you can try it anyway.

If you are using a content management system, such as Drupal or WordPress, you probably already have this file in the root folder of your site. You can edit it with a text editor and add “Disallow” lines for each page you want to hide from a spider or harvester. For instance, if you want to protect your site’s contact page containing your email addresses, disallow spider access to that page.

robots.txt User-agent: *
Disallow: /contact_page.html
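To see what "honoring" the file involves, here is a minimal sketch of the check a well-behaved spider might make before requesting a page. It is deliberately simplified: it only understands `Disallow` prefix rules in the `User-agent: *` group, and the function name `isAllowed` is made up for this example. Paths in rules are written with a leading slash.

```javascript
// Return true if `path` may be fetched by any robot ("User-agent: *").
// Simplified: handles only Disallow prefix rules in the "*" group.
function isAllowed(robotsTxt, path) {
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      inStarGroup = value === "*";
    } else if (field === "disallow" && inStarGroup && value !== "") {
      if (path.startsWith(value)) return false;
    }
  }
  return true;
}

const robotsTxt = "User-agent: *\nDisallow: /contact_page.html\n";
console.log(isAllowed(robotsTxt, "/contact_page.html")); // false
console.log(isAllowed(robotsTxt, "/index.html"));        // true
```

Nothing forces a spider to run a check like this, which is why the file only stops robots that choose to cooperate.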

Use meta tags on pages

An HTML <meta> tag adds special notes to a page. The “robots” note tells spiders whether they should index the page and follow its links. Email harvesters and other spiders are supposed to honor this.

The meta tag for robots belongs in the <head> part of a page. The tag’s “name” attribute is “robots” and its “content” is a pair of words separated by a comma. Include the “noindex” word and spiders should ignore the page’s content, excluding it from search indexes and (in principle) harvested email address lists. Add the “nofollow” word and spiders and harvesters should not follow the page’s links.

For example, add the following to pages that you do not want indexed by a spider or spambot, such as a page of email addresses:

HTML <meta name="robots" content="noindex,follow" />

Add this to any page containing links to a page that you don’t want a spider or spambot to find:

HTML <meta name="robots" content="index,nofollow" />
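As a rough illustration of how a cooperative spider could interpret these tags, the sketch below reads the “robots” meta tag and reports what it permits. The function name `robotsDirectives` is hypothetical, and the regular expression assumes the “name” attribute comes before “content”, as in the examples above.

```javascript
// Read the robots <meta> tag from a page and report what a polite spider may do.
// Crude pattern; assumes name="robots" appears before content="...".
function robotsDirectives(html) {
  const m = html.match(/<meta\s+name="robots"\s+content="([^"]*)"/i);
  const words = m ? m[1].toLowerCase().split(",").map((w) => w.trim()) : [];
  return {
    index: !words.includes("noindex"),   // may the page content be used?
    follow: !words.includes("nofollow"), // may the page's links be followed?
  };
}

console.log(robotsDirectives('<meta name="robots" content="noindex,follow" />'));
// { index: false, follow: true }
console.log(robotsDirectives("<p>No meta tag</p>"));
// { index: true, follow: true }
```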

When using “nofollow”, all links on the page are affected. Using this tag on a home page or site map would hide most or all of your web site from harvesters and from search engine spiders. Unless you want to hide your site from search engines, use this tag only on selected internal pages and not on the home page. For finer control of which links to block, and which to allow, instead use the “nofollow” attribute on individual links, discussed next.

Use “nofollow” on links

Every link on a page provides a path for a spider or spambot to follow and get to another page. You can mark a link that should not be followed by adding a “rel” attribute with the value “nofollow” to the link. Well-behaved spiders, like those for search engines, will honor the attribute. Email harvesters may or may not honor it.

HTML <a href="contact_page.html" rel="nofollow">Contact page</a>
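A spider that honors the attribute has to filter links before following them. The following sketch shows one way that filtering might look; `followableLinks` is a made-up helper, and the regular expressions only cope with double-quoted attributes.

```javascript
// Return the hrefs of <a> tags that a polite spider would follow,
// skipping any tag whose rel attribute contains "nofollow".
function followableLinks(html) {
  const links = [];
  for (const tag of html.match(/<a\s[^>]*>/gi) || []) {
    const href = tag.match(/href="([^"]*)"/i);
    const rel = tag.match(/rel="([^"]*)"/i);
    const relWords = rel ? rel[1].toLowerCase().split(/\s+/) : [];
    if (href && !relWords.includes("nofollow")) links.push(href[1]);
  }
  return links;
}

const linkPage =
  '<a href="index.html">Home</a> ' +
  '<a href="contact_page.html" rel="nofollow">Contact page</a>';
console.log(followableLinks(linkPage)); // [ 'index.html' ]
```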

Use JavaScript to link to a hidden page

Spiders look for links to follow. To block spiders and spambots, but not legitimate visitors, change the way that links work by using JavaScript to create a similar effect. Harvesters and web spiders that don’t run JavaScript cannot follow these links. Site visitors must have JavaScript enabled in their browsers or the link will not work for them either.

HTML <a href="javascript:window.location='contact_page.html';">Contact page</a>

The same approach will keep out all spiders, including those for search engines. Since search engines like Google are an important way for visitors to find your web site, overusing JavaScript links can make your site invisible to search engines and unfindable by visitors.
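Why a JavaScript link stops a spider that cannot run scripts can be shown with a small sketch. The helper below (`crawlableUrl`, a hypothetical name) does what a simple spider must do with each href: decide whether it is an address that can be requested from a server. A “javascript:” href is code rather than an address, so there is nothing to fetch.

```javascript
// A spider that does not run JavaScript can only treat an href as a URL.
// Anything with a scheme other than http or https is not a page to fetch.
function crawlableUrl(href) {
  const scheme = href.match(/^([a-z][a-z0-9+.-]*):/i);
  if (scheme && !["http", "https"].includes(scheme[1].toLowerCase())) {
    return null; // javascript:, mailto:, etc.
  }
  return href; // relative or http(s) URL
}

console.log(crawlableUrl("contact_page.html")); // contact_page.html
console.log(crawlableUrl("javascript:window.location='contact_page.html';")); // null
```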

Use Flash to link to a hidden page

Adobe’s Flash plugin can show a Flash object that embeds a link to another page. Site visitors can click on the object to get the page, but web spiders and harvesters can’t follow the link.

HTML <object data="flash_contact_page.swf" type="application/x-shockwave-flash"
width="80" height="20"><param name="movie" value="flash_contact_page.swf"></object>

The Flash plugin is free and may be downloaded from Adobe’s web site. It’s available for most operating systems and most web browsers. Site visitors must download and install the plugin or your link will not be visible or clickable, but most visitors already have the plugin.

Adobe’s Dreamweaver web authoring application can build “Flash Text” links like this with a few mouse clicks. In Dreamweaver’s menus, select Insert > Media > Flash Text and fill out the dialog box.

Use a form to link to a hidden page

Clicking on a form button advances the visitor to another web page, similar to following a page link. While the intent of forms is to collect information from a visitor (name, address, credit card number, etc.), you can create a valid form that collects nothing. The form button is just a fancy link.

Web spiders and spambots may ignore pages that are only reachable via a form button. Such hidden pages may be a safe place for protected email addresses. Site visitors can still reach the page by clicking on the form button.

HTML <form action="contact_page.html" method="post">
<input type="submit" value="Contact page"></form>

Embed a page within a frame

A “frame” incorporates a second web page into the body of a main page. There are two types of frames: <frame> and <iframe>.

A <frame> tag is used as part of a <frameset> tag, which replaces the <body> of a page with a collection of adjacent embedded pages. You’ve probably seen pages like this where a menu remains stationary on the side of the window while a center region contains a page you can scroll up and down.  Each <frame> tag in a <frameset> gives the URL of an embedded page (such as one for the stationary menu and another for the center document). If a web spider or spammer harvester does not recognize <frame> tags, it will not find the embedded pages, making those pages a good place to hide email addresses. All current web browsers support frames, so site visitors will see the embedded page.

HTML <frameset rows="*" cols="100%">
<frame src="contact_page.html" />
</frameset>

A <frameset> and its <frame> tags take over the entire web page. An <iframe> tag, in contrast, takes over only part of a page. Otherwise, it works about the same as a <frame>. All current web browsers support <iframe> tags.

HTML <iframe src="contact_page.html"></iframe>
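Whether a form button or frame hides a page depends on which tags the harvester bothers to read. The sketch below contrasts a spider that only reads <a> tags with one that also reads form actions and frame sources. The rule lists `BASIC` and `THOROUGH`, the helper `extractUrls`, and the page name `staff.html` are all hypothetical, chosen just for this illustration.

```javascript
// Each entry pairs a tag name with the attribute that holds a URL.
const BASIC = [["a", "href"]];
const THOROUGH = [["a", "href"], ["form", "action"], ["frame", "src"], ["iframe", "src"]];

// Collect URLs from the tags and attributes listed in `rules`.
function extractUrls(html, rules) {
  const urls = [];
  for (const [tag, attr] of rules) {
    const pattern = new RegExp(`<${tag}\\s[^>]*${attr}="([^"]*)"`, "gi");
    for (const m of html.matchAll(pattern)) urls.push(m[1]);
  }
  return urls;
}

const testPage =
  '<a href="index.html">Home</a>' +
  '<form action="contact_page.html" method="post"></form>' +
  '<iframe src="staff.html"></iframe>';
console.log(extractUrls(testPage, BASIC));    // [ 'index.html' ]
console.log(extractUrls(testPage, THOROUGH)); // [ 'index.html', 'contact_page.html', 'staff.html' ]
```

A harvester written like the second case walks straight through these hiding places.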

Redirect to a new page

To follow a web page link, a browser, spider, or spambot asks a web server for the linked-to page. Servers usually respond with the page’s text, but they can also respond with a “redirect” if the page has moved to a new location. For a redirect, the browser then asks for the redirected-to page. While browsers and web spiders follow redirects automatically, harvesters might not. The redirected-to page may be hidden from the harvester and a safe place to hide email addresses.

You can redirect to a “mailto” link. Browsers treat this the same as a “mailto” on the original page.

HTML <a href="redirect_mailto.php">email me</a>
PHP <?php
  header( 'Location: mailto:person@example.com' );
  exit( );
?>

Or redirect to a web page:

HTML <a href="redirect_contact_page.php">Contact page </a>
PHP <?php
  header( 'Location: contact_page.html' );
  exit( );
?>

These examples use a PHP script on the server to respond with a redirect. You also can use Perl, ASP, ColdFusion, etc. to do the same thing. Examples are available from James Thornton in his article Redirect mailto: for Spam Prevention. You will need the appropriate scripting engine enabled on your web server.

Instead of using a PHP script, you can use an Apache web server “Redirect” directive with the “mod_alias” module. The directive’s first argument is the name of the page to redirect away from, and the second argument is the destination of the redirect. The directive can go in the server’s main configuration file, or in a “.htaccess” file in your web site.

HTML <a href="redirect_contact_page.html">Contact page </a>
Apache
Redirect /redirect_contact_page.html http://yoursite/contact_page.html
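From the harvester's side, following a server redirect takes very little code, which may help explain why so many of them do it. This sketch simulates the exchange with a hypothetical lookup table (`responses`) in place of a real server, and assumes the server answers with status 302 and a Location header.

```javascript
// Hypothetical server responses: URL -> { status, location?, body? }.
const responses = {
  "/redirect_contact_page.php": { status: 302, location: "/contact_page.html" },
  "/contact_page.html": { status: 200, body: "person@example.com" },
};

// Follow Location headers until a non-redirect response, up to maxHops.
function fetchFollowing(table, url, maxHops = 5) {
  let current = url;
  for (let hop = 0; hop <= maxHops; hop++) {
    const res = table[current];
    if (!res) return null; // no such page
    if (res.status >= 300 && res.status < 400 && res.location) {
      current = res.location; // ask again for the redirected-to page
      continue;
    }
    return { url: current, body: res.body };
  }
  return null; // too many redirects
}

console.log(fetchFollowing(responses, "/redirect_contact_page.php"));
// { url: '/contact_page.html', body: 'person@example.com' }
```

Calling `fetchFollowing` with `maxHops` set to 0 models a harvester that ignores redirects: it returns `null` and never sees the address.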

Instead of using an Apache directive or server-side scripting, you can add an HTML meta refresh tag to the top of a page to redirect that page to a new page. With this method, a link on a page leads to a blank intermediate page that contains this meta tag in its header. Browsers immediately recognize the tag and redirect to the next page. Harvesters may not follow the meta redirect.

HTML (1st page) <a href="redirect_contact_page.html">Contact page</a>
HTML (2nd page) <meta http-equiv="refresh" content="0;url=http://yoursite/contact_page.html" />
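To follow a meta refresh, a harvester has to read the tag's “content” value and pull out the destination. A minimal sketch of that parsing step is below; `parseMetaRefresh` is a hypothetical helper, and it accepts only the simple “seconds;url=address” form used above.

```javascript
// Parse the content of a meta refresh tag,
// e.g. "0;url=http://yoursite/contact_page.html".
function parseMetaRefresh(content) {
  const m = content.match(/^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$/i);
  if (!m) return null; // not a recognizable refresh value
  return { delaySeconds: Number(m[1]), url: m[2] ? m[2].trim() : null };
}

console.log(parseMetaRefresh("0;url=http://yoursite/contact_page.html"));
// { delaySeconds: 0, url: 'http://yoursite/contact_page.html' }
```

A value with no address, such as "5", yields a delay with `url` set to `null`, which simply reloads the current page.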

Results

I tested 23 widely available email harvesters to see how well these methods protect email addresses on hidden pages. Each email harvester was aimed at an Apache web server with a set of test pages that used these methods to hide links or pages. In the table below, a harvester gets a check mark if it found the hidden page with a protected email address.

All of the harvesters were tested on Windows XP SP2. The names of the email harvesters are intentionally left out so that this page does not attract search engine traffic from spammers looking for the “best” harvester to download.

Hidden web page test results
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Plain email address
Use “robots.txt” for the web site
Use meta tags on pages - nofollow
Use meta tags on pages - noindex
Use “nofollow” on links
Use JavaScript to link to a hidden page                                              
Use Flash to link to a hidden page                                              
Use a form to link to a hidden page                                    
Embed a page within a frame                      
Embed a page within an iframe                            
Redirect to a “mailto” link - PHP                                    
Redirect to a web page - PHP          
Redirect to a web page - Apache  
Redirect to a web page - Meta refresh tag                                          

Every spammer email harvester found the plain email address that was not protected.

Unsurprisingly, none of the email harvesters honored web spider directions in the site’s “robots.txt” file, in page <meta> tags, or in links with “nofollow” added. The email addresses in these supposedly hidden pages were all found.

The JavaScript and embedded Flash links were not followed by any email harvester.

Links in forms and frames were followed by about half of the email harvesters, leading them to the hidden pages of email addresses.

PHP and Apache redirects to a web page were followed by nearly all of the email harvesters. Two spambots followed redirects from a <meta> tag refresh, and five recognized a redirect to a “mailto” link.

Conclusions

Web spider directions in “robots.txt”, <meta> tags, and links are not an effective way to stop email harvesters and reduce spam. Spammers just ignore these conventions and harvest the pages anyway.

Redirect methods to stop spammers are not effective. Redirects are such a common feature of the web that most email harvesters and web spiders handle them. Pages hidden behind a redirect are not safe.

Hiding pages behind links in forms and frames is not effective. These are standard ways to link to additional web pages and spammer harvesters and spiders follow them.

JavaScript links are effective at hiding pages from web spiders, but visitors must have JavaScript enabled in their browser. Today, most people do. Until harvesters support JavaScript too, email harvesters will not be able to follow JavaScript links.

Flash links are effective at hiding pages from spammer email harvesters, but visitors must have the Flash plugin installed. Most visitors do, but some visitors block Flash animations as a way of reducing the number of web page ads that blink at them. These visitors won’t be able to follow the link to your hidden web page. Also, the screen readers used by the visually impaired cannot read the text in a Flash link. Flash links have poor usability and accessibility.

Recommendation: Of these methods, only JavaScript links were effective and had good usability and accessibility. This trick is widely used to hide web pages from web spiders, including those for search engines. If you need to protect an entire page of email addresses, this is an effective way to do so. But if you only need to protect a single email address, this is pretty cumbersome. The other articles in this series look at effective ways to protect individual addresses.

A weak spot for all of these methods is that you must protect all links to a hidden page. This may be hard to do when links are automatically created for site maps, RSS feeds, and article lists. If there is an unprotected link to a hidden page posted anywhere on your site, or at any other web site, an email harvester can get through.

Comments

html redirect code

About Redirect to other web address // html code --

http://html-lesson.blogspot.com/2008/06/redirect-to-web-addres.html

I didn't know redirection

I didn't know the redirection method wasn't effective. I've been advising people to use the technique all along. Thanks for letting me know. Great post.

Redirecting MAILTO with PHP

I was wondering what you think of this method for hiding email addresses:
Let's pretend my email address is "myname@mydomain.com".
I make a file called "email.php" containing the following:

<?php
header ("Location: mailto:$_GET[n]@$_GET[d]");
?>

And the HTML for the link looks like this:

<a href="/email.php?n=myname&d=mydomain">myname[at]mydomain[dot]com</a>

I'm really, really new to PHP, so I was just wondering if there are any holes in securing my email addresses this way. Other than that, I must say that this was a fantastic article! Thanks for sharing =).

Re: Redirecting MAILTO with PHP

My testing with a simpler version of this idea redirected to a fixed email address. Two harvesters caught the address. Adding parameters via PHP won't change the result. So, I regret that this is not a perfect solution.

Additionally, the link you suggest uses [at] and [dot] in place of @ and . characters. I tested a variant of this case and found one harvester that still picked out the address. So, your link text isn't safe either.

Nadeau software consulting