Stop spammer email harvesters by blocking spammer access to the site

Technologies: Apache 1+ or 2+

Legitimate web site visitors are there to read your content, but spammers only visit to run email harvesters (spambots) that scan your web pages for email addresses. To protect your addresses, and avoid wasting network bandwidth talking to spammers, change your web server configuration to block spammer access. Blacklist spammer IP addresses, block access from known harvester spiders, or require visitors to log in. Some of the methods tested in this article were successful at blocking email harvesters.

This article is part of a series on Effective methods to protect email addresses from spammers that compares and tests 50 ways to protect email addresses (and web pages) published on a web site.

How to block spammer access to your web site

An email harvester (“spam robot” or “spambot”) is a web spider that crawls your web site looking for web pages containing email addresses.  To get a web page, the harvester sends a request for the page to your web server. Web browsers and legitimate search engine spiders make the same kinds of requests. If we could distinguish between requests from harvesters and those from legitimate programs, we could deny harvester requests and protect the web site. And if harvesters can’t access the site, they can’t add its email addresses to mailing lists and you’ll get less spam.

The information in the web page request is all the web server has to go on when deciding whether to deny a harvester access.  That page request contains the following for email harvesters, browsers, and web spiders alike:

  • Required information:
    • The visitor’s computer IP address (technically this is in the network packet, not the request, but it doesn’t really matter for this discussion).
    • The URL of the page to get.
  • Optional information:
    • “User-Agent”:  The browser/spider name and version number.
    • “Referer”: The URL, if any, for the page containing the link the visitor clicked on to get to the requested page.
    • “Accept”:  The type of web page formats the browser understands (such as text, HTML, and JPEG).
    • “Accept-Encoding”:  The types of file compression the browser understands (such as ZIP).
    • “Accept-Charset” and “Accept-Language”: The character set and language used by the visitor’s browser.
    • Login authorization.
    • And several even more esoteric items.

The Web-Sniffer.net home page can show you what a request looks like from your web browser.  At their site, type in a URL and press their “Submit” button. When I do this for “google.com”, my browser’s request looks like this:

GET / HTTP/1.1
Host: google.com
Connection: close
Accept-Encoding: gzip
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3
Referer: http://web-sniffer.net/

In English for humans, this says my browser wants Google’s home page, that I’m using U.S. English and the standard UTF-8 character set, that my browser is Mozilla Firefox on an Intel Mac, that it accepts a standard set of HTML and image file types, and that I’m making the page request from Web-Sniffer.net.

This information enables the web server to decide what type of web page to return to me (English HTML, for instance). We can perhaps use some of this to decide if the request is coming from an email harvester.

The language, character set, file formats, and compression schemes supported by the browser are not that useful for blocking spammers. The referer page is liable to be one of your own web pages when a spammer’s email harvester crawls your site and follows your links. Only the IP address and user agent have potential for detecting harvesters.

Below I discuss several methods for blocking spammer access to a site. I then report the results of running a collection of email harvesters against these methods to see which are effective, and which are not.

Block access based upon the IP address

Every computer on the Internet has a unique IP (Internet Protocol) address. While a computer’s IP address can change from day to day, it normally does not. A spammer on a cable modem or DSL connection at a home or business might use the same IP address for months. With a special web server configuration, you can block spammers (or anybody) based upon their IP address.

Apache’s “Allow” and “Deny” directives from the “mod_access” module enable you to allow or block access for an IP address. The directives go in your server’s main configuration file, or into the “.htaccess” file for your site or any of its folders. If you have multiple addresses to block, you can have multiple “Deny” directives.

You can block access from a single IP address, such as “1.2.3.4”, or all IP addresses within a subnet, such as “1.2.3.” (ends with a dot but no number), or all IP addresses for a domain, such as “.example.com” (starts with a dot). Beware that using a domain name in a “Deny” directive requires that Apache do two DNS lookups for every page access, slowing down the web server.

Apache
# Allow everybody in by default, then deny the listed addresses.
Order Allow,Deny
Allow from all
# Block a single IP address:
Deny from 1.2.3.4
# Block an entire subnet (trailing dot, no final number):
Deny from 1.2.3.
# Block every host in a domain (leading dot; requires DNS lookups):
Deny from .example.com

Some content management systems, such as Drupal, include built-in features for blocking access based upon the IP address or user agent. The user interface for these features is usually better than editing an Apache configuration file.

You can build up a list of spammer IP addresses by watching your web server logs. A large number of page requests in rapid succession is probably a web spider.  If they aren’t coming from Google, Yahoo, MSN, or another legitimate search engine, then it’s probably a spambot crawling your site. To block further access by the spambot, add a “Deny” directive for its IP address to your server configuration.
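As a starting point, a short script can tally the requests made by each IP address in your access log and suggest candidate “Deny” directives for the heaviest users. This is only a minimal sketch: it assumes the standard Apache common or combined log format (where the client IP address is the first field on each line), and the threshold of 200 requests is an arbitrary placeholder to tune for your own traffic.

Python
#!/usr/bin/env python
# Tally requests per client IP address in an Apache access log and print
# a candidate "Deny from" directive for each unusually busy address.
import sys
from collections import Counter

THRESHOLD = 200   # placeholder; adjust for your own site's traffic

counts = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        fields = line.split()
        if fields:                   # skip blank lines
            counts[fields[0]] += 1   # first field is the client IP address

for ip, hits in counts.most_common():
    if hits < THRESHOLD:
        break
    # Verify the address isn't a legitimate search engine before blocking it.
    print("Deny from %s   # %d requests" % (ip, hits))

Run it against your access log and review the output before adding any of it to your server configuration.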

Unfortunately, by the time your server log shows that your site has been harvested, it’s too late. Instead, you can use any of several “Domain Name System Black List” (DNSBL) services that provide lists of known spammer IP addresses. Wikipedia has a useful list of these services.

You can check if your own IP address is on any of these blacklists by visiting DNSBL.info and clicking on the “Check this IP” button. Chances are that you are on one of these lists, even if you are not a spammer. Several of these lists automatically add anybody that uses a dynamic IP address, which includes most home users. CipherTrust.com’s Zombie Stats web page reports that about 25% of the computers on the Internet are zombies controlled by spammers. Most of these are home computers, so it is reasonable for some DNSBL lists to include all home computers as a precaution. Obviously, don’t use a blacklist you are on or you’ll block your own access to your web site.

DNSBL services are primarily intended for mail servers and ISPs, which query them to reject traffic from listed addresses. You can also download these lists and use them yourself in your web server configuration. There are also custom Apache modules that incorporate a DNSBL, such as mod_dnsbl and mod_access_rbl2. I have not tried either of these.
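If you would rather check individual addresses than download a whole list, DNSBLs are designed to be queried through ordinary DNS lookups: reverse the octets of the IP address, append the blacklist’s zone name, and look up the resulting host name. Getting an answer back (typically an address like 127.0.0.2) means the IP address is listed; a failed lookup means it is not. Here is a minimal sketch; the zone name “dnsbl.example.org” is a placeholder for whichever DNSBL service you choose.

Python
#!/usr/bin/env python
# Check whether an IP address appears on a DNSBL by reversing its octets,
# appending the blacklist's DNS zone, and looking up the resulting name.
import socket
import sys

DNSBL_ZONE = "dnsbl.example.org"   # placeholder; substitute a real DNSBL zone

def is_listed(ip, zone=DNSBL_ZONE):
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)   # an answer means the address is listed
        return True
    except socket.gaierror:           # no such name: the address is not listed
        return False

if __name__ == "__main__":
    print(is_listed(sys.argv[1]))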

Block access based upon the user-agent

Instead of using an IP address blacklist, you can block access based upon the “User-Agent” part of each page request. The user agent tells the web server what type of browser or spider is asking for the page. For example, the user agent reported by Firefox for my Intel Mac reads:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3

The same browser on my Windows PC reports:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3

Web spiders also report their names in the user agent text. Google’s googlebot spider reports:

Googlebot/2.1 (+http://www.google.com/bot.html)

UserAgentString.com maintains a list of user agent strings for dozens of browsers and spiders. The site’s home page will show you the user agent text for your own browser.

You can use Apache’s “BrowserMatch” or “BrowserMatchNoCase” directives from the “mod_setenvif” module to allow or block access for specific user agents. Either directive goes in your server’s main configuration file, or into any “.htaccess” file within your web site.

The “BrowserMatch” and “BrowserMatchNoCase” directives set an Apache environment variable if the user agent text matches a search phrase (“BrowserMatchNoCase” ignores differences in upper and lower case). Then you can use an “Allow” directive to allow access only if that variable has been set properly by a previous match.
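These directives can also work the other way around: flag requests whose user agent matches a known harvester, and deny only those while allowing everyone else. The harvester name below is a placeholder; substitute the user agent text you actually see in your server logs.

Apache
# Flag requests whose user agent contains the (placeholder) harvester name.
BrowserMatchNoCase some_harvester_name deny_access
# Allow everybody except flagged requests.
Order Allow,Deny
Allow from all
Deny from env=deny_access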

Alternatively, you can build an allow list of legitimate browsers. Adding a directive for the user-agent text of each and every one of the many browsers available would be tedious, but fortunately there is a shortcut: most browser programs are written to use one of a few standard web page handlers, such as “Gecko” (Camino, Firefox, Mozilla, Netscape), “KHTML” (Konqueror, OmniWeb, Safari), “MSIE” (Internet Explorer, AOL), and “Opera” (Opera, Wii).

The example here allows access only if the page request comes from one of these known browsers, and denies everything else:

Apache
# Deny everybody by default, then allow requests whose user agent matched
# one of the patterns below (the match sets the "allow_access" variable).
Order Deny,Allow
BrowserMatchNoCase gecko allow_access
BrowserMatchNoCase khtml allow_access
BrowserMatchNoCase msie allow_access
BrowserMatchNoCase opera allow_access
BrowserMatchNoCase playstation allow_access
Deny from all
Allow from env=allow_access

If you want to allow access by search engine spiders, you’ll need to include robot user-agent names too.
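For example, at the time of writing the major search engine spiders identified themselves as “Googlebot” (Google), “Slurp” (Yahoo), and “msnbot” (MSN), so an allow list like the one above might also include:

Apache
BrowserMatchNoCase googlebot allow_access
BrowserMatchNoCase slurp allow_access
BrowserMatchNoCase msnbot allow_access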

Require a login to access the site

Instead of blocking access to your site based upon the IP address or user agent of a visitor (or spambot), you can use a login form to restrict access.  Legitimate visitors with accounts will have access to your site.  Everybody else gets blocked.

The Apache "mod_auth" module can protect a folder by requiring a login when any page in the folder is requested. Login directives go in a “.htaccess” file in the folder to be protected. A “.htpasswd” file in the site’s root folder (or anywhere you choose) contains the login names and encrypted passwords.

You can use a login form to protect the entire site, or any portion of a site. This example protects all pages in a “protected” folder, such as a contact page of email addresses.

HTML
<a href="protected/contact_page.html">Contact page</a>

Apache .htaccess
AuthName "Site access"
AuthType Basic
AuthUserFile /somewhere/.htpasswd
Require valid-user

Apache .htpasswd
my_user_name:my_encrypted_password
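The encrypted passwords in the “.htpasswd” file are normally created with Apache’s htpasswd utility rather than typed in by hand. For example, the following creates the file with one account and prompts for its password (the file path and user name are the placeholders from the example above); omit the “-c” option when adding further accounts to an existing file.

Shell
htpasswd -c /somewhere/.htpasswd my_user_name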

If you are using a content management system, such as Drupal or WordPress, it has its own login mechanism, along with features for managing accounts and selecting which pages are visible to which types of visitors.

Results

I tested 23 widely available email harvesters to see how well these methods block their access to a test web site. Each harvester was aimed at a test site containing email addresses. The table below summarizes how many harvesters made it past each block and got the protected email address.

All of the harvesters were tested on Windows XP SP2. The names of the harvesters are intentionally omitted to avoid attracting search engine attention from spammers looking for the “best” harvester to download.

Web site blocking test results

Protection method                           Harvesters that got the protected email address (of 23 tested)
Plain email address (no protection)         All 23
Block access based upon the IP address      None
Block access based upon the user-agent      Most
Require a login to access the site          3 (when given a valid account name and password)

Every email harvester found the plain email address that was not protected.

Blocking access based upon an IP address blacklist stopped all of the email harvesters.

Blocking access based upon the user agent text only stopped a few email harvesters. Most of the harvesters lied in their user agent text, claiming to be a web browser and not a spambot. Many of these email harvesters even let the spammer select which type of web browser to pretend to be. With valid user agent text, the web server allowed the harvesters access to the test web site.

All of the email harvesters were initially stopped when a login was required. However, three email harvesters can automatically log in to a site if they are told the account name and password. To test these, I provided a valid account and all of them used it to log in and find the protected email address.

Conclusions

Blocking access based upon an IP address blacklist always works, but it may not be accurate. Most home users have dynamic IP addresses that can change from day to day or month to month. A spammer can force their own IP address to change by resetting their connection. The IP address the spammer was using will get reassigned to some other innocent user. If you block the IP address because a spammer was using it, now you’re blocking a new innocent user. The spammer has moved on to a new IP address that you probably aren’t blocking yet. Chasing spammer IP addresses is a never-ending game where the spammer is always in front. IP address blocking is not an effective way to stop the email harvesters used by spammers.

User-agent blocking isn’t effective either. It presumes that email harvesters honestly report that they are harvesters. Well, if spammers were honest, they wouldn’t be breaking the law by running email harvesters in the first place. So, of course they lie. Some email harvesters cover their tracks by randomly switching among user agent texts from page to page. Some will randomize the time between successive page requests as they crawl your site. Most will also limit the number of page requests they’ll make to the same site in one session. All of this makes an email harvester look like a legitimate site visitor, making it hard to block them without also blocking real visitors.

Site logins may or may not be effective at blocking spammers depending upon the site’s policies for granting accounts, monitoring their use, and keeping passwords secure. If a spammer can get an account, they can easily give the name and password to their spambot and it’ll automatically log in and harvest your site. For sites with a small number of accounts, logins can be effective to block unwanted access. But for larger sites, and particularly for community sites with an open policy on getting accounts, it’s hard to keep track of all the logins and react quickly if one has been stolen by a spammer. Logins are not a practical way to stop spammers.

Logins also require effort on your part to create and manage accounts, and they require effort for the user to remember their account name and password. Unless your site has valuable content, users will be annoyed and the account management hassle is probably not worth it.

Recommendation: blocking web site access based upon IP addresses, user agent names, or login accounts is not a very effective way to stop spammers and their email harvesters. The other articles in this series discuss better methods to protect individual email addresses on a page, or hide specific web pages, without trying to block access to the entire site.
