How to deal with (block) semalt referer spam in your Analytics Data

Update to this post on 5/3/15: updated the referer spam blocking code.

I spent some time reviewing the Google Analytics data for my own web properties and for a number of sites I administer for clients, and noticed that the problem of fake "semalt" bot visits has continued to grow worse. If you have a public website, you likely already know what I'm talking about. This bot browses sites with a fake user-agent, ignores robots.txt, and sends bogus data in the HTTP Referer header. The result is tons of garbage in any analytics platform: the bounce rate is inflated, the average session time drops, and many other metrics that we monitor are skewed.

On one of my client's sites, there were 34 visits from semalt over a one-week period (or at least visits arriving with "semalt" in the HTTP Referer header). These visits have nearly a 100% bounce rate (the percentage of visitors who visit the site and leave rather than viewing additional pages), essentially viewing 1 page per session. Without this bogus data, the site averages over 3 pages per session and about a 50% bounce rate. Also, since the visitor is a bot, the session duration shows up as a big fat 0, where true visits are often in excess of 1 minute. As you can see, the semalt visits make a huge impact on this site's averages.

I decided this was enough. A few weeks ago I submitted one of my domains through the form that they claim will stop their bot from crawling a site. At first I did see a decrease in their traffic to this domain, but after a week or so the traffic increased again. I decided a better solution was needed. There are many posts out there about dealing with them through filters in Analytics, and these all work great; they fix the results for you. I, however, was too annoyed to just filter them out. I wanted revenge.

There are a number of approaches to dealing with them. The route most people seem to take is to set Google Analytics to ignore visits from semalt. That works, and your stats should return to normal. However, semalt is still crawling your site as frequently as before; you just aren't looking at the visits anymore. I call this the "head in the sand" approach. Pritesh Patel has written a great article about this approach, and it should be simple to implement for almost anyone without any technical knowledge. There is also a nice article by Kim Herrington over at Bear and Beagle on the same.

Another method involves using mod_rewrite in the site's vhost.conf or .htaccess file to return a 403 Forbidden when the referer contains "semalt.com". This approach has the benefit of limiting the bandwidth consumed by the semalt visits to almost zero. We can call this the "go away" approach. The condition simply matches the Referer header against a search string, in this example "semalt\.com". The [NC] flag tells mod_rewrite to ignore the cAsE. The rule then returns a 403 Forbidden for any request that matches the condition. This does take some technical knowledge to implement, and if done incorrectly it could unintentionally block legitimate traffic to your site.
RewriteEngine on
RewriteCond %{HTTP_REFERER} semalt\.com [NC]
RewriteRule .* - [F]
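
If more than one referer spammer turns up in your reports, the same condition can be widened with a regular-expression alternation rather than repeating the whole rule. This is only a sketch: "spam-example.net" is a hypothetical placeholder domain, not a known offender, so substitute whatever actually appears in your own data.

```apache
RewriteEngine on
# Match either domain anywhere in the Referer header, ignoring case.
# "spam-example.net" is a made-up placeholder for illustration only.
RewriteCond %{HTTP_REFERER} (semalt\.com|spam-example\.net) [NC]
# "-" means "no substitution"; [F] answers with 403 Forbidden.
RewriteRule .* - [F]
```

Keeping the dots escaped (\.) matters here, because an unescaped dot matches any character and could catch referers you did not intend to block.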

My approach was a little different, and I'll explain the steps I took to implement it below. We can refer to it as the "F^*& You!" approach. We take a mod_rewrite condition similar to the one above, but instead of just returning Forbidden to semalt, we proxy the request internally to an alternate vhost we have configured, so semalt never sees my real sites and their garbage doesn't show up in my analytics accounts. I sat down, gave the matter some thought, and tested a few ideas until I arrived at a solution that would be easy to maintain, automatic (or at least low-maintenance), and most importantly low-impact on my servers. Since we are proxying the requests and sending data back, there is some additional overhead on the server, but I felt this was a good trade-off compared with having them crawl my full sites and the full sites of my clients.

My method requires 4 fairly standard apache2 modules: mod_rewrite, mod_proxy, mod_proxy_http, and mod_bw. Chances are you are already using at least one of these; I was using 3. Installing and enabling them is simple, but you do need to restart Apache afterwards (on Ubuntu, service apache2 restart). On most standard installations rewrite, proxy, and proxy_http are already installed, so you only need to enable them. mod_bw installs easily on Ubuntu, and you can likely adapt these commands for other Linux hosting platforms running Apache (Windows too, but that is untested). Remember, you will need to either be root or run these commands with sudo.

apt-get install libapache2-mod-bw
a2enmod bw
a2enmod rewrite
a2enmod proxy
a2enmod proxy_http

Configuration: add a rewrite section to the vhost config of any site you wish to protect from semalt's shenanigans. If you have multiple vhosts, you'll need to repeat this step for each. You'll notice below that I include not only the rewrite condition for the Referer, but also checks against a "hosts.deny" file that defines IP addresses and host names (the host name check relies on HostnameLookups being set to On, which can be a significant performance hit). With the rules as they are below, any request from an IP address or host name defined in the hosts.deny file, or any request with "semalt.com" in the Referer, is proxied to the alternate vhost on port 8888.

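    # Rewrite Rules #####################
    # rewrite:trace2 is verbose; handy while testing, consider lowering it afterwards
    LogLevel alert rewrite:trace2
    RewriteEngine On
    # Plain-text lookup table of denied IP addresses and host names
    RewriteMap hosts-deny txt:/path/to/hosts.deny
    # Match if the forwarded-for address, client IP, or client host name is in the map...
    RewriteCond ${hosts-deny:%{HTTP:X-FORWARDED-FOR}|NOT-FOUND} !=NOT-FOUND [OR]
    RewriteCond ${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND [OR]
    RewriteCond ${hosts-deny:%{REMOTE_HOST}|NOT-FOUND} !=NOT-FOUND [OR]
    # ...or if the Referer contains "semalt.com" (case-insensitive)
    RewriteCond %{HTTP_REFERER} semalt\.com [NC]
    # Proxy every matching request to the alternate vhost on port 8888
    RewriteRule ^/(.*) http://localhost:8888/ [P]
    ProxyPassReverse / http://localhost:8888/
    # end Rewrite Rules #################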
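
The rules above only test whether a key exists in the map, so the hosts.deny file needs nothing more than one entry per line: the key (an IP address or host name), some whitespace, and any value. Here is a sketch of what such a file might look like. The addresses and host name are placeholders taken from the reserved documentation ranges, not real offenders, and "blocked" is an arbitrary marker value used only for illustration.

```apache
# Contents of /path/to/hosts.deny (RewriteMap txt format: key, whitespace, value)
# Lines starting with "#" are comments.
# The value can be anything except the literal NOT-FOUND, which the
# rewrite conditions use as their "no match" default.
192.0.2.15            blocked
203.0.113.7           blocked
crawler.example.net   blocked
```

As far as I know, Apache caches lookups from a txt map until the file's modification time changes, so adding an entry should take effect without a restart, but it is worth confirming on your own server.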

Create a new vhost on an alternate port; I used 8888. Then configure a page for it to serve. In /var/www/pita I created index.html, a simple page with no dynamic resources (to save on CPU cycles and memory). Because this custom site doesn't share any code or content with our real sites, we no longer have to worry about bogus analytics data, email address harvesting, comment spam, content theft, or whatever the hell semalt is doing crawling my sites. We can also use this space to send a message to the bad bots. I decided to present a simple cease-and-desist notice here (I've decided not to share the text of this document, as it could be interpreted as legal advice). While it was not drafted by a lawyer and is likely not legally binding, I hope that semalt and any other bot, scammer, or spammer that I redirect to this page will take notice.

In the configuration file for the *:8888 vhost I added the mod_bw settings to slow it down (since, in theory, this vhost should only be returning data to our dear friends at semalt, we can have it run much slower):

<VirtualHost *:8888>
    DocumentRoot /var/www/pita
    <Directory "/var/www/pita/">
        Require all granted
        Options +Indexes
    </Directory>
    LogFormat "%v %l %u %t \"%r\" %>s %b" comonvhost
    ErrorLog /var/log/apache2/pita-error_log
    LogLevel warn
    TransferLog /var/log/apache2/pita-access_log

    # mod_bw: throttle every response served by this vhost
    BandWidthModule On
    ForceBandWidthModule On
    # Limit all clients ("all") to 10000 bytes per second
    BandWidth all 10000
</VirtualHost>
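
The vhost block above also depends on a matching Listen directive, since Apache will not open port 8888 without one. This is a minimal sketch, assuming the Ubuntu layout where listeners live in /etc/apache2/ports.conf, and assuming you want the port reachable only from the server itself so that outside clients cannot hit the slow vhost directly.

```apache
# /etc/apache2/ports.conf (assumed location on Ubuntu)
# Bind to the loopback address only: the [P] proxy rule can still reach
# http://localhost:8888/, but the port is not exposed to the internet.
Listen 127.0.0.1:8888
# If "localhost" resolves to the IPv6 loopback on your system, you may
# need this listener as well (uncomment to enable):
# Listen [::1]:8888
```

If you would rather expose the port publicly, a plain "Listen 8888" works too, at the cost of letting anyone request the throttled page directly.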

 

We don't want this to run so slowly that it negatively impacts our servers, but slowly enough that it limits semalt's ability to make additional requests. If each request takes even a few additional seconds, it will make a huge impact on semalt, especially if many others adopt this technique. If you have had success with this method, or have come up with a method of your own, I'd love to hear about it in the comments below.