
Verify Search Engine Crawlers - Spot and Block Fake User Agent Robots : 5ubliminal's TellinYa

Why is validating search engine web robots important?

For every webmaster, a Googlebot visit to the website is a small event. So every webmaster tries to be as liberal as possible about what crawlers such as Googlebot, MSNBot and Yahoo! Slurp are allowed to do.

Each of these robots identifies itself with a special User-Agent string that everyone knows. So once a visitor has that User-Agent 'put on', we automatically think it is the real deal. But is the User-Agent real or fake?

Believe it or not, many out there browse sites claiming to be Googlebot in order to gain an advantage over the competition and pass undetected. This is both a security hazard and a bandwidth one.

Examples where validation is mandatory:

Below are some real-life examples where validation is mandatory and where a lack of validation can cause serious problems. Googlebot validation saves lives. So does MSNBot and Yahoo! Slurp validation.

What if you have sensitive content you want indexed but not directly available?

Let's say you offer a membership site. You have quality content that needs to be found in order to attract customers, but you will only share it for a fee.

In order to have clients you need search engine visitors, so you need the sensitive content indexed. So you put noarchive in the robots meta tag (some engines also accept nocache) and you feed the page only to visitors who match the user agent profile.

But the User-Agent is by far the most spoofed header in HTTP. Anyone with basic knowledge can use PHP and cURL to retrieve your pages while claiming to be a well-known robot.
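To see how little effort the spoofing takes, here is a minimal sketch in Python (standing in for the PHP and cURL combination mentioned above). The URL is a placeholder, the function name is made up for illustration, and the User-Agent value is the string commonly published for Googlebot. The request is only built here, never sent.

```python
import urllib.request

# User-Agent string commonly published for Googlebot; anyone can copy it.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_spoofed_request(url, user_agent=GOOGLEBOT_UA):
    """Build (but do not send) a request that claims to come from a crawler."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL, used for illustration only.
req = build_spoofed_request("http://example.com/members-only.html")

# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

One header is all it takes, which is why the User-Agent alone proves nothing about who is really visiting.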

This way your members-only content becomes available to anyone claiming to be a web robot you trust.

What if you are cheap on bandwidth?

Some web hosting services are not that generous with bandwidth. Meanwhile you have several hungry robots hitting you as hard as you can take it and putting great pressure on your rusty web server.

I bet you would love to find the fake ones and send them back where they belong, saving yourself valuable bandwidth.

Legitimate usage of user agent spoofing

I do not believe in such a thing. When someone claims to be someone he's not, he's usually not doing it to help you or for your benefit. Any web crawler claiming to be an established and trusted one is 100% up to no good!

Even when a coder tests his own site using fake user agents, it means he has something to hide and is testing his site for cloaked content.

How can you detect the fakes?

From the User-Agent alone, the only honest answer is: I don't know! You must verify the visitor before you can issue a definitive opinion on the validity of the identification data it is using.

All major search engines support a simple check. To verify Googlebot, MSNBot, Slurp or Ask, follow these 2 easy steps:

  1. First, do a reverse DNS lookup on the visitor's IP address and check that the resulting hostname ends in the search engine's domain name
  2. Then do a forward DNS lookup on the hostname found in step one and check that it translates back to the original IP address
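The two steps above can be sketched in Python as follows. This is a minimal illustration using the standard `socket` module, not the script offered later in this post. The function names are made up for the example, only IPv4 is handled, and the resolvers are passed in as parameters so the logic can be exercised without touching real DNS.

```python
import socket


def reverse_lookup(ip):
    """Step 1 helper: IP address -> hostname (PTR record), or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(hostname):
    """Step 2 helper: hostname -> list of IPv4 addresses, empty on failure."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def verify_crawler_ip(ip, domain_suffix, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if both steps of the check succeed."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    # Step 1: the hostname must END in the engine's domain, e.g. ".googlebot.com".
    if not host.endswith(domain_suffix.lower()):
        return False
    # Step 2: the hostname must resolve back to the very same IP address.
    return ip in forward(host)
```

A call would look like `verify_crawler_ip(visitor_ip, ".googlebot.com")`, where `visitor_ip` is whatever address your server reports for the request. Step 2 matters because whoever controls the reverse DNS zone for an IP range can put any hostname in a PTR record; only the forward lookup is answered by the search engine's own DNS.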

Plain and simple. So what should you expect to find in the hostname?

  • Googlebot - will end in .googlebot.com
  • MSNBot - will end in .search.live.com
  • Yahoo!Slurp - will end in .crawl.yahoo.net
  • ASK - will end in .ask.com
  • IAArchiver - will end in .alexa.com
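The list above maps naturally onto a small lookup table. The sketch below is only an illustration: the suffixes come from the list, while the user-agent tokens on the left (and the function names) are my assumptions about what each crawler sends, so adjust them to what you see in your own logs.

```python
# Hostname suffixes from the list above. The user-agent tokens used as keys
# are assumptions for illustration; check them against your own access logs.
CRAWLER_SUFFIXES = {
    "googlebot": ".googlebot.com",
    "msnbot": ".search.live.com",
    "slurp": ".crawl.yahoo.net",
    "teoma": ".ask.com",
    "ia_archiver": ".alexa.com",
}


def expected_suffix(user_agent):
    """Return the hostname suffix a claimed crawler should have, or None."""
    ua = user_agent.lower()
    for token, suffix in CRAWLER_SUFFIXES.items():
        if token in ua:
            return suffix
    return None


def hostname_matches(user_agent, hostname):
    """True/False if the visitor claims to be a known crawler, None otherwise."""
    suffix = expected_suffix(user_agent)
    if suffix is None:
        return None  # not claiming to be a crawler we know; nothing to verify
    return hostname.lower().rstrip(".").endswith(suffix)
```

Note the leading dot in every suffix: it keeps a hostname such as `evilgooglebot.com` from passing as `.googlebot.com`.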

If the host passes the verification above, it is a trustworthy web robot. If a web robot carries a fake user agent, it will not pass the test, so block it.

And to make your life really easy …

I created a simple Bot Verification PHP Script By DNS Lookup. Results are cached when a MySQL connection is available, which avoids slowdowns from repeated lookups. This can also be used for a really dirty trick: a link exchange cloaker!
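To show why caching matters, here is a rough Python sketch of the idea. It is not the PHP script itself: an in-memory dictionary stands in for the MySQL table, and the class name, the one-day default lifetime and the `verify` callback are all assumptions made for the example.

```python
import time


class VerificationCache:
    """Remember per-IP verdicts so DNS is queried once per IP per TTL window."""

    def __init__(self, ttl_seconds=86400, clock=time.time):
        self.ttl = ttl_seconds      # assumed lifetime of a verdict: one day
        self.clock = clock          # injectable clock, handy for testing
        self._entries = {}          # ip -> (verdict, stored_at)

    def get_or_verify(self, ip, verify):
        """Return the cached verdict for ip, calling verify(ip) only on a miss."""
        now = self.clock()
        hit = self._entries.get(ip)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]
        verdict = bool(verify(ip))  # the slow part: two DNS lookups
        self._entries[ip] = (verdict, now)
        return verdict
```

Here `verify` is any function that takes an IP address and returns whether it passed the two-step DNS check. Negative verdicts are cached too, so a fake bot hammering the site does not trigger a fresh DNS round trip on every hit.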

Keep in mind nothing is 100% guaranteed, so if your purposes for using this are not really legit, you might / will go down!

5ubliminal's TellinYa.com SEM & SEO Blog © 2007 - All rights reserved unless mentioned otherwise .