Verify search engine crawler by Backward and Forward DNS Lookup : 5ubliminal's TellinYa

<a href="http://www.tellinya.com/art2/68/">Verify search engine crawler by Backward and Forward DNS Lookup : 5ubliminal's TellinYa</a>

After explaining the theory of detecting spoofed user agent crawlers, here comes the practical method. It can also be used for a really dirty trick: a link exchange cloaker!

The DNS Crawler Identity Lookup Functionality

The most important thing to know about DNS lookups is that they are slow; a single lookup can take over 5 seconds. Without proper caching this method is simply unusable: your site will look so slow to crawlers that they might take action against you.

To get around the speed issue I added MySQL caching to this script. Make sure you have a MySQL connection open before you run it! It runs without MySQL, but it will then make DNS lookups on every request.

If you have no MySQL access, the script is not viable for you as it stands! Convert it to write to files instead.
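For readers without MySQL, a file-backed cache might look roughly like the sketch below. The function names (`fileCacheLoad`, `fileCacheGet`, `fileCachePut`), the single JSON file layout and the `$now` parameter are all assumptions made for illustration; none of them are part of the class in this article.

```php
<?php
// Hypothetical file-backed stand-in for the MySQL cache (not part of eVerifyBot).
// The file holds one JSON object: IP => [host, spoof flag, time stored].

// Read the whole cache; a missing or unreadable file counts as an empty cache.
function fileCacheLoad($path) {
    if (!is_file($path)) {
        return array();
    }
    $data = json_decode(file_get_contents($path), true);
    return is_array($data) ? $data : array();
}

// Return the cached record for $ip, or null if it is missing or older than $maxAgeDays.
function fileCacheGet($path, $ip, $maxAgeDays = 7, $now = null) {
    $now = ($now === null) ? time() : $now;
    $cache = fileCacheLoad($path);
    if (!isset($cache[$ip])) {
        return null;
    }
    if ($cache[$ip]['time'] < $now - $maxAgeDays * 86400) {
        return null; // expired
    }
    return $cache[$ip];
}

// Store a lookup result; failures (spoofs) are cached as well as successes.
function fileCachePut($path, $ip, $host, $spoof, $now = null) {
    $now = ($now === null) ? time() : $now;
    $cache = fileCacheLoad($path);
    $cache[$ip] = array('host' => $host, 'spoof' => (bool)$spoof, 'time' => $now);
    // LOCK_EX stops two requests from interleaving their writes to the file.
    return file_put_contents($path, json_encode($cache), LOCK_EX) !== false;
}
```

The lock only covers the write itself, so two simultaneous requests can still overwrite each other's new entry. For a cache that merely costs one repeated DNS lookup, that is probably acceptable.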

The class is very easy to use and lets you verify by IP alone or by IP and user agent. Even though you can verify by IP alone, do not verify just any IP visiting your site: every uncached visitor would trigger a DNS lookup, and your site would run as if in slow-motion replay.

So the only safe method is to verify by User Agent and IP. The User Agent is checked first and, only if it matches a known robot, the IP is checked as well. If the IP fails verification, the script tells you.
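The backward-then-forward lookup behind the IP check can be sketched on its own. The function name `isForwardConfirmed` and the injectable `$reverse` / `$forward` callables are assumptions added here so the logic can be tried without touching real DNS; by default they fall back to PHP's `gethostbyaddr()` and `gethostbyname()`.

```php
<?php
// Minimal sketch of a backward + forward DNS check (hypothetical helper, not the class below).
// $reverse and $forward are injectable so the logic can be tested without network access.
function isForwardConfirmed($ip, $requiredSuffix, $reverse = 'gethostbyaddr', $forward = 'gethostbyname') {
    $host = call_user_func($reverse, $ip);
    // gethostbyaddr() returns the IP unchanged (or false) when no host name resolves.
    if ($host === false || $host === $ip) {
        return false;
    }
    // The host must end with the crawler's domain, e.g. ".googlebot.com".
    $suffixLen = strlen($requiredSuffix);
    if ($suffixLen === 0 || strcasecmp(substr($host, -$suffixLen), $requiredSuffix) !== 0) {
        return false;
    }
    // The forward lookup must lead back to the original IP; otherwise treat it as a spoof.
    return call_user_func($forward, $host) === $ip;
}
```

The forward step matters because whoever controls an IP block also controls its reverse DNS record and can make it claim any host name, but cannot make the search engine's own domain resolve back to that IP.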

Results are kept in the cache for a specified time (configurable in the script). Both successes and failures are cached, which makes the script go as fast as it can.

Now check out the script; I'll explain more below.

The DNS Lookup PHP Script

This is the script that does the trick.

<?php
// ------------------------------------------------------------------------
class eVerifyBot{
    // ---------------------------------------------------------------------------
    // -- Db Access Details
    // ---------------------------------------------------------------------------
    var $dbLink=0;                  // MySQL link resource (0 = use the default connection)
    var $dbTable="dnsbotcache";     // Cache table name
    var $cacheExpire=7;             // Cache lifetime in days
    // -- Registered crawlers, each as array(agentRegExp, verifyHost, returnCode)
    var $webRobots = array();
    // -- Internal Function: register one crawler
    function addRobot($agentRegExp,$verifyHost,$returnCode){
        array_push($this->webRobots,array($agentRegExp,$verifyHost,$returnCode));
    }
    
    // ---------------------------------------------------------------------------
    //-- Register the known crawlers: User Agent pattern, host suffix, return code
    // ---------------------------------------------------------------------------
    function initBots(){
        $this->addRobot("/(google|mediapartners)/i",".googlebot.com","google");
        $this->addRobot("/(slurp)/i",".crawl.yahoo.net","yahoo");
        $this->addRobot("/(msnbot)/i",".search.live.com","msn");
        $this->addRobot("/(ask|teoma)/i",".ask.com","ask");
        $this->addRobot("/(archiver)/i",".alexa.com","alexa");
    }
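    // ---------------------------------------------------------------------------
    //-- Constructor: optional MySQL link and cache table name
    // ---------------------------------------------------------------------------
    function __construct($DbLink=null,$Table="dnsbotcache"){
        if(is_resource($DbLink) && get_resource_type($DbLink)=="mysql link"){
            $this->dbLink    =$DbLink;
        }
        $this->dbTable=$Table;
        $this->createTable();
        $this->initBots();
    }
    // ---------------------------------------------------------------------------
    //-- PHP 4 style constructor, forwards to __construct()
    // ---------------------------------------------------------------------------
    function eVerifyBot($DbLink=null,$Table="dnsbotcache"){
        return $this->__construct($DbLink,$Table);
    }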
    
    // ---------------------------------------------------------------------------
    // -- Create the table in the database and purge expired entries
    // ---------------------------------------------------------------------------
    function createTable(){
        $this->doQuery(
            "CREATE TABLE IF NOT EXISTS `".$this->dbTable."` (
                `DnsCacheID`        int unsigned NOT NULL auto_increment,
                `DnsCacheIP`        varchar(15) NOT NULL,
                `DnsCacheCC`        varchar(3) NOT NULL default '',
                `DnsCacheDate`        datetime NOT NULL,
                `DnsCacheHost`        tinytext NOT NULL,
                `DnsCacheSpoof`        tinyint(1) NOT NULL default '0',
                UNIQUE KEY `UNIQUES` (`DnsCacheIP`),
                PRIMARY KEY  (`DnsCacheID`)
            ) ENGINE=MyISAM;"
        );
        // Drop entries older than cacheExpire days, plus lookups that never resolved
        $CacheLimit = time()-($this->cacheExpire*24*60*60);
        $this->doQuery("DELETE FROM `".$this->dbTable."`
            WHERE `DnsCacheDate`<FROM_UNIXTIME(".$CacheLimit.")
            OR (`DnsCacheIP`=`DnsCacheHost`)");
    }
    
    // ---------------------------------------------------------------------------
    // -- Internal function for mysql queries!
    // -- Uses the legacy mysql_* extension (removed in PHP 7.0).
    // ---------------------------------------------------------------------------
    function doQuery($query){
        // Use the supplied link if there is one, else the default connection
        if(is_resource($this->dbLink) && get_resource_type($this->dbLink)=="mysql link")
            return mysql_query($query,$this->dbLink);
        return mysql_query($query);
    }
    
    // ---------------------------------------------------------------------------
    // -- Cached backward + forward DNS lookup.
    // -- Returns the host name, or the IP itself if the lookup failed or is spoofed.
    // ---------------------------------------------------------------------------
    function getHostByIp($ipAddress){
        //-- Try the cache first
        $safeIp = mysql_real_escape_string($ipAddress);
        $res = $this->doQuery("SELECT * FROM `".$this->dbTable."`
            WHERE `DnsCacheIP`='".$safeIp."' LIMIT 1");
        if($res && mysql_num_rows($res)){
            $rec = mysql_fetch_assoc($res);
            return (($rec['DnsCacheSpoof']==1) ? $ipAddress : $rec['DnsCacheHost']);
        }
        //-- Backward lookup (IP -> host), then forward lookup (host -> IP)
        $hostName = gethostbyaddr($ipAddress);
        $revIpAddress = (($hostName==$ipAddress) ? $ipAddress : gethostbyname($hostName));
        //-- Spoof if the forward lookup gives a different IP, or nothing resolved
        $spoof = ($revIpAddress != $ipAddress) || ($hostName==$ipAddress);
        $this->doQuery("INSERT INTO `".$this->dbTable."`
            (`DnsCacheIP`,`DnsCacheHost`,`DnsCacheDate`,`DnsCacheSpoof`)
            VALUES
            ('".$safeIp."','".mysql_real_escape_string($hostName)."',NOW(),".(int)$spoof.")"
        );
        return ($spoof ? $ipAddress : $hostName);
    }
    
    // ---------------------------------------------------------------------------
    // -- Verify an IP by DNS. With $verifyString: true if the host ends with it.
    // -- Without: the return code of the matching crawler, or false.
    // ---------------------------------------------------------------------------
    function verifyBot($ipAddress,$verifyString=null){
        $BotHost = $this->getHostByIp($ipAddress);
        if($BotHost == $ipAddress) return false;
        if(isset($verifyString)){
            // strstr() returns the host from the first match onward,
            // so equality means the host ends with $verifyString
            $CheckStr = strstr($BotHost,$verifyString);
            return ($CheckStr == $verifyString);
        }
        reset($this->webRobots);
        foreach($this->webRobots as $webRobot){
            if(!preg_match("/".str_replace(".",'\.',$webRobot[1])."$/i",$BotHost)){
                continue;
            }
            return $webRobot[2];
        }
        return false;
    }
    
    // ---------------------------------------------------------------------------
    // -- Verify a visitor by User Agent and IP (defaults to the current request).
    // -- Returns: string = verified bot code, true = not claiming to be a bot,
    // --          false = claims to be a bot but failed DNS verification (spoof)
    // ---------------------------------------------------------------------------
    function verifyAgent($ipAddress=false,$userAgent=false){
        if(!is_string($userAgent))
            $userAgent=isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        if(!is_string($ipAddress)) $ipAddress=$_SERVER['REMOTE_ADDR'];
        reset($this->webRobots);
        foreach($this->webRobots as $webRobot){
            if(preg_match($webRobot[0],$userAgent)){
                if($this->verifyBot($ipAddress,$webRobot[1]))
                    return $webRobot[2];
                return false;
            }
        }
        return true;
    }
    // ---------------------------------------------------------------------------
}
// ------------------------------------------------------------------------
?>
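The first stage of `verifyAgent()`, matching the User Agent against the registered patterns, can be tried on its own without a database. `matchAgent` is a hypothetical standalone helper written for this illustration; the patterns are copied from `initBots()`.

```php
<?php
// Hypothetical standalone version of the User Agent stage of verifyAgent().
// Returns array(code, hostSuffix) for the first matching pattern, or null for ordinary visitors.
function matchAgent($userAgent, array $robots) {
    foreach ($robots as $robot) {
        list($pattern, $hostSuffix, $code) = $robot;
        if (preg_match($pattern, $userAgent)) {
            return array($code, $hostSuffix);
        }
    }
    return null;
}

// A subset of the patterns registered in initBots().
$robots = array(
    array("/(google|mediapartners)/i", ".googlebot.com", "google"),
    array("/(slurp)/i", ".crawl.yahoo.net", "yahoo"),
    array("/(msnbot)/i", ".search.live.com", "msn"),
);
```

A match here only means the visitor claims to be that crawler. The DNS check against the returned host suffix still decides whether the claim is true.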

How you use it!

Make sure you have a MySQL connection open before using this script. Below is the way I integrated it into the site!

<?php
// ---------------------------------------------------------------------------------
function httpError($num,$msg="HTTP Error"){
    header("HTTP/1.1 $num Error Occurred");
    if(!isset($msg)){
        $msg="Error N/A";
    }
    echo "<title>$num - $msg</title>\n<pre>$num - $msg</pre>";
    exit();
}
// ---------------------------------------------------------------------------------
// Cache is hard-coded in the class to 7 days. You can edit it there!
$verifyBot = new eVerifyBot();
// Without parameters this function will verify current visitor
// You can use an IP and Agent as parameters to test your own stuff!
$botID = $verifyBot->verifyAgent();
// If BotID is a string then the bot is cool!
// If BotID is true then no bot User Agent was matched.
// If BotID is false then we have a spoof claiming to be a bot!
$botValid = ($botID===true) ? true : is_string($botID);
// -- If a spoof is detected send a 500. It means server problems, so
// -- real bots will stop and return later! Just in case.
if($botValid === false){ httpError(500,"Die Spoof!"); }
// ---------------------------------------------------------------------------------
?>

This will help you keep evil crawlers away. Still, keep filters up for the aggressive ones that take too much too fast.

5ubliminal's TellinYa.com SEO & SEM Blog © 2007 - All rights reserved unless mentioned otherwise .
Rendered On : [Sunday, 06 July, 2008 - 04:22:15 GMT]   No Ajax / Flash Used Here
" Verify search engine crawler by Backward and Forward DNS Lookup : 5ubliminal's TellinYa "