After explaining the theory behind detecting crawlers with spoofed user agents, here is the practical method. It can also be used for a really dirty trick: a link exchange cloaker!
The most important thing to know about DNS lookups is that they are slow. A single lookup can take more than 5 seconds, so without proper caching this method is simply unusable. Your site will look so slow to the crawlers that they might take action against you.
To get around the speed issue I added MySQL caching to this script. Make sure you have a MySQL connection open before you run it! It runs without MySQL, but then it makes DNS lookups on every request.
If you have no MySQL access, the script will not be viable for you as it stands! Convert it to write to files.
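If you go the file route, something along these lines could stand in for the MySQL table. This is only a sketch: the function names `fileCachePath`, `fileCachePut` and `fileCacheGet`, the one-file-per-IP layout and the JSON record format are my own assumptions, not part of the class.

```php
<?php
// Hypothetical file-backed cache: one small JSON file per IP address.
// Function names and record layout are assumptions for illustration only.

function fileCachePath($cacheDir, $ipAddress) {
    // Hash the IP so the file name never contains unexpected characters.
    return rtrim($cacheDir, '/') . '/' . md5($ipAddress) . '.json';
}

function fileCachePut($cacheDir, $ipAddress, $hostName, $spoof) {
    $record = array(
        'ip'    => $ipAddress,
        'host'  => $hostName,
        'spoof' => (bool)$spoof,
        'time'  => time(),
    );
    // LOCK_EX avoids half-written files when two requests race.
    $written = file_put_contents(fileCachePath($cacheDir, $ipAddress),
                                 json_encode($record), LOCK_EX);
    return $written !== false;
}

function fileCacheGet($cacheDir, $ipAddress, $expireDays = 7) {
    $path = fileCachePath($cacheDir, $ipAddress);
    if (!is_file($path)) return null;                  // never seen this IP
    $record = json_decode(file_get_contents($path), true);
    if (!is_array($record) || !isset($record['time'])) return null;
    if ($record['time'] < time() - $expireDays * 86400) {
        unlink($path);                                 // expired entry
        return null;
    }
    return $record;
}
```

A `null` result means "not cached, do the DNS lookups"; anything else is the stored record, including failed lookups, so spoofers do not trigger new lookups on every hit.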
The class is very easy to use and lets you verify by IP alone or by IP and user agent. Even though you can verify by IP alone, do not verify every IP that visits your site. Each uncached IP costs DNS lookups, and that will make your site run as if in slow-motion replay.
So the only safe method is to verify by user agent and IP. The user agent is checked first and, only if it matches a known robot, the IP is checked as well. If the IP fails verification, the class tells you by returning false.
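Stripped of caching, the IP part of that check boils down to something like the following. Treat it as a sketch under assumptions: `isVerifiedCrawler` and its two optional resolver parameters are names I made up so the logic can be tried without touching real DNS. By default it falls back to PHP's `gethostbyaddr()` and `gethostbyname()`.

```php
<?php
// Sketch of forward-confirmed reverse DNS. isVerifiedCrawler() and the
// injectable resolvers are hypothetical; they are not part of the class.
function isVerifiedCrawler($ipAddress, $hostSuffix,
                           $reverse = 'gethostbyaddr',
                           $forward = 'gethostbyname') {
    // Step 1: reverse lookup. On failure gethostbyaddr() returns the IP
    // unchanged (or false for malformed input).
    $hostName = call_user_func($reverse, $ipAddress);
    if (!is_string($hostName) || $hostName === $ipAddress) return false;

    // Step 2: the host name must end with the crawler's domain.
    $hostTail = substr($hostName, -strlen($hostSuffix));
    if (strcasecmp($hostTail, $hostSuffix) !== 0) return false;

    // Step 3: the forward lookup must lead back to the same IP, otherwise
    // anyone controlling their own reverse DNS could fake step 2.
    return call_user_func($forward, $hostName) === $ipAddress;
}
```

Calling `isVerifiedCrawler($_SERVER['REMOTE_ADDR'], '.googlebot.com')` would run the live lookups. Note that `gethostbyname()` returns a single IPv4 address, so a host with several A records may need `gethostbynamel()` instead.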
Data is cached and kept for a specified time (configurable in the script). Both successes and failures are cached, making this script go as fast as it can go.
Now check out the script; below it I'll explain more in depth.
This is the script that does the trick.
<?php
// ------------------------------------------------------------------------
class eVerifyBot{
// ---------------------------------------------------------------------------
// -- Db Access Details
// ---------------------------------------------------------------------------
var $dbLink=0;
var $dbTable="dnsbotcache";
var $cacheExpire=7;
// --
var $webRobots = array();
// -- Internal Function
function addRobot($agentRegExp,$verifyHost,$returnCode){
array_push($this->webRobots,array($agentRegExp,$verifyHost,$returnCode));
}
// ---------------------------------------------------------------------------
//-- Register the known crawlers
// ---------------------------------------------------------------------------
function initBots(){
$this->addRobot("/(google|mediapartners)/i",".googlebot.com","google");
$this->addRobot("/(slurp)/i",".crawl.yahoo.net","yahoo");
$this->addRobot("/(msnbot)/i",".search.live.com","msn");
$this->addRobot("/(ask|teoma)/i",".ask.com","ask");
$this->addRobot("/(archiver)/i",".alexa.com","alexa");
}
function __construct($DbLink=0,$Table="dnsbotcache"){
// Keep the link only if it really is an open MySQL connection
if(is_resource($DbLink) && get_resource_type($DbLink)=="mysql link"){
$this->dbLink =$DbLink;
}
$this->dbTable = $Table;
$this->createTable();
$this->initBots();
}
// ---------------------------------------------------------------------------
//-- PHP 4 style constructor, forwards to __construct()
// ---------------------------------------------------------------------------
function eVerifyBot($DbLink=0,$Table="dnsbotcache"){
return $this->__construct($DbLink,$Table);
}
// ---------------------------------------------------------------------------
// -- Create the table in the database
// ---------------------------------------------------------------------------
function createTable(){
$this->doQuery(
"CREATE TABLE IF NOT EXISTS `".$this->dbTable."` (
`DnsCacheID` int unsigned NOT NULL auto_increment,
`DnsCacheIP` varchar(15) NOT NULL,
`DnsCacheCC` varchar(3) NOT NULL default '',
`DnsCacheDate` datetime NOT NULL,
`DnsCacheHost` tinytext NOT NULL,
`DnsCacheSpoof` tinyint(1) NOT NULL default '0',
UNIQUE KEY `UNIQUES` (`DnsCacheIP`),
PRIMARY KEY (`DnsCacheID`)
) ENGINE=MyISAM;"
);
$CacheLimit = time()-($this->cacheExpire*24*60*60);
$this->doQuery("DELETE FROM `".$this->dbTable."`
WHERE `DnsCacheDate`<FROM_UNIXTIME(".$CacheLimit.")
OR (`DnsCacheIP`=`DnsCacheHost`)");
}
// ---------------------------------------------------------------------------
// -- Internal functions for mysql queries!
// ---------------------------------------------------------------------------
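function doQuery($query){
// Note: the mysql_* functions were removed in PHP 7; port to mysqli there.
if(is_resource($this->dbLink))
return mysql_query($query,$this->dbLink);
// No link given: fall back to the last opened connection
return mysql_query($query);
}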
// ---------------------------------------------------------------------------
// -- Cached reverse DNS lookup with forward confirmation
// ---------------------------------------------------------------------------
function getHostByIp($ipAddress){
//-- Try the cache first
$safeIp = mysql_real_escape_string($ipAddress);
$res = $this->doQuery("SELECT * FROM `".$this->dbTable."`
WHERE `DnsCacheIP`='".$safeIp."' LIMIT 1");
if($res && mysql_num_rows($res)){
$rec = mysql_fetch_assoc($res);
return (($rec['DnsCacheSpoof']==1) ? $ipAddress : $rec['DnsCacheHost']);
}
//-- Not cached: reverse lookup, then forward lookup to confirm
$hostName = gethostbyaddr($ipAddress);
if(!is_string($hostName)) $hostName = $ipAddress; // malformed input
$revIpAddress = (($hostName==$ipAddress) ? $ipAddress : gethostbyname($hostName));
$spoof = ($revIpAddress != $ipAddress) || ($hostName==$ipAddress);
// Host names come from DNS, which the visitor may control: escape them
$this->doQuery("INSERT INTO `".$this->dbTable."`
(`DnsCacheIP`,`DnsCacheHost`,`DnsCacheDate`,`DnsCacheSpoof`)
VALUES
('".$safeIp."','".mysql_real_escape_string($hostName)."',NOW(),".(int)$spoof.")"
);
return ($spoof ? $ipAddress : $hostName);
}
// ---------------------------------------------------------------------------
// -- Verify that an IP resolves to a known crawler host
// ---------------------------------------------------------------------------
function verifyBot($ipAddress,$verifyString=null){
$BotHost = $this->getHostByIp($ipAddress);
// getHostByIp() hands back the IP itself when the lookup failed or was spoofed
if($BotHost == $ipAddress) return false;
if(isset($verifyString)){
// The host name must end with the expected domain
$hostTail = substr($BotHost,-strlen($verifyString));
return (strcasecmp($hostTail,$verifyString)==0);
}
// No domain given: try every known crawler and return its code
foreach($this->webRobots as $webRobot){
if(!preg_match("/".preg_quote($webRobot[1],"/")."$/i",$BotHost)){
continue;
}
return $webRobot[2];
}
return false;
}
// ---------------------------------------------------------------------------
// -- Check the user agent first, then confirm the IP for claimed crawlers
// ---------------------------------------------------------------------------
function verifyAgent($ipAddress=false,$userAgent=false){
if(!is_string($userAgent)) $userAgent=isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";
if(!is_string($ipAddress)) $ipAddress=$_SERVER['REMOTE_ADDR'];
foreach($this->webRobots as $webRobot){
if(preg_match($webRobot[0],$userAgent)){
// Claims to be a crawler: return its code if DNS agrees, false if spoofed
if($this->verifyBot($ipAddress,$webRobot[1]))
return $webRobot[2];
return false;
}
}
// Not a crawler user agent at all
return true;
}
// ---------------------------------------------------------------------------
}
// ------------------------------------------------------------------------
?>
Make sure you have a MySQL connection open before using this script. Below is how I integrated it into the site!
<?php
// ---------------------------------------------------------------------------------
function httpError($num,$msg="HTTP Error"){
header("HTTP/1.1 $num Error Occurred");
if(!isset($msg) || $msg===""){
$msg="Error N/A";
}
echo "<title>$num - $msg</title>\n<pre>$num - $msg</pre>";
exit();
}
// ---------------------------------------------------------------------------------
// Cache is hard-coded in the class to 7 days. You can edit it there!
// $dbLink is assumed to hold your already-open MySQL connection.
$verifyBot = new eVerifyBot($dbLink);
// Without parameters this function will verify current visitor
// You can use an IP and Agent as parameters to test your own stuff!
$botID = $verifyBot->verifyAgent();
// If BotID is string then bot is cool!
// If BotID is true then no known bot User Agent was matched.
// If BotID is false then we have a spoof claiming to be a bot!
$botValid = ($botID===true) ? true : is_string($botID);
// -- If spoof detected send a 500. Means server problems and
// -- Real bots will stop and return later! Just in case.
if($botValid === false){ httpError(500, "Die Spoof!"); }
// ---------------------------------------------------------------------------------
?>
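Because `verifyAgent()` returns three different kinds of value, a loose test such as `if($botID)` would lump verified bots and ordinary visitors together. A tiny helper can make the distinction explicit, for logging for example. `describeVisitor` is a hypothetical name, not something the class provides.

```php
<?php
// Hypothetical helper: turns the three-state result of verifyAgent()
// into a readable label. Strict comparisons keep the states apart.
function describeVisitor($botID) {
    if ($botID === false)  return 'spoof';          // claims to be a bot, DNS disagrees
    if (is_string($botID)) return 'bot:' . $botID;  // verified crawler, e.g. "bot:google"
    return 'visitor';                               // no crawler user agent matched
}
```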
This will help you keep evil crawlers away. Still, keep your filters up for the aggressive ones that take too much too fast.