First I have to say that many others have considered addressing this problem by simply adding the X-Robots-Tag HTTP header whenever the originating requester fails validation as a real, established crawler. I am against this because the proxies that take websites down may be set up on purpose and may strip this tag away.
Actually, thinking about it more, if the proxies are set up on purpose for this, there is no protection unless you are willing to verify every address that hits your site. Specially designed proxies might alter the User-Agent, since they are the hooks between you and the real bots. If they identify as real browsers and actually retrieve pages on behalf of crawlers, the only way to protect yourself is, as I said, to validate every visitor that hits your site.
Not really a viable solution, but still … I do not think most spammers and black-hats have the financial resources to achieve such firepower. Don't take that last statement for granted, though: where there's a will there's a way, and where 5+ figures roll there's always a will!
Several months ago the major search engines provided methods of verifying their robots. Each crawler's IP resolves to a hostname, which resolves back to the same IP, and the hostname belongs to a specific domain, as you can read in the article linked above.
The moment I found out about this I quickly coded it and shared the code to verify robots by agent and IP for free.
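The round trip described above (IP to hostname, hostname back to IP, plus a check on the hostname's domain) can be sketched roughly like this. This is not the script I shared, just a minimal Python sketch. The domain suffixes in CRAWLER_DOMAINS and the function names are my assumptions for illustration, so check each search engine's own documentation for the real list.

```python
import socket

# Assumed hostname suffixes, for illustration only.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of IPv4 addresses for hostname (empty on failure)."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def is_verified_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """True only if ip -> hostname -> ip round-trips inside a crawler domain."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.lower().rstrip(".")
    # The hostname must belong to one of the crawler domains.
    if not hostname.endswith(CRAWLER_DOMAINS):
        return False
    # The hostname must resolve back to the very IP that hit the site.
    return ip in forward(hostname)
```

The two lookups are passed in as parameters only so the check can be tried out with canned data instead of live DNS. In real use you would call is_verified_crawler(ip) with the visitor's address and probably cache the result.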
Still, I found no explanation for such a unanimous decision. At first it appeared as if they wanted to encourage cloaking!? Cute … but I don't fall for these kinds of traps.
Reading through forums I found the stories of some internet `superstars` who had been proxied out of their SERPs. Nice … and after I read how their sites were leveled, I connected it to the decision to provide methods of robot verification.
So you have the evil black-hat SEO, and he sees your site and says: You must go … you block my sunlight (search exposure)!
He has a list of public proxies. A public proxy is a web page where you enter a URL and it lets you visit that URL while hiding you behind itself. The proxy loads the pages for you and rewrites the links so you can visit site1.com from site2.com without seeing any difference. It will feel like visiting site1.com, except that site1.com will think site2.com is visiting and not the evil SEO.
For example: visit this link to see how Altavista Babelfish Translator can take over Google.com. This is exactly what happened to the `superstars`!
By pointing 1000s of proxies at you he will clone your site all around the web, confusing Googlebot to the point where it won't know which copy is you (the original owner of your content). I think rank comes into play here, and big sites crush little ones. Read about my 302 hijacks back in the old days! The algorithm may make one mistake and retire you from the SERPs. You will be replaced by the proxy. (Or even banned!) That would not be cool! And this does happen when machines have to make decisions. They lack common sense …
Still, this is not a 302 hijack. Where a 302 sends you elsewhere, this hijack brings that elsewhere to you, unwillingly.
There are actually 2 things you can do! I will explain each of them below.
Keep in mind that the only difference between the real Googlebot and the proxy will be the IP address, which will not resolve to a crawler hostname. User agents and headers are usually duplicated too.
1st thing: change all your links from root-relative to site-absolute. If you take a look at my source code you will see that all links contain http://www.mysitename.com/. I did not code this by hand; I used a PHP output buffering tweak to change them when the page is generated. If you expose relative internal links, crawlers leeching through proxies will keep crawling your site through the proxy, maximizing the damage. If you have absolute paths the proxy might not encode them, and Google will not see your entire site through it. But a proxy should encode all links. If I coded it, it would.
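My own tweak is done in PHP with output buffering, but the idea is simple enough to sketch in a few lines of Python. This is only an illustration, not my actual code. The function name absolutize_links, the regex and the choice of attributes (href, src, action) are assumptions, and a regex will miss links built in JavaScript or CSS.

```python
import re

# Site root taken from the example above; replace with your own.
SITE_ROOT = "http://www.mysitename.com"

# Matches href="/...", src="/..." or action="/..." but skips
# protocol-relative URLs that start with "//".
_ROOT_RELATIVE = re.compile(
    r'''(\b(?:href|src|action)\s*=\s*["'])/(?!/)''', re.IGNORECASE
)


def absolutize_links(html, site_root=SITE_ROOT):
    """Rewrite root-relative links in html so they carry the full site root."""
    root = site_root.rstrip("/")
    # A function replacement avoids backslash surprises in the site root.
    return _ROOT_RELATIVE.sub(lambda m: m.group(1) + root + "/", html)
```

You would run the finished page through absolutize_links() just before sending it out. Links that are already absolute are left alone.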
The fix above relies on dumb proxy coders. Not actually safe! Let's see what's behind door number 2.
2nd thing: verify user agents and crawler IP addresses. Use the verification methods the search engines expose and check each crawler by forward and reverse DNS. If verification fails, send a 500 - Server Issues error. Real bots should take a break and come back later, and evil bots will not get anything. Why do I say real bots too … ? Because if the search engines mess up their DNS in any way, you will block them too ;) I see no fake robots trapped and I use this on my site, so … check your logs and stay proactive.
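The decision itself can be sketched like this in Python. Again, this is a rough illustration and not the script I use. The user agent tokens in BOT_TOKENS and the function names are assumptions, and verify stands for whatever forward and reverse DNS check you already have.

```python
# Assumed user agent substrings, for illustration only.
BOT_TOKENS = ("googlebot", "slurp", "msnbot")


def claims_to_be_crawler(user_agent):
    """True if the User-Agent header names one of the known crawlers."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)


def status_for_request(user_agent, ip, verify):
    """Return the HTTP status to send: 500 for unverified 'crawlers', else 200.

    verify is any callable that takes an IP string and returns True when
    the forward and reverse DNS check passes.
    """
    if claims_to_be_crawler(user_agent) and not verify(ip):
        return 500
    return 200
```

Note that this only catches visitors that announce themselves as crawlers. A proxy that swaps in a browser User-Agent passes straight through, which is the limitation I described earlier.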
Use the scripts mentioned here and stay safer (You can never be safe but you can be safer!). And try not to use this to cloak content. You will eventually go down. I know as I've been there!