My site relies heavily on .htaccess, and every now and then I tweak it based on visitor behaviour. Tweaking a live site invites disaster, and one day, while making a change, I saw in my logs that Googlebot had hit me five times during that window.
The site had answered those requests with a 403 and my blood froze. 403 = Forbidden, and according to the W3C HTTP specs the request SHOULD NOT be repeated.
I think you can feel my pain here: I was facing relocating the 3 URLs, placing redirects and tweaking much more, as the whole site is built on a structure that is tough to alter at URL level.
I took a deep breath and just before getting to work had an idea:
I should ask others and get some feedback before diving in. So I started this thread on WebMasterWorld.com, and I did get valuable input.
After 2 more days of analyzing logs I finally reached my conclusions.
301 is the only safe way to redirect and avoid problems with web robots, and hence with search engines. When you send out a 301, this is what you actually tell the web crawler:
Follow this URL and index the destination page with its own URL. I'm giving full credit to that page so that page is the one important for you.
To explain: if http://domain/url1 sends a 301 to http://domain/url2, then a search in the search engines will turn up http://domain/url2 indexed with its own content. http://domain/url1 passes all its credit to http://domain/url2.
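To make the url1 to url2 case concrete, here is a minimal sketch in Python of what the server sends back. The `MOVED` table and the `respond` function are made up for illustration; on my site the real work is done by .htaccess rules, not by this code.

```python
from http import HTTPStatus

# Hypothetical table of moved pages: old path -> new path on the same site.
MOVED = {"/url1": "/url2"}


def respond(path):
    """Return (status_code, headers) for a requested path."""
    if path in MOVED:
        # 301 Moved Permanently: the Location header names the URL
        # that the crawler should index from now on.
        return HTTPStatus.MOVED_PERMANENTLY.value, {"Location": MOVED[path]}
    # Anything not in the table is served normally.
    return HTTPStatus.OK.value, {}


print(respond("/url1"))
```

The crawler that asks for /url1 gets a 301 plus a Location header pointing at /url2, and that is all it needs to move the credit over.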
The 302 code is tricky and should always be used with caution. Never use it offsite: make sure no page of yours redirects to an external page outside your domain with a 302. This is as dangerous as it gets. To understand how the 302 works, read what it actually tells a search engine crawler:
Follow this URL and index the destination page with this URL. I'm not giving full credit to that page and I might change my opinion anytime. So keep my URL with that page's content.
To explain: if http://domain/url1 sends a 302 to http://domain/url2, then a search in the search engines will turn up http://domain/url1 indexed with the contents of http://domain/url2. http://domain/url1 just borrows content from http://domain/url2.
Warning: here lies the problem. If you 302 to an offsite page, you simply claim that page's content. But, as you do not own that site, you will be penalized or even banned, as this is content theft. So keep your 302s on-site.
True Story: This is my own experience with the 302 code back in 2005. I managed to outrank some new sites by sending a 302 to them and gaining more links to my redirected URL than they had. Later on, I was banned ;)
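The "keep 302s on-site" rule can be written down as a small guard. This is only a sketch under my own assumptions: `redirect_status` is a hypothetical helper, and falling back to a 301 for offsite targets is my reading of the rule above, not something the specs require.

```python
from urllib.parse import urljoin, urlsplit


def redirect_status(source_url, target_url, temporary=False):
    """Pick a redirect code, never allowing a 302 that leaves the site."""
    # Resolve relative targets such as "/url2" against the source URL.
    resolved = urljoin(source_url, target_url)
    same_site = urlsplit(source_url).hostname == urlsplit(resolved).hostname
    if temporary and same_site:
        return 302
    # Offsite targets (and all permanent moves) get a 301.
    return 301


print(redirect_status("http://domain/url1", "/url2", temporary=True))
print(redirect_status("http://domain/url1", "http://other.example/page", temporary=True))
```

The first call stays on the same host, so the temporary redirect is allowed; the second one leaves the site, so the helper refuses the 302.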
The 403 code is tricky and, according to the specs, very dangerous! Read below to understand what it tells a search engine crawler:
This URL exists but I don't want you anywhere near it. So take a hike, forget about it and don't come back.
But the search crawlers and their engineers are smart and don't rely on us too much regarding HTTP codes. So when you issue a 403 they will keep coming back and test the URL regularly. This is great, because a 403 can be issued by mistake and, taken strictly by the specs, that would kill the URL.
And what if a domain was 403ed at root level by a previous owner? Taken literally, that would lock out any new owner in the eyes of the beholder (the search engines).
So the search engines anticipated these situations and will keep on coming. So relax.
And by the way, after I issued my 403s, Googlebot hit those pages again the very next day!
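If you want to check your own logs for the same thing, something along these lines will do. It is a rough sketch that assumes the Apache combined log format; the three sample lines are made up (documentation IP addresses, placeholder dates) and are not taken from my real logs. Matching on the user agent string alone does not prove that a hit really came from Google.

```python
import re

# Pulls the request path and the status code out of a combined-format log line.
LINE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:00:00:00 +0000] "GET /url1 HTTP/1.1" 403 199 "-" "%s"' % GOOGLEBOT_UA,
    '192.0.2.20 - - [01/Jan/2000:00:00:05 +0000] "GET /url1 HTTP/1.1" 403 199 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [01/Jan/2000:00:00:09 +0000] "GET /url2 HTTP/1.1" 200 5120 "-" "%s"' % GOOGLEBOT_UA,
]


def googlebot_hits(lines, status):
    """Return the paths Googlebot requested that were answered with `status`."""
    hits = []
    for line in lines:
        match = LINE.search(line)
        if match and "Googlebot" in line and int(match.group("status")) == status:
            hits.append(match.group("path"))
    return hits


print(googlebot_hits(SAMPLE_LOG, 403))
```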
The 404 code should be used when a URL is unavailable. It means the page was not found, but it does not imply that the page does not exist. So this code tells the search engine:
This page was not found but there is no guarantee that the URL does not exist. Or maybe I have some problems. Try again later maybe it'll return.
So this URL too will be crawled regularly.
The 410 status code is tricky and, according to the specs, very dangerous! Read below to understand what it tells a search engine crawler:
This page existed but no longer exists. You should not visit this address anymore.
But the same exceptions apply as for the 403, and crawlers will keep visiting the page often too.
When your site is experiencing changes or problems, 500 is the code to issue. Unlike the rest, it works at server level, and when issued this is what it tells the robots:
I'm having a bad hair day. Not just the requested URL but the entire site. I'm having problems so come back later.
The robot will stop crawling when sent a 500 HTTP error and will return later. This does not only block access to the crawled URL; it postpones crawling for a certain duration, after which the robot returns to your site.
When your site is overloaded, 503 is the code to issue. Like 500, it works at server level, and when issued it tells the robots:
Everybody loves this site and so much love is overwhelming. Please return later as I'm handling all the love I can handle right now.
The robot will stop crawling when sent a 503 HTTP error and will return later. This does not only block access to the crawled URL but, as with the 500, postpones crawling as well.
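For the curious, this is roughly what a "come back later" answer looks like as a tiny WSGI application in Python. It is a sketch, not what runs on my server. The Retry-After header is a standard HTTP header that hints when to return, but it is my addition here, and both the one hour value and the name `overloaded_app` are assumptions.

```python
def overloaded_app(environ, start_response):
    """Answer every request with 503 and a hint about when to retry."""
    body = b"Overloaded right now, please come back later.\n"
    start_response(
        "503 Service Unavailable",
        [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
            # Assumed value: ask clients to wait one hour (in seconds).
            ("Retry-After", "3600"),
        ],
    )
    return [body]
```

To try it locally, `wsgiref.simple_server.make_server("", 8000, overloaded_app).serve_forever()` from the standard library will serve it on port 8000.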
Even if you issue 403 Forbidden codes on purpose, the robots will keep visiting you. So, in order to block access to some pages of your site, you must use the robots.txt exclusion protocol. That is the only way to keep URLs not just out of the index but also from being hit by robots.
No error code will stop a URL from being visited. Error codes only stop it from being indexed; to save bandwidth and stop the hits on those pages, you must use robots.txt wisely.
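You can test a robots.txt rule before you rely on it. Below is a small sketch using Python's standard `urllib.robotparser`; the /private/ path and the rules themselves are invented for the example, so swap in your own. Keep in mind that robots.txt is only honoured by well-behaved robots.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: keep all robots out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the rules above."""
    return parser.can_fetch(agent, url)


print(allowed("http://domain/private/page"))
print(allowed("http://domain/url2"))
```

The first URL falls under the Disallow rule and is refused; the second is not covered by any rule and stays open to the crawler.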
Hope this has been of some help and shed some more light in the darkness of SEO.