Optimization of the robots.txt file
The Right Commands in robots.txt: Use the correct commands. The most common error is putting a value meant for the “User-agent” field in the “Disallow” field, and vice versa.
Please also note that the original robots.txt standard has no “Allow” command. Content not blocked in a “Disallow” field is considered allowed. The original standard recognizes only two fields: “User-agent” and “Disallow”. More robot-recognizable commands have since been added to make the robots.txt file more Webmaster and robot friendly.
Note: Google was an early adopter of new robots.txt commands and recognizes the “Allow” command, which is now also part of the updated Robots Exclusion Protocol (RFC 9309). Not every robot supports it, so please read the details on the Google site for robots.txt usage before relying on it.
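As a rough illustration of how an “Allow” line behaves in a parser that supports it, the sketch below uses Python's standard-library urllib.robotparser. The file names and the ExampleBot user agent are hypothetical, and other robots may resolve conflicts between “Allow” and “Disallow” differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: open one page inside an otherwise blocked directory.
# "Allow" is listed first because Python's parser applies the first rule
# that matches, while some crawlers prefer the longest matching rule.
rules = """\
User-agent: *
Allow: /concepts/overview.html
Disallow: /concepts/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# ExampleBot is a made-up user agent used only for this check.
overview_ok = parser.can_fetch("ExampleBot", "http://www.domain.com/concepts/overview.html")
draft_ok = parser.can_fetch("ExampleBot", "http://www.domain.com/concepts/draft.html")
print("overview allowed:", overview_ok)
print("draft allowed:", draft_ok)
```

Putting the “Allow” line before the broader “Disallow” line keeps the result the same under both first-match and longest-match interpretations.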
Bad Syntax: Do not put multiple paths in one Disallow line in the robots.txt file. Use a new Disallow line for every file or directory that you want to block access to.
Incorrect robots.txt example:
User-agent: *
Disallow: /concepts/ /links/ /images/
Correct robots.txt example:
User-agent: *
Disallow: /concepts/
Disallow: /links/
Disallow: /images/
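One way to see why the single-line form fails is to feed both versions to a parser. The sketch below uses Python's urllib.robotparser as a stand-in for a robot; the helper name build, the partners.html page, and the ExampleBot user agent are assumptions made for illustration, and real crawlers may handle the malformed line differently.

```python
from urllib.robotparser import RobotFileParser

def build(text):
    """Parse robots.txt text and return the parser (helper for this sketch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

incorrect = build("User-agent: *\nDisallow: /concepts/ /links/ /images/\n")
correct = build(
    "User-agent: *\n"
    "Disallow: /concepts/\n"
    "Disallow: /links/\n"
    "Disallow: /images/\n"
)

# Hypothetical page inside one of the directories meant to be blocked.
url = "http://www.domain.com/links/partners.html"
incorrect_blocks = not incorrect.can_fetch("ExampleBot", url)
correct_blocks = not correct.can_fetch("ExampleBot", url)
print("single-line form blocks the page:", incorrect_blocks)
print("one-path-per-line form blocks the page:", correct_blocks)
```

In this parser the whole value of the malformed line is treated as one path, so none of the three directories ends up blocked.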
Files and Directories: To disallow a specific file, end the path with the file extension and do not add a forward slash at the end. To disallow a directory, end the path with a forward slash. Study the following robots.txt examples:
For file:
User-agent: *
Disallow: /hilltop.html
For Directory:
User-agent: *
Disallow: /concepts/
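Remember that if you want to block access to all files in a directory, you don’t have to list each and every file in robots.txt. You can simply block the directory as shown above. Another common error is leaving out the slashes altogether. Because a Disallow value is matched against the beginning of the URL path, this sends a very different message than intended: “Disallow: /concepts” without the trailing slash also matches /concepts.html, and a value without the leading slash may not match anything at all.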
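The effect of the trailing slash can be checked with a small sketch, again using Python's urllib.robotparser. The blocked helper, the index.html page, and the ExampleBot user agent are hypothetical names introduced only for this example.

```python
from urllib.robotparser import RobotFileParser

def blocked(disallow_value, url_path):
    """Return True if a lone 'Disallow: <value>' rule blocks url_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return not parser.can_fetch("ExampleBot", "http://www.domain.com" + url_path)

checks = [
    ("/hilltop.html", "/hilltop.html"),     # single file
    ("/concepts/", "/concepts/index.html"), # everything inside the directory
    ("/concepts/", "/concepts.html"),       # file outside the directory
    ("/concepts", "/concepts.html"),        # missing trailing slash widens the match
]
for rule, path in checks:
    print(f"Disallow: {rule:15} {path:22} blocked={blocked(rule, path)}")
```

Rules are compared as path prefixes here, which is why dropping the trailing slash catches more than the directory.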
The Right Location for the robots.txt file: Robots look for robots.txt only in the root directory of the site, so a file placed anywhere else will not be read. Make sure that the location is www.domain.com/robots.txt.
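To double-check where a robot would look, the expected location can be derived from any page URL on the site. This is a minimal sketch; the function name robots_url and the sample page URL are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

location = robots_url("http://www.domain.com/concepts/hilltop.html")
print(location)
```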
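Capitalization in robots.txt: Never write your syntax commands in all capitals. The only capitals used per the standard are the initial letters of “User-agent” and “Disallow”. The standard itself treats field names as case-insensitive, but not every robot may be that forgiving. Paths are a different matter: directory and file names are case-sensitive on Unix platforms, so the paths in robots.txt must match the case of the actual URLs.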
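The case sensitivity of paths can be illustrated with Python's urllib.robotparser. The page names and the ExampleBot user agent below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /concepts/"])

# The rule is written in lowercase, so only the lowercase path matches it.
lower_blocked = not parser.can_fetch("ExampleBot", "http://www.domain.com/concepts/page.html")
upper_blocked = not parser.can_fetch("ExampleBot", "http://www.domain.com/Concepts/page.html")
print("/concepts/ blocked:", lower_blocked)
print("/Concepts/ blocked:", upper_blocked)
```

If both spellings exist on the server, each needs its own Disallow line.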
Correct Order for robots.txt: If you want to block access to all robots except one or a few, the specific ones should be mentioned first. Let's study this robots.txt example:
User-agent: *
Disallow: /
User-agent: MSNbot
Disallow:
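In the above case, a robot that acts on the first record matching it would apply the “*” record, so MSNbot would simply leave the site without indexing after reading the first command. Robots that follow the standard closely use the record that names them regardless of its position, but listing the specific robots first is safe in both cases. The safer order is: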
User-agent: MSNbot
Disallow:
User-agent: *
Disallow: /
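The recommended order can be sanity-checked with Python's urllib.robotparser. Note the assumption: this particular parser treats the “*” record as a fallback, so it gives the same answer for both orders and cannot reproduce a robot that stops at the first matching record. The build helper and the ExampleBot user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def build(text):
    """Parse robots.txt text and return the parser (helper for this sketch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

specific_first = build("""\
User-agent: MSNbot
Disallow:

User-agent: *
Disallow: /
""")

wildcard_first = build("""\
User-agent: *
Disallow: /

User-agent: MSNbot
Disallow:
""")

page = "http://www.domain.com/concepts/"
results = {}
for name, parser in (("specific_first", specific_first), ("wildcard_first", wildcard_first)):
    # Tuple is (MSNbot allowed, any other robot allowed).
    results[name] = (parser.can_fetch("MSNbot", page), parser.can_fetch("ExampleBot", page))
    print(name, results[name])
```

A robot with a simpler first-match parser could still behave differently, which is the reason to keep the specific records on top.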
The robots.txt file: Not having a robots.txt file at all generates a 404 error for search engine robots, and the server may answer with the default 404 error page or your customized 404 error page. If this happens seamlessly, it is up to the robot to decide whether what it received is a robots.txt file or an HTML file. Typically this does not cause many problems, but you may not want to risk it. It is always better to put the standard robots.txt file in the root directory than to have none at all.
The standard robots.txt file for allowing all robots to index all pages is:
User-agent: *
Disallow:
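As a quick check that an empty Disallow value blocks nothing, the file above can be run through Python's urllib.robotparser. The sample paths and the ExampleBot user agent are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow:"])

# Hypothetical paths; an empty Disallow value should leave all of them open.
paths = ["/", "/concepts/", "/hilltop.html"]
allowed = [parser.can_fetch("ExampleBot", "http://www.domain.com" + p) for p in paths]
print(dict(zip(paths, allowed)))
```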
Using # Carefully in the robots.txt file: Adding a “#” comment on the same line as a syntax command is not a good idea. Although this is acceptable as per the robots exclusion standard, some robots might misinterpret the line. Putting comments on their own lines is always preferred.
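The two comment styles can be compared in a parser that follows the standard here, such as Python's urllib.robotparser. It accepts both forms, so this sketch only shows that moving comments to their own lines loses nothing; it does not reproduce a robot that mishandles trailing comments. The build helper and the ExampleBot user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def build(text):
    """Parse robots.txt text and return the parser (helper for this sketch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Comments after the commands: allowed by the standard, but riskier.
trailing = build("""\
User-agent: * # all robots
Disallow: /concepts/ # keep robots out of this directory
""")

# Comments on their own lines: the preferred form.
own_line = build("""\
# Keep all robots out of the concepts directory
User-agent: *
Disallow: /concepts/
""")

url = "http://www.domain.com/concepts/page.html"
trailing_blocks = not trailing.can_fetch("ExampleBot", url)
own_line_blocks = not own_line.can_fetch("ExampleBot", url)
print("trailing comments block the page:", trailing_blocks)
print("own-line comments block the page:", own_line_blocks)
```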