What is the robots.txt file?
The robots.txt file is an ASCII text file that has specific instructions for search engine robots about specific content that they are not allowed to index. These instructions are the deciding factor of how a search engine indexes your website’s pages. The universal address of the robots.txt file is: www.domain.com/robots.txt . This is the first file that a robot visits. It picks up instructions for indexing the site content and follows them. This file contains two text fields. Lets study this example :
User-agent: *
Disallow:
The User-agent field is for specifying robot name for which the access policy follows in the Disallow field. Disallow field specifies URLs which the specified robots have no access to. An example :
User-agent: *
Disallow: /
Here “*” means all robots and “/ ” means all URLs. This is read as, “ No access for any search engine to any URL” Since all URLs are preceded by “/ ” so it bans access to all URLs when nothing follows after “/ ”. If partial access has to be given, only the banned URL is specified in the Disallow field. Lets consider this example :
# Research access for Googlebot.
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /concepts/new/
Here we see that both the fields have been repeated. Multiple commands can be given for different user agents in different lines. The above commands mean that all user agents are banned access to /concepts/new/ except Googlebot which has full access. Characters following # are ignored up to the line termination as they are considered to be comments.