Write a Robots.txt File
One of the most fundamental steps when optimizing a website is writing a
robots.txt file. It tells spiders which parts of the site are public and useful
to include in the search engine indexes and which are not. A poorly written
robots.txt file, on the other hand, can stop the search spiders from crawling
and indexing your website properly. In this article I will show you how to make
sure everything works correctly.
Some SEOs say that using a robots.txt file will not improve your search engine
rankings. I disagree with this point: many search engines have publicly
recommended using a robots.txt file. Here is a quote taken from Google:
"Make use of the robots.txt file on your web server. This file tells crawlers
which directories can or cannot be crawled. Make sure it's current for your
site so that you don't accidentally block the Googlebot crawler."
Also, if you read the stats file on your web hosting server, you will usually
find the URL of your robots.txt being requested. Search bots ask for robots.txt
before they crawl anything else. If the file is missing, well-behaved spiders
generally assume that nothing is off limits, so the file is the only place
where you can state your preferences. Let us now see how to build a robots.txt
file.
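The URL you will see in your stats is always /robots.txt at the root of the
host, whichever page the spider goes on to visit. A minimal Python sketch of
that rule follows; the function name robots_url and the example.com address are
placeholders chosen for illustration, not part of any crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site.
print(robots_url("https://www.example.com/blog/2020/post.html?page=2"))
```

A robots.txt file saved in a subfolder would never be found this way, which is
why the file has to sit next to your index page.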
Write a Robots.txt File - How Do I Build a Robots.txt?
After opening Notepad (or another text editor), save the blank file as
robots.txt. The file must be placed at the root level of your web server, in
other words in the same folder as your index page (index.php or index.html).
The text file is actually a list. Each entry consists of two fields, or lines
of instruction:
The first line is the User-agent line.
This is the line where you specify which search spider the instructions that
follow apply to.
The second line is the directive line, or Disallow field.
This is the line you use to block spiders from folders or files.
Here a question may arise: why should we disallow certain folders or files?
Some folders may be private or reserved for registered users or visitors, and
the Disallow line lets you ask search spiders not to crawl the pages in the
folders you have specified.
To write the robots.txt file, you start by addressing specific search engines.
The User-agent line starts as:
User-agent:
Adding a specific search engine's spider name here gives that spider notice
that it is to follow the next line for instruction, i.e.:
User-agent: googlebot
Now you tell googlebot how to crawl your pages: which pages may be spidered and
which may not. This line tells googlebot that it is to follow the next line's
directions on how to proceed through your website, or to leave altogether.
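To see that a named User-agent line addresses only that spider, the rules can
be fed to Python's standard urllib.robotparser module. This is only a sketch:
the /downloads/ rule, the file name report.pdf and the spider name otherbot are
made up for the example, and real crawlers may match names differently.

```python
from urllib.robotparser import RobotFileParser

# A record addressed to googlebot only.
rules = """\
User-agent: googlebot
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "/downloads/report.pdf"  # hypothetical file
googlebot_allowed = parser.can_fetch("googlebot", url)
otherbot_allowed = parser.can_fetch("otherbot", url)

print("googlebot allowed:", googlebot_allowed)
print("otherbot allowed:", otherbot_allowed)
```

Only googlebot is turned away here; a spider that finds no record addressed to
it is free to fetch the file.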
The second line, known as the directive, is written as:
Disallow:
By adding a folder after the Disallow statement, you tell the search spider to
skip that folder and move on to others where there is no restriction.
Disallow: /downloads/
You can also disallow specific files this way. The path must start with a
slash:
Disallow: /cheeseyporn.htm
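A Disallow value is matched against the start of the requested path, so a
folder rule covers everything beneath it while a file rule covers only that
file. The sketch below checks this with Python's urllib.robotparser; the paths
/private.htm, /downloads/setup.zip and /index.html and the spider name anybot
are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one folder and one single file are off limits.
rules = """\
User-agent: *
Disallow: /downloads/
Disallow: /private.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether an arbitrary spider may fetch each path.
checks = {
    path: parser.can_fetch("anybot", path)
    for path in ("/downloads/setup.zip", "/private.htm", "/index.html")
}

for path, allowed in checks.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Note that every path in the rules begins with a slash. A value written without
it would not match any path in this parser, and the file would stay crawlable.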
If you leave the Disallow directive line blank, this indicates that ALL files
may be retrieved and indexed by the specified robot(s). The following lets all
robots index all files.
User-agent: *
Disallow:
And vice versa, you can just as easily keep all robots out.
User-agent: *
Disallow: /
Since the root directory is blocked, none of the other folders and files can be
crawled. Your site will generally drop out of the search engines once they read
your robots.txt and update their indexes.
But I think no one would want their site removed from the search engines,
unless you are a giant like Yahoo or Microsoft who might feel that a lot of
bandwidth is wasted by letting spiders crawl the site for regular updates.
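The two extremes, an empty Disallow and a Disallow of the root, are easy to
confuse, so it is worth checking them side by side. Here is a small sketch with
Python's urllib.robotparser; the helper build_parser, the spider name anybot
and the page /index.html are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(text):
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Empty Disallow: nothing is blocked.
allow_all = build_parser("User-agent: *\nDisallow:\n")
# Disallow of the root: everything is blocked.
block_all = build_parser("User-agent: *\nDisallow: /\n")

page = "/index.html"  # hypothetical page
allowed_when_empty = allow_all.can_fetch("anybot", page)
allowed_when_root = block_all.can_fetch("anybot", page)

print("empty Disallow lets the spider in:", allowed_when_empty)
print("Disallow: / lets the spider in:", allowed_when_root)
```

A single slash is all that separates a fully open site from a fully closed one,
which is a good reason to test the file before uploading it.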
You can provide multiple Disallow lines for one User-agent. In the following
example, all spiders are told not to index the downloads and the images
directories.
User-agent: *
Disallow: /downloads/
Disallow: /images/
If the pages are cleanly coded, a correct robots.txt will often result in
improved rankings in all three of the major search engines.
After you have written your robots.txt file and placed it on your server, you
should validate it with one of the robots.txt validation tools available
online.
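A few common slips can also be caught on your own machine before you upload the
file. The following Python sketch is not a substitute for a real validation
tool: the function name lint_robots, the list of accepted fields and the
wording of the messages are all assumptions made for this illustration, and it
checks only the basics covered in this article.

```python
# Field names this sketch accepts; real crawlers may understand others.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of (line_number, message) problems found in the text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("disallow", "allow"):
            if not seen_user_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


good = "User-agent: *\nDisallow: /downloads/\nDisallow: /images/\n"
bad = "User-agent: *\nDisallow: private.htm\n"  # hypothetical file, slash missing

print(lint_robots(good))
print(lint_robots(bad))
```

An empty list means the sketch found nothing to complain about; it does not
prove that the spiders will read the file the way you intended.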