How to create a robots.txt file

What is a robots.txt file?

A robots.txt file is a plain text file that tells search engine crawlers which pages of your website they should not crawl. It is a request rather than a lock: well-behaved search engines will stay away from the pages you list. It is placed on your web server and can be created in Notepad on your computer.

Why do I need it?

Very simply, search engines look for your robots.txt file when they visit your site. So even if you don’t need to hide any pages, it is still worth having one as part of your Search Engine Optimization. Having a robots.txt file on your website is better than not having one at all.

But you may need a robots.txt file because:

-you are building parts of your site and do not want your unfinished work to appear in search engines

-you have information on parts of your site that is of no interest to the public

-you want parts of your web server’s directories hidden, like cgi-bin, which hold files that don’t need to be looked at

-you want to select which search engines to block or allow. This is worth doing because the only purpose of certain small search engines is to collect email addresses for spamming.

But overall, you need a robots.txt file because the larger search engines, like Google, Bing, and Yahoo, look for this file first.

How can you create an effective robots.txt file for your site?

Simply create a text document and save the new document as robots.txt. Do not use an HTML editor to create the file unless it has the ability to create a plain text document (ASCII). Most computers will allow you to create a text document using Notepad.

  • Right click on your desktop
  • Choose new
  • Choose text document
  • Open the document you just created
  • Insert instructions to robots
  • Click on save as
  • Save document as robots.txt
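If you would rather script this step than use Notepad, the same plain-text file can be written with a few lines of Python. This is only a minimal sketch: the two rules shown are placeholders chosen for illustration, not rules your site necessarily needs.

```python
from pathlib import Path

# Placeholder rules for illustration only; replace them with your own.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

# Write one directive per line as plain ASCII text, the format robots expect.
robots_path = Path("robots.txt")
robots_path.write_text("\n".join(rules) + "\n", encoding="ascii")

# Read the file back to confirm what was saved.
saved_text = robots_path.read_text(encoding="ascii")
print(saved_text)
```

The file still has to be uploaded to the root directory of your domain afterwards.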

Once you have uploaded the file to the root directory of your domain, it’s a good idea to use a robots.txt validator to confirm that everything is correct. You can search Google for a free robots.txt validator or try the one listed below.

http://tool.motoricerca.info/robots-checker.phtml

If you would like more information, visit the official Robots Exclusion Protocol website: http://www.robotstxt.org

_____________________________________________________________________

Robots.txt file – Codes to use

There is nothing difficult about creating a basic robots.txt file. It can be created using Notepad or your favorite text editor. Each entry has just two lines:

User-Agent: [Spider or Bot name]
Disallow: [Directory or File Name]

These lines can be repeated for each directory or file you want to exclude, or for each spider or bot you want to exclude.

Here are a few examples that will make it clearer.

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called ‘private’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm
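One way to check an entry like this before uploading it is Python's standard `urllib.robotparser` module. The sketch below parses the two lines above and asks whether a given bot may fetch the file; the name `SomeOtherBot` is made up for the comparison. Real crawlers use their own parsers, so treat this as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /private/privatefile.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is named in the entry, so the file is off limits to it.
googlebot_blocked = not parser.can_fetch("Googlebot", "/private/privatefile.htm")

# "SomeOtherBot" is a hypothetical bot with no entry, so nothing restricts it.
other_bot_allowed = parser.can_fetch("SomeOtherBot", "/private/privatefile.htm")

print(googlebot_blocked, other_bot_allowed)
```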

2. Exclude a section of your site from all spiders and bots

You are building a new section of your site in a directory called ‘newsection’ and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use a wildcard character, ‘*’, to exclude them all.

User-Agent: *
Disallow: /newsection/

Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.
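The effect of the trailing slash can be seen with Python's standard `urllib.robotparser`, which compares each Disallow value against the start of the requested path. In this sketch the file names and the bot name `AnyBot` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /newsection/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths: one inside the directory, two outside it.
inside_dir = parser.can_fetch("AnyBot", "/newsection/draft.html")
similar_name = parser.can_fetch("AnyBot", "/newsection.html")
elsewhere = parser.can_fetch("AnyBot", "/index.html")

print(inside_dir, similar_name, elsewhere)
```

Only the path that begins with `/newsection/` is refused. A page such as `/newsection.html` merely shares part of the name, so the rule does not cover it.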

3. Allow all spiders to index everything

Once again you can use the wildcard, ‘*’, to let all spiders know they are welcome. Just leave the second line, Disallow, empty: that means nothing is disallowed.

User-agent: *
Disallow:

4. Allow no spiders to index any part of your site

This requires just a tiny change from the command above – be careful!

User-agent: *
Disallow: /

If you use this command while building your site, don’t forget to remove it once your site is live!
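Because examples 3 and 4 differ by a single character, it can be worth testing which one you actually have. Here is a small sketch using Python's standard `urllib.robotparser`; the helper `is_allowed` and the bot name `AnyBot` are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "AnyBot") -> bool:
    """Return True if the given rules let the agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


allow_everything = "User-agent: *\nDisallow:\n"   # example 3: empty Disallow
block_everything = "User-agent: *\nDisallow: /\n"  # example 4: a single slash

open_site = is_allowed(allow_everything, "/index.html")
closed_site = is_allowed(block_everything, "/index.html")

print(open_site, closed_site)
```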

5. Combine several rules in one file

Let’s take this one in stages!

1. First you would ban all search engines from the directories you do not want indexed at all:

User-agent: *
Disallow: /cgi-bin/
Disallow: /_borders/
Disallow: /_derived/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_map/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/

It is not necessary to repeat the User-agent line for each directory; it is quite acceptable to list the Disallow lines under one User-agent line as above.

2. The next thing we want to do is to prevent AltaVista from getting in there at all. The AltaVista bot is called Scooter.

User-Agent: Scooter
Disallow: /

This entry overrides the first entry for Scooter only. The first entry allowed all bots in everywhere except the listed directories. We are now saying that all bots can index the whole site apart from the directories specified in 1 above, except Scooter, which can index nothing. Keep in mind that a bot which finds an entry under its own name follows only that entry and ignores the ‘*’ entry, so repeat under a named bot any Disallow lines from 1 that should still apply to it.

3. Now you want to keep Google away from your images. Google grabs images with a bot called Googlebot-Image, which is separate from the one that indexes pages generally. You have a couple of choices here:

User-Agent: Googlebot-Image
Disallow: /images/

That will work if you are very organized and keep all your images strictly in the images folder.

User-Agent: Googlebot-Image
Disallow: /

This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.

4. Finally, you have two pages called content1.html and content2.html, which are optimized for Google and Lycos respectively. So, you want to hide content1.html from Lycos (the Lycos spider is called T-Rex):

User-Agent: T-Rex
Disallow: /content1.html

and content2.html from Google.

User-Agent: Googlebot
Disallow: /content2.html
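Putting the stages together, the finished file can be checked with Python's standard `urllib.robotparser`. This is a sketch under a few assumptions: the directory list from stage 1 is shortened to two entries, the paths being tested are invented, and `SomeOtherBot` is a made-up name standing in for any bot without its own entry.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the combined file, with a blank line between entries.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /_private/

User-agent: Scooter
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: T-Rex
Disallow: /content1.html

User-agent: Googlebot
Disallow: /content2.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

results = {
    "scooter_home": parser.can_fetch("Scooter", "/index.html"),
    "image_bot_logo": parser.can_fetch("Googlebot-Image", "/images/logo.gif"),
    "trex_content1": parser.can_fetch("T-Rex", "/content1.html"),
    "googlebot_content2": parser.can_fetch("Googlebot", "/content2.html"),
    # A bot with no entry of its own falls back to the '*' entry.
    "other_cgi": parser.can_fetch("SomeOtherBot", "/cgi-bin/script.pl"),
    # Googlebot has its own entry, so the '*' entry no longer applies to it.
    "googlebot_cgi": parser.can_fetch("Googlebot", "/cgi-bin/script.pl"),
}

for name, allowed in results.items():
    print(name, allowed)
```

The last check is the one to watch: a bot that has its own entry follows only that entry, so Googlebot is not kept out of /cgi-bin/ here. If the stage 1 directories should stay hidden from Googlebot or T-Rex as well, repeat those Disallow lines under their entries. Python's parser matches bot names by substring in file order, which is why the Googlebot-Image entry is listed before the Googlebot entry; other parsers may resolve the names differently.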



