Robots.txt

Webmasters, website owners, and developers usually give search engines and other bots instructions about which pages of their site may be accessed; this mechanism is called the Robots Exclusion Protocol.

How does it work?

Suppose that you have submitted your website to a search engine for indexing. While crawling your website, a robot wants to visit the page https://www.yoursitename.com/admin.html. Before it does so, it first checks for restrictions at https://www.yoursitename.com/robots.txt and finds:

User-agent: *
Disallow: /

This tells all search bots that they are not allowed to visit any page of this website.
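The check a well-behaved robot performs can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration: the robot name `ExampleBot` is made up, and the rules are parsed from a string rather than fetched from a live server.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as the robot would receive them.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name used only for this sketch.
page = "https://www.yoursitename.com/admin.html"
allowed = parser.can_fetch("ExampleBot", page)
print(allowed)  # False: every path starts with "/", so everything is disallowed
```

A real crawler would call `set_url()` and `read()` on the parser to download the file first, instead of parsing a string.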

There are two very important facts you should know when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and the email address harvesters used by spammers will pay no attention to it.
  • The /robots.txt file is publicly available, so anyone can see which sections or pages of your website you don’t want robots to crawl or index.

So never use /robots.txt to hide important information; anyone curious enough can simply open the robots.txt file and read it.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.

For example, for “http://www.example.com/shop/index.html”, it will remove “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.
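The same substitution can be written in a few lines of Python using the standard `urllib.parse` module. The helper name `robots_url` is an invention for this sketch, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Replace the path (and any query or fragment) of page_url with /robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop everything from the first single slash on.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/index.html"))
# http://www.example.com/robots.txt
```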

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.

What to put in it

The “/robots.txt” file is a plain text file that contains one or more instructions, for example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /js/

In this example, the webmaster has instructed all bots not to crawl, index, or visit these three directories.

Note:

1.) You can’t restrict several directories or pages with a single Disallow line such as “Disallow: /admin/ /user/”. You have to write a separate Disallow line for each one.

2.) The ‘*’ in the User-agent field is a special value meaning “any robot”.
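To see why note 1 matters, the sketch below compares separate Disallow lines with a combined line, using Python's standard `urllib.robotparser`. The helper `blocked_paths`, the robot name `ExampleBot`, and the test paths are all made up for illustration, and other robots may treat the malformed line differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: list[str]) -> list[str]:
    """Return the paths that a generic robot may not fetch under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "ExampleBot" is a hypothetical robot name; it falls under "User-agent: *".
    return [p for p in paths if not parser.can_fetch("ExampleBot", p)]


# Hypothetical paths used only for this illustration.
paths = ["/admin/login.html", "/user/profile.html", "/index.html"]

separate_lines = """\
User-agent: *
Disallow: /admin/
Disallow: /user/
"""

combined_line = """\
User-agent: *
Disallow: /admin/ /user/
"""

print(blocked_paths(separate_lines, paths))  # both directories are blocked
print(blocked_paths(combined_line, paths))   # nothing is blocked by this parser
```

With this parser, the combined line is read as one odd path containing a space, so neither directory ends up protected.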

Here are some examples to help you understand it better:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty “/robots.txt” file, or don’t use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /Admin/
Disallow: /junk/

To exclude a single robot
User-agent: ManhonBot
Disallow: /

To allow a single robot
User-agent: Bing
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

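This is a bit awkward, as the original standard has no “Allow” field (many modern crawlers do support an “Allow” line, but it was not part of the original protocol). The easy way is to put all files to be disallowed into a separate directory, say “content”, and leave the one file in the level above this directory: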

User-agent: *
Disallow: /~Doe/content/

Alternatively, you can explicitly list every disallowed page:

User-agent: *
Disallow: /~Doe/junkies.html
Disallow: /~Doe/foodies.html
Disallow: /~Doe/Barbary.html
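Before publishing rules like the ones above, it can help to check how they apply to different robots. The sketch below does this for the “allow a single robot” example with Python's standard `urllib.robotparser`. The page URL is a placeholder, and the result shows how this particular parser matches names, which may differ from how a given search engine does it.

```python
from urllib.robotparser import RobotFileParser

# The "allow a single robot" example: Bing may go anywhere, everyone else nowhere.
rules = """\
User-agent: Bing
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder page URL used only for this illustration.
url = "http://www.example.com/index.html"

bing_allowed = parser.can_fetch("Bing", url)        # matches the "Bing" record
other_allowed = parser.can_fetch("ManhonBot", url)  # falls back to the "*" record

print(bing_allowed)   # True
print(other_allowed)  # False
```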

Source: robotstxt.org

 

 
