All about Robots.txt and Robots META tag

What is robots.txt?


 The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to CGI, private, and temporary directories, for example, if they do not want pages in those areas indexed.
In June 2008, the search engine companies Yahoo!, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. They added an Allow directive, wildcards in URL paths, and a Sitemap link to robots.txt for ease of crawling, along with IP authentication to identify genuine search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes.
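For non-HTML files such as PDFs, which cannot carry a robots META tag, the X-Robots-Tag response header mentioned above does the same job. Below is a minimal, illustrative Python sketch of a server adding that header; the handler name, port, and placeholder body are my own choices, not part of any standard:

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        # Same effect for a non-HTML document as <META name="robots" content="noindex">
        self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4 ...")  # placeholder body, not a real PDF

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoIndexHandler).serve_forever()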

How can it be used?

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.
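For illustration, here is a minimal Python sketch of that behavior using the standard library's urllib.robotparser; the crawler name "MyCrawler" and the example.com URLs are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the file; a missing file means "no specific instructions"

# A polite crawler checks before requesting any other URL on the site.
if rp.can_fetch("MyCrawler", "http://www.example.com/shop/index.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")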


A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data.


For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.

As a live example, www.google.com/robots.txt is Google's own robots.txt file.


How to create a /robots.txt file?

Where to put it?

The short answer: in the top-level directory of your web server.
The longer answer: When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.
For example, for "http://www.example.com/shop/index.html", it will remove "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
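A small Python sketch of that rewriting, using the standard library's urllib.parse (the example URL is only an illustration):

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query, and fragment, then append /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/shop/index.html"))
# -> http://www.example.com/robots.txt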
Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

What to put in it?

The "/robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded. 

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records. 

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif". 

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /
To exclude all files except one
This is a bit awkward under the original standard, which had no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file at the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly list each disallowed page:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

To reduce the number of requests a search crawler makes on your site
If search crawlers occasionally generate high traffic on your site, you can add a Crawl-delay parameter to the robots.txt file to set the minimum interval, in seconds, between crawler requests to your website. To do this, add the following syntax to your robots.txt file:

User-agent: *
Crawl-delay: 10

Individual crawler sections override the settings that are specified in * sections. If you've specified Disallow settings for all crawlers, you must add the Disallow settings to the search crawler section you create in the robots.txt file. For example, your robots.txt file might have the following:

User-agent: *
Disallow: /private/

If you add a section for a specific search crawler, you must repeat any Disallow settings in that section.
For example, to add a crawl delay for Yahoo's Slurp crawler while keeping the same Disallow rule:

User-agent: Slurp
Crawl-delay: 10
Disallow: /private/
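As an illustration of how a crawler might honor Crawl-delay, here is a short Python sketch using urllib.robotparser (crawl_delay() requires Python 3.6 or later; "MyCrawler" and the URLs are placeholders):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay applies
for url in ["http://www.example.com/a.html", "http://www.example.com/b.html"]:
    if rp.can_fetch("MyCrawler", url):
        print("fetching", url)
        time.sleep(delay)  # wait between requests, as the site asked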

Disadvantages of using robots.txt on websites:


Unfortunately, there are two big problems with robots.txt:

  • Rate control

    You can only specify what paths a bot is allowed to spider. Even allowing just the plain page area can be a huge burden when a single spider is requesting two or three pages per second across two hundred thousand pages.

    Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
  • Evil bots

    Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

    More generally, request throttling can stop such bots without requiring your repeated intervention (see the sketch below). 
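This is only an illustration of the idea: a minimal in-memory, per-IP throttle in Python, suitable for embedding in a web application. The window size and request limit are arbitrary example values, not recommendations.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # arbitrary example window
MAX_REQUESTS = 5      # arbitrary example limit per window
_recent = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip):
    now = time.monotonic()
    hits = _recent[client_ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()   # drop hits that have fallen out of the window
    if len(hits) >= MAX_REQUESTS:
        return False     # caller should answer with e.g. 429 Too Many Requests
    hits.append(now)
    return True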
Robots.txt architecture:
  • The exact mixed-case directive names may be required by some robots, so capitalize Allow: and Disallow:, and remember the hyphen in User-agent:.
  • An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, it may not check the general all-robots section, so repeat the general directives there.
  • The user agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. It should not matter how the name itself is capitalized.
  • Disallow tells robots not to crawl anything which matches the following URL path.
  • Allow is a newer directive; older robot crawlers will not recognize it.
  • URL paths are often case sensitive, so be consistent with the site's capitalization.
  • The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL (see the sketch after this list).
  • In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, an implied right wildcard.
  • One or more wildcard (*) characters can now appear in a URL path, but may not be recognized by older robot crawlers.
  • Wildcards do not lengthen a path: if a wildcard directive path is shorter, as written, than one without a wildcard, the fully spelled-out path will generally override the wildcard one.
  • Sitemap is a new directive giving the location of the Sitemap file.
  • A blank line indicates a new user agent section.
  • A hash mark (#) indicates a comment.
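To make the longest-match rule concrete, here is a rough Python sketch (my own illustration, not a complete REP implementation; wildcard expansion and other details are omitted):

def is_allowed(path, rules):
    # rules: list of (directive, prefix) pairs such as ("Disallow", "/private")
    best = ("Allow", "")  # everything is allowed by default
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > len(best[1]):
            best = (directive, prefix)  # the longer matching path wins
    return best[0] == "Allow"

rules = [("Disallow", "/private"), ("Allow", "/private/policy.html")]
print(is_allowed("/private/stuff.html", rules))   # False: only /private matches
print(is_allowed("/private/policy.html", rules))  # True: the longer Allow path wins
print(is_allowed("/index.html", rules))           # True: nothing matches, default Allow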

Some example robots.txt entries (see google.com/robots.txt for a live example):

Entry:
User-agent: *
Disallow:
Meaning: Because nothing is disallowed, everything is allowed for every robot.

Entry:
User-agent: mybot
Disallow: /
Meaning: Specifically, the mybot robot may not index anything, because the root path (/) is disallowed.

Entry:
User-agent: *
Allow: /
Meaning: For all user agents, allow everything (2008 REP update).

Entry:
User-agent: BadBot
Allow: /About/robot-policy.html
Disallow: /
Meaning: The BadBot robot can see the robot policy document, but nothing else. All other user agents are by default allowed to see everything. This only protects a site if "BadBot" follows the directives in robots.txt.

Entry:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private
Meaning: In this example, all robots can visit the whole site, with the exception of the two directories mentioned and any path that starts with "private" at the host root directory, including items in privatedir/mystuff and the file privateer.html.

Entry:
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /*/private/*
Meaning: The blank line indicates a new "record" - a new user agent command. BadBot should just go away. All other robots can see everything except any subdirectory named "private" (using the wildcard character).

Entry:
User-agent: WeirdBot
Disallow: /links/listing.html
Disallow: /tmp/
Disallow: /private/

User-agent: *
Allow: /
Disallow: /temp*
Allow: *temperature*
Disallow: /private/
Meaning: This keeps WeirdBot from visiting the listing page in the links directory, the tmp directory, and the private directory. All other robots can see everything except the temp directories or files, but should crawl files and directories named "temperature", and should not crawl private directories. Note that the robots will use the longest matching string, so temps and temporary will match the Disallow, while temperatures will match the Allow.
If you think this is inefficient, you're right.



 

Robots <META> tag:

 In addition to server-wide robot control using robots.txt, web page creators can also specify that certain pages should not be indexed by search engine robots, or that the links on the page should not be followed by robots. The Robots META tag, placed in the HTML <HEAD> section of a page, can specify either or both of these actions.  

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

There are two important considerations when using the robots <META> tag:

  • Robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities, and the email address harvesters used by spammers, will pay no attention to it.
  • The NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot will find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.

 Architecture:

The default values are now assumed to be INDEX, FOLLOW, ARCHIVE, ODP, SNIPPET and YDIR. There is no actual need to include these, unless someone on your internal web team needs reminding.

These values are usually combined into one line for all robots. If they don't understand a directive, they will just ignore it. 


In general, it's better to use the same directives for all robots. While it's possible to include several lines with several robot crawler User-agent names, that might be an indicator of the bad kind of search cloaking (hiding the real page text from the search engine).


If you add Robots META tags to a framed site, be sure to include them on both the enclosing page and the frame content pages. The frameset could have NOINDEX, FOLLOW to avoid picking up any stray text on the frameset page.

Example:

<HEAD>
  <title>Should Not Be Indexed Or Followed</title>
  <META name="robots" content="NOINDEX,NOFOLLOW" />
</HEAD>
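As an illustration of how a crawler might read these directives, here is a short Python sketch using the standard library's html.parser (my own example, not any search engine's actual code):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            for value in (attrs.get("content") or "").split(","):
                self.directives.add(value.strip().lower())

page = '<head><meta name="robots" content="NOINDEX,NOFOLLOW"></head>'
parser = RobotsMetaParser()
parser.feed(page)
print("noindex" in parser.directives)   # True: do not add this page to the index
print("nofollow" in parser.directives)  # True: do not follow links on this page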

META tags and their operations:


Task: The indexer should ignore the content; the robot should follow links.
Entry: <META name="ROBOTS" content="NOINDEX">
Notes: Use this for pages with many links on them but no useful data, such as a site map. Because "follow" is the default, you don't have to include it.

Task: The indexer should include the content; the robot should not follow links.
Entry: <META name="ROBOTS" content="NOFOLLOW,INDEX">
Notes: Use this for pages which have useful content but outdated or problematic links.

Task: The indexer should ignore the content; the robot should not follow links.
Entry: <META name="ROBOTS" content="NOINDEX,NOFOLLOW">
Notes: This is for sections of a site that shouldn't be indexed and shouldn't have links followed. Putting access control in place, such as a password, is much better for security.

Task: The indexer should include the content; the robot should follow links.
Entry: <META name="ROBOTS" content="INDEX,FOLLOW">
Notes: This is the default behavior: you don't have to include these.

Task: Search results pages should not show a "cache" link.
Entry: <META name="ROBOTS" content="NOARCHIVE">
Notes: Useful if the content changes frequently: headlines, auctions, etc. The search engine still archives the information, but won't show it in the results.

Task: Search results pages should not display the Open Directory Project (ODP) title and description for the page.
Entry: <META name="ROBOTS" content="NOODP">
Notes: Danny Sullivan provides good examples of how outdated descriptions and even titles show up when the ODP content is used for search results. This encourages search engines to use the page title tag, matching terms in context, or the META Description tag content instead of the ODP content, which may be misleading or outdated.

Task: Search results pages should not display the Yahoo Directory title and description for the page.
Entry: <META name="ROBOTS" content="NOYDIR"> (Yahoo Slurp robot only)
Notes: Same as above, only for the Yahoo Directory; the other search indexers will ignore it.

Task: Search results pages should not display any description or text context for this page, just the title.
Entry: <META name="ROBOTS" content="NOSNIPPET">
Notes: Encourages the search engines to use the title only, and to suppress the "cache" link. Might be useful if the site has special plus box listings in search results, but otherwise, not so much.
