All about Robots.txt and Robots META tag

What is robots.txt?


 The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.

Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to CGI, private, and temporary directories, for example, if they do not want pages in those areas indexed.
In June 2008, the search engine companies Yahoo!, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. They added an Allow directive, wildcards in URL paths, and a Sitemap link to robots.txt for ease of crawling, along with IP authentication to identify genuine search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes.
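For non-HTML files such as PDFs, which cannot carry a robots META tag, the X-Robots-Tag response header mentioned above does the same job. Below is a minimal, illustrative Python sketch of a server adding that header; the handler name, port, and placeholder body are my own choices, not part of any standard:

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        # Same effect for a non-HTML document as <META name="robots" content="noindex">
        self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4 ...")  # placeholder body, not a real PDF

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoIndexHandler).serve_forever()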

How can it be used?

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.
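For illustration, here is a minimal Python sketch of that behavior using the standard library's urllib.robotparser; the crawler name "MyCrawler" and the example.com URLs are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the file; a missing file means "no specific instructions"

# A polite crawler checks before requesting any other URL on the site.
if rp.can_fetch("MyCrawler", "http://www.example.com/shop/index.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")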


A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data.


For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.

As a live example, www.google.com/robots.txt is Google's own robots.txt file.


How to create a /robots.txt file?

Where to put it?

The short answer: in the top-level directory of your web server.
The longer answer: When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.
For example, for "http://www.example.com/shop/index.html", it will remove "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
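A small Python sketch of that rewriting, using the standard library's urllib.parse (the example URL is only an illustration):

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query, and fragment, then append /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/shop/index.html"))
# -> http://www.example.com/robots.txt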
Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

What to put in it?

The "/robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded. 

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records. 

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif". 

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /
To exclude all files except one
This is a bit awkward under the original standard, which had no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file at the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly list each disallowed page:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

To reduce the number of requests a search crawler makes on your site
If search crawlers occasionally generate high traffic on your site, you can add a Crawl-delay parameter to the robots.txt file to set the minimum interval, in seconds, between crawler requests to your website. To do this, add the following syntax to your robots.txt file:

User-agent: *
Crawl-delay: 10

Individual crawler sections override the settings that are specified in * sections. If you've specified Disallow settings for all crawlers, you must add the Disallow settings to the search crawler section you create in the robots.txt file. For example, your robots.txt file might have the following:

User-agent: *
Disallow: /private/

If you add a section for a specific search crawler, you must repeat any Disallow settings in that section.
For example, to add a crawl delay for Yahoo's Slurp crawler while keeping the same Disallow rule:

User-agent: Slurp
Crawl-delay: 10
Disallow: /private/
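As an illustration of how a crawler might honor Crawl-delay, here is a short Python sketch using urllib.robotparser (crawl_delay() requires Python 3.6 or later; "MyCrawler" and the URLs are placeholders):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay applies
for url in ["http://www.example.com/a.html", "http://www.example.com/b.html"]:
    if rp.can_fetch("MyCrawler", url):
        print("fetching", url)
        time.sleep(delay)  # wait between requests, as the site asked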

Disadvantages of using robots.txt on websites:


Unfortunately, there are two big problems with robots.txt:

  • Rate control

    You can only specify what paths a bot is allowed to spider. Even allowing just the plain page area can be a huge burden when a single spider is requesting two or three pages per second across two hundred thousand pages.

    Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
  • Evil bots

    Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

    More generally, request throttling can stop such bots without requiring your repeated intervention (see the sketch below). 
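This is only an illustration of the idea: a minimal in-memory, per-IP throttle in Python, suitable for embedding in a web application. The window size and request limit are arbitrary example values, not recommendations.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # arbitrary example window
MAX_REQUESTS = 5      # arbitrary example limit per window
_recent = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip):
    now = time.monotonic()
    hits = _recent[client_ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()   # drop hits that have fallen out of the window
    if len(hits) >= MAX_REQUESTS:
        return False     # caller should answer with e.g. 429 Too Many Requests
    hits.append(now)
    return True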
Robots.txt architecture:
  • The exact mixed-case directive names may be required by some robots, so capitalize Allow: and Disallow:, and remember the hyphen in User-agent:.
  • An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, it may not check the general all-robots section, so repeat the general directives there.
  • The user agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. It should not matter how the name itself is capitalized.
  • Disallow tells robots not to crawl anything which matches the following URL path.
  • Allow is a newer directive; older robot crawlers will not recognize it.
  • URL paths are often case sensitive, so be consistent with the site's capitalization.
  • The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL (see the sketch after this list).
  • In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, an implied right wildcard.
  • One or more wildcard (*) characters can now appear in a URL path, but may not be recognized by older robot crawlers.
  • Wildcards do not lengthen a path: if a wildcard directive path is shorter, as written, than one without a wildcard, the fully spelled-out path will generally override the wildcard one.
  • Sitemap is a new directive giving the location of the Sitemap file.
  • A blank line indicates a new user agent section.
  • A hash mark (#) indicates a comment.
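To make the longest-match rule concrete, here is a rough Python sketch (my own illustration, not a complete REP implementation; wildcard expansion and other details are omitted):

def is_allowed(path, rules):
    # rules: list of (directive, prefix) pairs such as ("Disallow", "/private")
    best = ("Allow", "")  # everything is allowed by default
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > len(best[1]):
            best = (directive, prefix)  # the longer matching path wins
    return best[0] == "Allow"

rules = [("Disallow", "/private"), ("Allow", "/private/policy.html")]
print(is_allowed("/private/stuff.html", rules))   # False: only /private matches
print(is_allowed("/private/policy.html", rules))  # True: the longer Allow path wins
print(is_allowed("/index.html", rules))           # True: nothing matches, default Allow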

Some example robots.txt entries (see google.com/robots.txt for a live example):

Entry:
User-agent: *
Disallow:
Meaning: Because nothing is disallowed, everything is allowed for every robot.

Entry:
User-agent: mybot
Disallow: /
Meaning: Specifically, the mybot robot may not index anything, because the root path (/) is disallowed.

Entry:
User-agent: *
Allow: /
Meaning: For all user agents, allow everything (2008 REP update).

Entry:
User-agent: BadBot
Allow: /About/robot-policy.html
Disallow: /
Meaning: The BadBot robot can see the robot policy document, but nothing else. All other user agents are by default allowed to see everything. This only protects a site if "BadBot" follows the directives in robots.txt.

Entry:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private
Meaning: In this example, all robots can visit the whole site, with the exception of the two directories mentioned and any path that starts with "private" at the host root directory, including items in privatedir/mystuff and the file privateer.html.

Entry:
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /*/private/*
Meaning: The blank line indicates a new "record" - a new user agent command. BadBot should just go away. All other robots can see everything except any subdirectory named "private" (using the wildcard character).

Entry:
User-agent: WeirdBot
Disallow: /links/listing.html
Disallow: /tmp/
Disallow: /private/

User-agent: *
Allow: /
Disallow: /temp*
Allow: *temperature*
Disallow: /private/
Meaning: This keeps WeirdBot from visiting the listing page in the links directory, the tmp directory, and the private directory. All other robots can see everything except the temp directories or files, but should crawl files and directories named "temperature", and should not crawl private directories. Note that the robots will use the longest matching string, so temps and temporary will match the Disallow, while temperatures will match the Allow.
If you think this is inefficient, you're right.



 

Robots <META> tag:

 In addition to server-wide robot control using robots.txt, web page creators can also specify that certain pages should not be indexed by search engine robots, or that the links on the page should not be followed by robots. The Robots META tag, placed in the HTML <HEAD> section of a page, can specify either or both of these actions.  

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

There are two important considerations when using the robots <META> tag:

  • Robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities, and the email address harvesters used by spammers, will pay no attention to it.
  • The NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot will find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.

 Architecture:

The default values are now assumed to be INDEX, FOLLOW, ARCHIVE, ODP, SNIPPET and YDIR. There is no actual need to include these, unless someone on your internal web team needs reminding.

These values are usually combined into one line for all robots. If they don't understand a directive, they will just ignore it. 


In general, it's better to use the same directives for all robots. While it's possible to include several lines with several robot crawler User-agent names, that might be an indicator of the bad kind of search cloaking (hiding the real page text from the search engine).


If you add Robots META tags to a framed site, be sure to include them on both the enclosing page and the frame content pages. The frameset could have NOINDEX, FOLLOW to avoid picking up any stray text on the frameset page.

Example:

<HEAD>
  <title>Should Not Be Indexed Or Followed</title>
  <META name="robots" content="NOINDEX,NOFOLLOW" />
</HEAD>
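As an illustration of how a crawler might read these directives, here is a short Python sketch using the standard library's html.parser (my own example, not any search engine's actual code):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            for value in (attrs.get("content") or "").split(","):
                self.directives.add(value.strip().lower())

page = '<head><meta name="robots" content="NOINDEX,NOFOLLOW"></head>'
parser = RobotsMetaParser()
parser.feed(page)
print("noindex" in parser.directives)   # True: do not add this page to the index
print("nofollow" in parser.directives)  # True: do not follow links on this page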

META tags and their operations:


Task: The indexer should ignore the content; the robot should follow links.
Entry: <META name="ROBOTS" content="NOINDEX">
Notes: Use this for pages with many links on them but no useful data, such as a site map. Because "follow" is the default, you don't have to include it.

Task: The indexer should include the content; the robot should not follow links.
Entry: <META name="ROBOTS" content="NOFOLLOW,INDEX">
Notes: Use this for pages which have useful content but outdated or problematic links.

Task: The indexer should ignore the content; the robot should not follow links.
Entry: <META name="ROBOTS" content="NOINDEX,NOFOLLOW">
Notes: This is for sections of a site that shouldn't be indexed and shouldn't have links followed. Putting access control in place, such as a password, is much better for security.

Task: The indexer should include the content; the robot should follow links.
Entry: <META name="ROBOTS" content="INDEX,FOLLOW">
Notes: This is the default behavior: you don't have to include these.

Task: Search results pages should not show a "cache" link.
Entry: <META name="ROBOTS" content="NOARCHIVE">
Notes: Useful if the content changes frequently: headlines, auctions, etc. The search engine still archives the information, but won't show it in the results.

Task: Search results pages should not display the Open Directory Project (ODP) title and description for the page.
Entry: <META name="ROBOTS" content="NOODP">
Notes: Danny Sullivan provides good examples of how outdated descriptions and even titles show up when the ODP content is used for search results. This encourages search engines to use the page title tag, matching terms in context, or the META Description tag content instead of the ODP content, which may be misleading or outdated.

Task: Search results pages should not display the Yahoo Directory title and description for the page.
Entry: <META name="ROBOTS" content="NOYDIR"> (Yahoo Slurp robot only)
Notes: Same as above, only for the Yahoo Directory; the other search indexers will ignore it.

Task: Search results pages should not display any description or text context for this page, just the title.
Entry: <META name="ROBOTS" content="NOSNIPPET">
Notes: Encourages the search engines to use the title only, and to suppress the "cache" link. Might be useful if the site has special plus box listings in search results, but otherwise, not so much.
