A Beginner's Guide to Your robots.txt File
Posted by drew on July 20, 2014, 6 p.m.
Google has announced that they have updated the robots.txt testing tool in their webmaster tools area. This is great for those of us who are used to writing a robots.txt file, but what about site owners who don't even know what a robots.txt file is, let alone how to work with it to improve user experience and SEO? Here is a foundational crash course in the robots.txt file.
What is a robots.txt file?
A robots.txt file is used by webmasters to tell search engines which sections or pages on their sites should be crawled and indexed, and which parts should be ignored. It is just a simple text file, and can look something like this:
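A minimal two-line example, the one walked through in the following paragraphs:

```
User-agent: *
Disallow: /admin/
```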
What do these lines mean? The first line, the user-agent line, specifies which search engine bots you are talking to. By including the star symbol (*), you indicate that you want to address any search engine crawler that comes along.
You could, for instance, choose to address only Google's web spider (called Googlebot). In general, it is better to address all robots at once, because you will rarely want them to behave differently for different search engines.
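If you did want to give Googlebot its own rules, the file would name it explicitly in the user-agent line (the disallowed path here is just an illustration):

```
User-agent: Googlebot
Disallow: /admin/
```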
The second line of this file indicates that the web spiders who come to the site should ignore any pages in the /admin/ subdirectory. This would mean that none of these pages would be considered in SEO ranking or included in the search results.
Why would you want to exclude certain pages?
Sometimes a client of mine asks me why excluding pages would be a good choice, presumably because they think it decreases their chances of having a page rank in the search engines.
However, there are often pages on a site you would never want to show up in a search engine result. In the above case, the /admin/ page of the site is only useful to those who manage the site, so if a Google user were to find it, it would provide a terrible user experience.
In addition to this, there is no useful content on the admin login page, so it would just look like a page with thin content to a search engine. Too many thin-content pages can hurt your domain-wide SEO, so I would recommend disallowing these via a robots.txt file.
But don’t get crazy! Don’t just chop away at your site!
When thinking about your SEO, you usually think about landing pages, and how you’d like users to flow through your site. It would be a bad idea to simply disallow all pages that aren’t your designated “landing pages.” While this would ensure that search engine users don’t enter your site on a page you don’t like, it would also make your site look much smaller.
Search engines only read and index the pages you allow them to in a robots.txt file. (Some actually say they crawl every page but ignore the content on disallowed pages; the effect is the same.) So removing large portions of your site from the search engine index makes it seem, in the robots' eyes, that your site is smaller and has much less content and authority than it really does.
When choosing which pages to allow and disallow, take into account user experience and SEO. Also, be sure that you accurately list the pages you want to disallow, so you don’t end up removing whole sections you don’t want to.
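One way to double-check your rules before deploying them is Python's standard-library robots.txt parser. This is a sketch, not a required step; the rules and URLs are hypothetical examples:

```python
from urllib import robotparser

# The rules you plan to put in your robots.txt file (example rules).
rules = [
    "User-agent: *",
    "Disallow: /admin/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Verify that only the intended section is blocked.
print(rp.can_fetch("*", "http://example.com/admin/login"))   # False: blocked
print(rp.can_fetch("*", "http://example.com/blog/post-1"))   # True: still crawlable
```

Running a few representative URLs through `can_fetch` this way catches overly broad Disallow lines before a search engine ever sees them.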
This post doesn’t cover any of the steps for creating and implementing a robots.txt file. If you are a beginner with this technology, it is best to have a programmer implement the file, as they will be more experienced in creating it, as well as the wildcard patterns used within it.
This was only meant to give new site owners an introduction to the ways in which a robots.txt file can help or hurt their user experience and SEO.