Robots.txt Guide
All About Robots.txt
Robots.txt is a file that tells search engine crawlers not to crawl certain sections or pages of a website. Most major search engines, including Google, Bing, and Yahoo, recognize and honor robots.txt requests.
How a Robots.txt File Works
A robots.txt file is a plain text file with no HTML markup, which explains the .txt extension. The file is hosted on the web server just like any other website file. In fact, the robots.txt file for a given website can usually be viewed by typing the homepage’s full URL and adding /robots.txt. The file is not linked from anywhere else on the website, so users are unlikely to stumble upon it; however, most web crawler bots look for the file before crawling any other part of the site.
Although a robots.txt file gives instructions to bots, it has no ability to enforce them. Good bots, such as web crawlers and news feed bots, will visit the robots.txt file first and follow its instructions before viewing other pages on the domain. Bad bots, meanwhile, will simply ignore the file, or even process it to find the forbidden webpages.
A web crawler bot will follow the most specific set of instructions in the robots.txt file. If the file contains contradictory commands, the bot will adhere to the more granular one.
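You can check paths against a rule set with Python’s standard urllib.robotparser module. One caveat: Google resolves conflicting rules by the longest (most specific) match, while Python’s parser applies rules in file order, so the sketch below lists the more specific Allow rule first so both interpretations agree. The paths and bot name are hypothetical.

```python
from urllib import robotparser

# A rule set with a broad Disallow and a more granular Allow.
# The specific Allow is listed first so order-based parsers
# (like urllib.robotparser) agree with Google's longest-match rule.
rules = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The granular Allow wins for the specific page...
print(rp.can_fetch("MyBot", "https://example.com/private/public-page.html"))  # True
# ...while the rest of /private/ stays blocked.
print(rp.can_fetch("MyBot", "https://example.com/private/secret.html"))       # False
```

Paths not covered by any rule remain crawlable by default.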
One important thing to note is that every subdomain needs its own robots.txt file.
Importance of Robots.txt
Many websites don’t really require a robots.txt file. This is because Google can often find and index all of your website’s important pages.
Google will also automatically avoid indexing pages that are unimportant or that are merely duplicate versions of other pages.
If you don’t want to handle the technical side of SEO yourself, get effective full-service SEO strategies from a leading SEO agency in Thailand like us, today!
That said, there are three primary reasons why you would want to use a robots.txt file.
- Block pages that are non-public.
There are times when there are pages on your website that you don’t want indexed. For instance, you might have a login page or a staging version of a particular page. These pages need to exist, but you don’t want just anyone landing on them. This is where robots.txt comes in handy for blocking such pages from search engine bots and crawlers.
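For example, a robots.txt that asks crawlers to stay out of a hypothetical login page and staging area (the paths here are illustrative, not a prescribed layout) might look like this:

```
User-agent: *
Disallow: /login
Disallow: /staging/
```

Keep in mind that, as noted above, this only asks bots to stay away; it does not password-protect anything, so truly sensitive pages still need real access controls.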
- Make best use of your crawl budget.
If you find it hard to get all of your pages indexed, you might have a crawl budget problem. By using robots.txt to block unimportant pages, you let Googlebot spend more of your crawl budget on the pages that really matter.
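As a sketch, internal search result pages are a common crawl budget sink; assuming a hypothetical /search path, blocking them for Googlebot might look like:

```
User-agent: Googlebot
Disallow: /search
```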
- Avoid indexing of resources.
Meta directives work just as well as robots.txt for preventing pages from being indexed. However, meta directives don’t work well for multimedia resources such as images and PDFs. This is where robots.txt enters the picture.
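Google’s robots.txt implementation supports the * and $ wildcards, so one way to keep PDFs and an image directory out of the crawl (the paths are hypothetical, and wildcard support varies between crawlers) is:

```
User-agent: *
Disallow: /*.pdf$
Disallow: /images/
```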
The point here is that robots.txt informs search engines not to crawl certain pages on your site.
You can refer to Google Search Console to check how many indexed pages you have. If that number matches the number of pages you want indexed, there is no need to bother with a robots.txt file.
However, if the number is higher than you expected and you find URLs indexed that shouldn’t be, then it is time to create a robots.txt file for your website.
Best Practices for Robots.txt Files
- Come up with a robots.txt file.
The first best practice is to create the robots.txt file itself. Since this is a plain text file, you can use Windows Notepad or any text editor to create one. Whichever tool you use, the format remains exactly the same.
The user-agent line names the specific bot you are addressing, while everything that comes after disallow lists the sections or pages you want to block.
You can also use an asterisk as the user-agent to address any and all bots that stop by your site.
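Putting those pieces together, a minimal robots.txt might address one bot by name and then everyone else with the asterisk (the blocked paths are hypothetical examples):

```
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: *
Disallow: /admin/
```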
This is only the first of many ways to use a robots.txt file. You can refer to Google’s helpful guide for more information on the different rules you can use to block or allow bots from crawling various pages of your website.
- Ensure that your robots.txt file can be found easily.
Once you have created your robots.txt file, the next step is to make it go live. Technically, you can place the file in any main directory of your website.
However, to increase the chances of your robots.txt file being found, it is recommended to place it at https://example.com/robots.txt. Remember that robots.txt filenames are case sensitive, so make sure the filename uses a lowercase “r”.
- Watch out for mistakes and errors.
It is extremely important to set up your robots.txt file correctly. A single mistake can get your entire site deindexed.
The good news is that you don’t have to simply hope you got the code right. Google’s handy robots.txt testing tool shows you your robots.txt file along with any warnings and errors it finds.
- Learn the difference between meta directives and robots.txt.
Why bother with robots.txt when you can block pages at the page level with the noindex meta tag? Because implementing the noindex tag can be very tricky on multimedia resources such as PDFs and videos.
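For reference, a page-level noindex is just a meta tag in the HTML head:

```html
<!-- Place inside the <head> of any page you want kept out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs and videos, the same directive can instead be sent as an HTTP response header (X-Robots-Tag: noindex), which is one way around the limitation described above, though it requires server configuration.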
If you need to block thousands of pages, it is sometimes easier to block an entire section of the site with robots.txt than to add a noindex tag to each page manually.
There are also edge cases where you don’t want to waste crawl budget on Google landing on pages that carry a noindex tag.
Outside of these three edge cases, it is recommended to use meta directives instead of robots.txt. They are easier to implement, and there is less chance of a disaster such as your entire website being blocked.