Robots.txt is a text file that tells crawlers which files or URLs of the site should not be crawled. For this reason robots.txt is known as the robots exclusion protocol, unlike sitemap.xml, which acts as an inclusion protocol because it tells crawlers which URLs should be crawled.
We have already learnt about sitemap.xml in our previous video.
Robots.txt is located in the root folder of the site, named “robots.txt” all in lowercase. If this file is not found, crawlers will crawl all the URLs & files.
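For example, for a site whose root is https://www.example.com (a placeholder domain), crawlers would look for the file at:

https://www.example.com/robots.txt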
But why would you need to hide URLs or files from crawlers? There are some bad crawlers from which you want to hide your site. There are some files which are still in development and which you don’t want to show until development is done. And sometimes a site has an admin panel to manage its data, which should not be available to the public, so it should not be crawled.
Now, we will see how to prevent certain areas of the site from being crawled. For this, you will have to learn how to write robots.txt.
There are 5 directives which are used in robots.txt: User-agent, Disallow, Allow, Crawl-delay & Sitemap.
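Before looking at each directive in detail, here is a minimal sketch of how the five fit together (the paths and the sitemap URL are placeholders, not taken from a real site):

User-agent: *
Crawl-delay: 10
Allow: /public/
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml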
In the User-agent directive, you define the name of the user agent to which the rules should apply, e.g. Googlebot, Bingbot, etc. If you want the rules to apply to all user agents, you can just put the * symbol. You can specify multiple user agents with different rules for each one. Or, if you have the same rule for multiple user agents, you can give this directive multiple times with one Disallow directive, as shown below. You can also download a list of bad user agents & disallow them in your robots.txt.
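A small sketch (the bot names here are made up for illustration) that applies one Disallow rule to two user agents and leaves the site open for everyone else:

User-agent: BadBot
User-agent: EvilScraper
Disallow: /

User-agent: *
Disallow: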
In the Disallow directive, you specify the folders or files which you want to disallow. You can have multiple Disallow directives for one User-agent. You can also use wildcard entries. For example, if you want to disallow all URLs ending with .pdf, you can specify the Disallow directive as /*.pdf$. Please note that if you specify the entry without the $ sign, it will disallow every URL which contains .pdf, not just URLs ending with .pdf; for example, it will also hide ticket.pdf?id=123. So, if you want to hide your documentation or data files from being searched, you can use this wildcard. For example, you can hide your Excel and Word files with /*.xls$ & /*.doc$.
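A sketch of such wildcard rules (note that the * and $ wildcards are honored by major crawlers like Googlebot and Bingbot but are not part of the original standard):

User-agent: *
Disallow: /*.pdf$
Disallow: /*.xls$
Disallow: /*.doc$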
If you want to allow a specific file or directory from an otherwise disallowed directory, you can use the Allow directive. For example, if you want to disallow all files from the banners directory except homebanner.jpg, you can specify the directives as shown below. Please note that in this case it is safest to mention the Allow directive before the Disallow directive, since some crawlers process rules in order (Google instead applies the most specific matching rule).
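A sketch of the banners example, assuming the directory is served at /banners/:

User-agent: *
Allow: /banners/homebanner.jpg
Disallow: /banners/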
In the Crawl-delay directive, you can specify the number of seconds a crawler should wait between requests before loading and crawling page content. Since this value is not part of the standard, its interpretation depends on the crawler reading it; Google, for instance, ignores this directive.
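For example, to ask a crawler to wait roughly 10 seconds between requests (Bingbot is just one crawler that honors this directive):

User-agent: Bingbot
Crawl-delay: 10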
In the Sitemap directive, you should specify the absolute URL of your sitemap.xml. Please note that this directive is written independently; it is not tied to a User-agent directive. It is conventionally placed at the end of the file. You can have multiple Sitemap directives if you have multiple sitemaps for your site.
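For example (the URLs below are placeholders for your own sitemap locations):

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-images.xml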
Out of these 5 directives, User-agent & Disallow are the main ones; the other directives are optional.
While writing robots.txt, you can add comments in the file for your own understanding. Comments start with the ‘#’ symbol and can be written at the start of a line or after a directive.
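For example (the /staging/ path here is hypothetical):

# Block the staging area for all crawlers
User-agent: *
Disallow: /staging/  # still under development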
Remember that you have to save robots.txt in the root directory of your site, and if you have a subdomain then you have to keep a separate robots.txt for it. Also, each protocol and port needs its own robots.txt file, as illustrated below.
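For example, each of the following (placeholder) origins would need its own robots.txt file:

http://example.com/robots.txt
https://example.com/robots.txt
https://example.com:8080/robots.txt
https://blog.example.com/robots.txt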