Setting up robots.txt: what should be hidden from search robots on your site?

A robots.txt file is a set of directives (rules for robots) that can be used to prohibit or allow crawlers to index certain sections and files of your site, and to provide additional information. Originally, robots.txt could only be used to prohibit indexing of sections; the ability to explicitly allow indexing appeared later and was introduced by the search leaders Yandex and Google.

Robots.txt file structure

First comes the User-agent directive, which specifies which search robot the following instructions apply to.

A small list of well-known and frequently used User-agents:

  • User-agent: *
  • User-agent: Yandex
  • User-agent: Googlebot
  • User-agent: Bingbot
  • User-agent: YandexImages
  • User-agent: Mail.RU

Next come the Disallow and Allow directives, which respectively prohibit or allow indexing of sections, individual pages, or files of the site. These blocks are then repeated for the next User-agent. At the end of the file, the Sitemap directive specifies the address of your sitemap.

When writing the Disallow and Allow directives, you can use the special characters * and $. Here * stands for "any sequence of characters" and $ stands for "end of line". For example, Disallow: /admin/*.php prohibits indexing of all files that are in the admin folder and end with .php, while Disallow: /admin$ disallows the address /admin but does not disallow /admin.php or /admin/new/ if they exist.
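As an illustration, a minimal sketch combining these special characters (the /admin/ paths are placeholders, not taken from a real site):

    User-agent: *
    Disallow: /admin/*.php    # .php files inside the admin folder
    Disallow: /admin$         # exactly /admin, but not /admin.php or /admin/new/
    Allow: /admin/public      # anything starting with /admin/public stays open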

If the same set of directives applies to all robots, there is no need to duplicate it for each of them; a single User-agent: * block is enough. When the rules need to be supplemented for a particular user-agent, duplicate the common directives in a separate block for it and add the new ones there.
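A minimal sketch of this structure, assuming hypothetical section names and a placeholder domain:

    User-agent: *
    Disallow: /admin/
    Disallow: /tmp/

    User-agent: Yandex
    Disallow: /admin/         # the common rules are duplicated for this robot
    Disallow: /tmp/
    Disallow: /promo/         # plus a rule that applies to Yandex only

    Sitemap: https://example.com/sitemap.xml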

Example robots.txt for WordPress:
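A minimal sketch of a commonly used WordPress template, assuming default WordPress paths and a placeholder domain; the exact rules depend on your theme and plugins:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php   # keep the AJAX endpoint available
    Disallow: /wp-login.php
    Disallow: /xmlrpc.php
    Disallow: /*?s=                   # site search results

    User-agent: Yandex
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    Disallow: /wp-login.php
    Disallow: /xmlrpc.php
    Disallow: /*?s=
    Clean-param: utm_source&utm_medium&utm_campaign /

    Sitemap: https://example.com/sitemap.xml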

* Note for User-agent: Yandex

  • To have the Yandex robot index URLs without GET parameters (for example: ?id=, ?PAGEN_1=) and UTM tags (for example: &utm_source=, &utm_campaign=), you must use the Clean-param directive (see the sketch after these notes).

  • Previously, the Host directive could be used to tell the Yandex robot the address of the site's main mirror, but this method was abandoned in the spring of 2018.

  • It was also previously possible to tell the Yandex robot how often to access the site using the Crawl-delay directive. But as reported in the Yandex webmaster blog:

    • After analyzing letters to our support service about indexing issues over the past two years, we found that one of the main reasons documents load slowly is an incorrectly configured Crawl-delay directive.
    • So that site owners no longer have to worry about this, and so that all the really necessary pages of a site appear and update in search quickly, we decided to abandon the Crawl-delay directive.

    Instead of this directive, a new "Crawl speed" section was added to Yandex.Webmaster.
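A minimal sketch of how the Clean-param directive mentioned above might look; the parameter names come from the examples in this article, while the /catalog/ prefix is hypothetical:

    User-agent: Yandex
    Clean-param: id&PAGEN_1 /catalog/        # ignore ?id= and ?PAGEN_1= under /catalog/
    Clean-param: utm_source&utm_campaign /   # ignore UTM tags site-wide

The directive is specific to Yandex; other search engines simply ignore it.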

Robots.txt check

Old version of Search console

To check that robots.txt is set up correctly, you can use Google Search Console: go to the "Crawl" section, then "Fetch as Google", and click the "Fetch and render" button. As a result of the scan, two screenshots of the page will be shown: how users see it and how the search robot sees it. Below them is a list of files that are blocked from crawling and therefore prevent search robots from reading your site correctly (these will need to be allowed for the Google robot).

Checking robots.txt in the old version of the search console

Typically these are style files (CSS), JavaScript, and images. After you allow these files to be crawled, both screenshots in Search Console should be identical. The exceptions are files hosted remotely, such as the Yandex.Metrica script, social media buttons, etc.; you will not be able to allow or prohibit crawling of those. You can read more about how to fix the "Googlebot cannot access CSS and JS files on the site" error in our blog.

New version of Search console

The new version does not have a separate menu item for checking robots.txt. Now you just need to enter the address of the page you want to check into the search bar.

Robots.txt check in the new search console

In the next window, click “Explore the scanned page”.

Examine the scanned page

Next, click on the page resources.

Page resources

In the window that appears, you can see the resources that, for one reason or another, are unavailable to the Google robot. In this particular example, there are no resources blocked by the robots.txt file.

Unavailable page resources

If there are such resources, you will see messages of the following form:

Page resources blocked by robots.txt

Recommendations on what to close in robots.txt

Each site has a unique robots.txt, but some common features can be highlighted in the following list:

  • Authorization, registration, password recovery, and other technical pages.
  • The resource's admin panel.
  • Sorting pages and pages that change how information is displayed on the site.
  • For online stores: shopping cart and favorites pages. You can read more in the Yandex blog's indexing tips for online stores.
  • The site search page.

This is just a rough list of what can be closed from indexing by search engine robots. Each case needs to be considered individually, and in some situations there may be exceptions to these rules.
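As an illustration only, a hedged sketch of what such rules might look like; every path here is a hypothetical placeholder and should be replaced with the real paths of your site:

    User-agent: *
    Disallow: /admin/            # admin panel
    Disallow: /login/            # authorization
    Disallow: /register/         # registration
    Disallow: /password-reset/   # password recovery
    Disallow: /*?sort=           # sorting pages
    Disallow: /cart/             # shopping cart (online stores)
    Disallow: /favorites/        # favorites
    Disallow: /search/           # site search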

Conclusion

The robots.txt file is an important tool for regulating the relationship between a site and search engine robots, so it is worth taking the time to set it up properly.

The article contains a lot of information about Yandex and Google robots, but this does not mean that you need to create a file only for them. There are other robots – Bing, Mail.ru, etc. You can supplement robots.txt with instructions for them.

Many modern CMSs create a robots.txt file automatically, and it may contain outdated directives. Therefore, after reading this article, I recommend that you check the robots.txt file on your site and, if such directives are present, delete them. If you do not know how to do this, please contact us for help.
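For example, the deprecated directives discussed above might appear in an automatically generated file like this (the domain and value are placeholders); both lines should simply be removed:

    Host: https://example.com    # abandoned by Yandex in spring 2018
    Crawl-delay: 2               # abandoned; use the "Crawl speed" section in Yandex.Webmaster instead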
