Your website almost certainly has parts that do not need to be crawled, and you don't want search engines wasting time on them.
If you streamline the work of search engines like Google, making the crawl of your website as fast and focused as possible, they will reach the really important parts that you want to rank.
What is robots.txt?
The robots.txt is a small file that tells search engine bots which parts of a website they can and cannot crawl.
When Google, Bing, Yandex or any other search engine accesses a website, the first thing it does is read the robots.txt (if it has been created).
Based on it, the crawler decides what to analyze according to the directives we have set.
If you have not created one, the crawler will access everything without any restriction, which can sometimes be harmful, especially on websites with a lot of content.
In summary: the robots.txt is not mandatory, but it is highly recommended to optimize it because it will help the ranking of your website.
Why use robots.txt?
In the same way that you should make sure Google does not index unimportant URLs on your website (marking pages such as the "privacy policy", "cookie policy", "legal notice" or URLs with content nobody searches for on Google as noindex), you can use the robots.txt to block Google's access to those URLs or parts of the web more drastically.
We must not confuse noindex with robots.txt, as their functions are different.
Noindex: prevents a given page from appearing in the SERPs or, in other words, tells Google not to index that content.
Robots.txt: blocks access to the marked URLs so that Google cannot read the HTML at all, which includes not being able to read the noindex tag.
Google recommends using noindex to keep URLs out of the SERPs:
If you don't want certain pages to appear in search results, "don't use the robots.txt file to hide your web page", as Google explains.
Although search engines usually obey the directives in the robots.txt, they are not 100% effective: Google or others may ignore the instructions and crawl the blocked URLs.
According to Google, what you put in your robots.txt file are instructions, not rules.
If multiple links point to a page, it is possible that Google will index it and display it in its search results, without knowing what it contains, even if you have blocked it in your robots.txt file.
How to create a robots.txt
To see if a website has a robots.txt file, you just have to add /robots.txt after the domain.
Example: webempresa.com/robots.txt
The creation of the robots.txt file is very simple and you can do it in several ways:
1. Creating a .txt file with what you want to block and uploading it to the root of your website. Open Notepad (or any plain-text editor), write the directives you want and save it with the name robots.txt.
– Now you just have to upload it to the root of your website and that's it.
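As a reference, a very basic robots.txt created this way could contain something like the following (a minimal sketch based on the default WordPress rules covered later in this guide; adapt the directives to your own site):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php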
2. Using a plugin like Yoast SEO. Go to the Yoast tools option and click Create Robots.txt.
As you can see, it creates a pre-designed default robots.txt that you can save, so you have it ready except for adding the sitemap.
To do this, whether you use Yoast or Notepad, you must add the following line:
– If you use the Google XML Sitemaps plugin:
Sitemap: yourwebsite.com/sitemap.xml
– Or if you use the Yoast SEO sitemap:
Sitemap: yourwebsite.com/sitemap_index.xml
To insert the sitemap inside the robots.txt, you just have to copy the full path of your sitemap (including your domain) into the robots.txt.
Remember to paste it at the end of the robots.txt.
How do I tell Google that I have created the robots.txt?
Very simple: go to your Google Search Console and, in the Help section (?), type: robots.txt.
The first result that appears will be "Test your robots.txt file". Click it and the robots.txt tool will open.
Note: Keep in mind that uploading the robots.txt file is not the same as validating it in Google Search Console; to validate it you must have created it beforehand.
Choose a property (your domain) and copy and paste the robots.txt you created in Notepad or Yoast.
Click submit and choose the option "Request the update from Google".
Here is a direct link to the Search Console robots.txt tester.
Make sure it doesn't generate any errors.
Now you just have to open your browser, type yourwebsite.com/robots.txt and check that it shows up.
Google Search Console keeps making improvements and changes to its interface, so this testing option may be moved or added somewhere else in the future.
With this, you will have your robots.txt created and you will be making Google's crawling easier, although you can customize everything as much as you want.
Note: You have to be careful, because every website is different, and a mistake in a simple * or / can make Google not crawl parts that are important.
We are going to look at the robots.txt created by default and the different options for creating a custom one.
How the robots.txt is created by default in WordPress
This is the code created by default:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
* Here you should add the line of your sitemap.
In it we see Disallow (block) and Allow (permit).
There are three lines:
User-agent: *: addresses all search engine bots, so the rules below apply to all of them.
Disallow: /wp-admin/: prevents search engines from wasting time crawling the WordPress admin.
Allow: /wp-admin/admin-ajax.php: within the previous block, search engines are still allowed to crawl admin-ajax.php.
Note: If your website is blocked for search engines (an option in the WordPress settings often used while a website is being designed), your robots.txt will have a Disallow: /
You have to be careful, because if it shows this you are telling Google not to crawl anything on your website at all.
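In that case, the whole file would look something like this, telling every bot not to crawl anything:
User-agent: *
Disallow: /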
The most important part of a robots.txt lies in the wildcards; it is important to know all the symbols you can use, such as the * sign, the $, etc. Let's see them.
Wildcards to use in a robots.txt.
Note: When creating a robots.txt you must respect uppercase, lowercase and spaces.
It is not the same to write /wp-content as /WP-content.
Any mistake in a space or a misplaced symbol can greatly harm your website's rankings.
Hash (#)
This symbol is simply used to write comments explaining what the different lines mean.
Example: #blocking searches or #blocking trackbacks.
This way you keep track of what you want each rule to do, and the file stays better organized.
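For example, a commented snippet could look like this (the directives are just placeholders, taken from the sample robots.txt shown later in this guide):
# Blocking trackbacks
Disallow: /*/trackback/$
# Blocking PDF files
Disallow: /*.pdf$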
User-agent
Indicates which bots you want to target. The normal thing is for all bots to access your website, so by default it is:
User-agent: *
But if, for example, you only want the Google robot to have access, or you want to set a specific rule for Google, you will have to add the line:
User-agent: Googlebot
Everything you add below it will apply exclusively to Googlebot.
Note: It is important to separate each user-agent group with a blank line, because if several User-agent lines are placed together, the rules below them will apply to all of those bots.
If we set up several user-agent groups that match the same bot, the most specific (longest) one takes precedence.
Example:
User-agent: Googlebot-Image
Disallow: /
will prevail over:
User-agent: Googlebot
Disallow:
(an empty Disallow, with no /, allows crawling)
The asterisk (*)
This is a wildcard symbol that represents any sequence of characters .
If, for example, we write /*.pdf we are referring to all URLs that contain .pdf.
It will match both yourwebsite.com/document.pdf and yourwebsite.com/document.pdf?ver=1.1
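Written as a directive it could look like this (note that, without a closing $, it also matches URLs where something comes after the .pdf):
Disallow: /*.pdf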
The use of the asterisk is very important.
Suppose you want to prevent search engines from accessing parameterized product category URLs on your site.
You can do it like this:
User-agent: *
Disallow: /products/t-shirts?
Disallow: /products/shirts?
Disallow: /products/coats?
Or make use of the * (the better option):
User-agent: *
Disallow: /products/*?
By using the *, you are telling search engines not to crawl any parameterized product URL at all.
The dollar symbol ($)
The $ marks the end of a URL: if any character comes after the matched pattern, the rule will not be applied.
If, for example, we write /*.pdf$ we are referring to all URLs that end in .pdf.
This includes yourwebsite.com/document.pdf but excludes yourwebsite.com/document.pdf?ver=1.1
Use the wildcard ” $ ” to mark the end of a URL.
For example, if you want to prevent search engines from accessing all the .pdf files on your site, your robots.txt file might look like this:
User-agent: *
Disallow: /*.pdf$
Noindex and nofollow
Note: Since September 1, 2019, Google no longer supports noindex in the robots.txt; you can and should use it in the meta robots tag or the X-Robots-Tag HTTP header instead.
In the same way, you should not use nofollow in the robots.txt either.
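For reference, those supported alternatives look like this: the meta tag goes in the <head> of the page you want to exclude, and the header is sent in the HTTP response.
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow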
Disallow
This directive prevents search engines from crawling a specific page, category, or structure.
A Disallow prohibits Google from entering that URL.
Note: Even if a page you block with Disallow is also marked as noindex, bots can still index it (the URL), but not its content.
Let me explain: if a page is crawlable and marked as noindex, neither the URL nor its content will appear in the SERPs; but if the page is blocked in the robots.txt, search engines can still include the URL in the SERPs, with a description indicating that it is blocked by robots.txt.
This can happen if, for example, they consider that the page receives inbound links and is of quality.
What happens to the link juice with a Disallow?
The link juice (the strength of a page) is not passed on to another URL if we block the first one in the robots.txt.
I’ll explain:
Example: imagine that we mark
Disallow: /service1/
Service1 receives links from the home page and, in turn, service1 has links to service2.
Service1 will continue to receive that strength, but it will not pass it on to service2, because we have blocked that URL in the robots.txt.
Blocking pages starting with…
If we write Disallow: /testpage we block all the URLs that start with /testpage, but not those that have something in front of it; for that we would need to include a *.
That is, it would block all URLs beginning with /testpage, such as yourwebsite.com/testpage/ or yourwebsite.com/testpage-image/contact.
But if we also want to block URLs such as yourwebsite.com/example-testpage or yourwebsite.com/category/testpage, we would need a * in front:
Disallow: /*testpage
Blocking a folder
If you want to block the /testpage/ folder, you must add a slash at the end of the directive, as follows:
Disallow: /testpage/
In this way we will block all the URLs inside that folder, such as:
yourwebsite.com/testpage/
yourwebsite.com/testpage/images/
But we would not block URLs that do not contain exactly that folder, such as:
yourwebsite.com/test-pictures/portfolio
yourwebsite.com/index/testpage
yourwebsite.com/tests-page
Another example:
Imagine that you clone your website by creating a subfolder on the server called /cop.
If you put:
Disallow: /cop
not only will you be blocking that subfolder, you will also be blocking any page whose URL starts the same way, such as /copy-backup/ or /copier-epson/
The solution is to block only the folder itself by putting a / at the end.
That is to say:
Disallow: /cop/
And, as always, if we want to block all the URLs that contain /testpage/ regardless of their position, we should use:
Disallow: /*/testpage/
And, remembering the $, if what we want is to block all the URLs that end in testpage, we should use:
Disallow: /*testpage$
Allow
The Allow directive is the opposite of Disallow and is used exclusively to allow access to specific parts previously blocked by a Disallow.
Example: it is normal to block the /wp-content/plugins/ folder, since we do not want search engines to waste time there, but Google, for example, indicates that it must have access to the .css and .js files.
As these files live in that folder, we must give permission to crawl them as follows:
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.js
Allow: /wp-content/plugins/*.css
Imagine that you want to block the entire blog except one post; you can apply the following:
User-agent: *
Disallow: /blog
Allow: /blog/post-allowed
How to optimize the robots.txt to the maximum
There is no fixed rule, and you have to be careful when copying the robots.txt of other websites, as it can be counterproductive.
An example of a standard robots.txt with some rules could be the following:
# Block or allow access to attached content. (If the installation is in /public_html).
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /wp-admin/
# Prevent access to the different feeds generated by the site
Allow: /feed/$
Disallow: /feed
Disallow: /comments/feed
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
# Prevent URLs ending in /trackback/ that serve as Trackback URLs.
Disallow: /*/*/*/trackback/$
# Avoid blocking CSS and JS files.
Allow: /*.js$
Allow: /*.css$
# Block all PDFs
Disallow: /*.pdf$
# Block parameters
Disallow: /*?
# List of bots you should allow.
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
# List of blocked bots
User-agent: MSIECrawler
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: HTTrack
Disallow: /
User-agent: Microsoft.URL.Control
Disallow: /
User-agent: libwww
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: GurujiBot
Disallow: /
User-agent: hl_ftien_spider
Disallow: /
User-agent: sogou spider
Disallow: /
User-agent: Yeti
Disallow: /
User-agent: YodaoBot
Disallow: /
# Disallow unnecessary pages
Disallow: /thanks-for-subscribing
# We add an indication of the location of the sitemap
Sitemap: https://yourwebsite.com/sitemap_index.xml
Note: You may also want to disallow comments, tags, etc. Every website is different, but think about whether you want search engines to waste time crawling those sections.
Do you need a robots.txt file?
This is the question you may be asking yourself.
As I mentioned, having a robots.txt is not essential for small sites, although my recommendation is that you use it, because it can help improve your rankings.
A well-built robots.txt can help you with:
Maintenance tasks, by letting you temporarily include a Disallow: /
Preventing server overload, since there will be fewer requests to unnecessary pages.
Preventing Google from wasting your crawl budget, increasing the likelihood of longer and better access to the relevant pages.
Avoiding duplicate content. In an online store, for example, you can prevent pages such as the checkout or the cart from being crawled (Disallow: /checkout/ and Disallow: /cart/), as shown in the snippet after this list.
…
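A minimal sketch of that online-store case, assuming the checkout and cart pages live under /checkout/ and /cart/ (adapt the paths to your own shop):
User-agent: *
Disallow: /checkout/
Disallow: /cart/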
Also, the robots.txt is independent per subdomain.
In other words, if you have created a subdomain, you have to create a specific robots.txt for it. For example, yourwebsite.com/robots.txt does not apply to blog.yourwebsite.com; the subdomain needs its own file at blog.yourwebsite.com/robots.txt.
Conclusion
The robots.txt can help a lot with the crawling of your website, but you have to make sure that it works well.
A simple comma or a wrong capital letter can do significant SEO damage.
Whether it is necessary on every site depends. You should know that search engines usually crawl small websites with simple architectures without any problem.
There are even well-known SEO websites that say not to use a robots.txt, since Google is smart enough to understand a website on its own.
However, I always say that in SEO everything helps, no matter how small.
If you can make Google prioritize and understand your site better, saving it time, my recommendation is that you use a sensible robots.txt without going overboard.
That's it for this guide on robots.txt. I recommend that you work on your robots.txt file and make crawling easier!