Your website almost certainly has sections that don’t need to appear in search results, and you don’t want search engines to waste time crawling them.
If you make the crawl of your website as easy and fast as possible for search engines like Google, they will reach the really important parts that you want to rank.
What is robots.txt?
robots.txt is a small file that lets you tell search engine bots which parts of a website they can and cannot crawl.
When Google, Bing, Yandex or any other search engine accesses a website, the first thing it does is look for the robots.txt (if one has been created).
Based on the directives we have set there, it decides what to crawl.
If you have not created one, it will crawl the site without any directive or prohibition to guide it, which can sometimes be harmful, especially on websites with a lot of content.
In summary: the robots.txt is not mandatory, but optimizing it is highly recommended because it will help the positioning of your website.
Why use robots.txt?
In the same way, you must make sure that Google does not index unimportant URLs on your website, marking them as noindex.
We must not confuse noindex with robots.txt, as their functions are different.
Noindex: Keeps certain pages out of the SERPs or, what amounts to the same thing, tells Google not to index that content.
Robots.txt: Blocks access to the marked URLs so that Google cannot read the HTML, which includes reading the noindex.
Google recommends using noindex to keep URLs out of the SERPs:
If you don’t want certain pages to appear in search results, “don’t use the robots.txt file to hide your web page”, as Google explains.
Although the directives in the robots.txt are usually obeyed by search engines, they are not 100% effective: Google or other engines can ignore the instructions and crawl the blocked URLs.
According to Google, the information you give in your robots.txt file is a set of instructions, not rules.
If multiple links point to a page, it is possible that Google will index it and display it in its search results without knowing what it contains, even if you have blocked it in your robots.txt file.
How to create a robots.txt
To see if a website has a robots.txt file, you just have to add /robots.txt after the domain.
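As a minimal sketch of this check, the following Python helper builds the robots.txt address for any page of a site. The function name robots_txt_url is illustrative, and yourwebsite.com is a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt lives at the root of the scheme + host,
    # whatever page of the site we start from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder URL: any page of the site leads to the same robots.txt.
print(robots_txt_url("https://yourwebsite.com/blog/some-post?ref=1"))
```

Pasting the printed address into the browser shows the file if it exists.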
Creating the robots.txt file is very simple and you can do it in several ways:
1. Creating a .txt file noting what you want to block and uploading it to the root of your website.
– Open a text editor such as Notepad, write the directives you want and save the file with the name robots.txt.
– Now you just have to upload it to the root of your website and that’s it.
2. Using a plugin like Yoast SEO.
– Access the Yoast Tools option and click Create robots.txt.
As you can see, it creates a predefined robots.txt by default that you can save, so you already have it created except for the sitemap.
To indicate the sitemap, whether you use Yoast or a text editor, you must add the following line:
– If you use the Google XML Sitemaps plugin:
Sitemap: https://yourwebsite.com/sitemap.xml
– Or if you use Yoast SEO:
Sitemap: https://yourwebsite.com/sitemap_index.xml
To insert the sitemap in the robots.txt you just have to copy the full path of your sitemap, including your domain, into the robots.txt.
Remember to paste it at the end of the file.
How to tell Google that I have created the robots.txt
Very simple: access your Google Search Console and, in the Help section (?), type: robots.txt.
The first result to appear will be “Test your robots.txt file”. Click it and the robots.txt tool will open.
Note: Keep in mind that uploading the robots.txt file is not the same as validating it in Google Search Console; to validate it you must have created it beforehand.
Choose a property (your domain) and copy and paste the robots.txt created in the text editor or Yoast.
Click Submit and choose the option “Request the update from Google”.
I leave you a direct link to the Robots.txt validator of Search Console
Make sure it doesn’t generate an error.
Now you just have to open the browser, type yourwebsite.com/robots.txt and check that it shows up.
Google Search Console keeps making improvements and changes to its interface, so this tester option may change or move somewhere else later.
With this, you will have your robots.txt created and you will be making crawling easier for Google, although everything is customizable to your needs.
Note: You have to be careful because every website is different, and an error in a simple * or / can stop Google from crawling parts that are important.
We are going to see the robots.txt created by default and the different options for creating a custom one.
How is the robots.txt created by default in WordPress
This is the default code created:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Here you should add the line of your sitemap.
In it we see disallow (do not allow) and allow (allow).
In it there are 3 lines:
User-agent: * addresses all search engines, so the rules below apply to every bot that crawls your website.
Disallow: /wp-admin/ prevents search engines from wasting time crawling the WordPress admin.
Allow: /wp-admin/admin-ajax.php makes an exception within the above prohibition: search engines can still crawl admin-ajax.php.
Note: If your website was blocked for search engines (an option in the WordPress settings, typically used while designing a website), your robots.txt would contain Disallow: /
You have to be careful, because if it shows this you are telling Google not to crawl anything on your website at all.
The most important part of a robots.txt is the wildcards: it is important to know all the symbols you can use, such as the * sign, the $ sign, etc. Let’s see them.
Wildcards to use in a robots.txt.
Note: When creating a robots.txt you must respect uppercase, lowercase and spaces.
Indicating /wp-content is not the same as indicating /WP-content.
Any misplaced space or symbol can greatly harm the positioning of your website.
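To illustrate how such slips can be caught before uploading the file, here is a small Python sketch that flags misspelled directives and paths containing spaces. The function name find_suspicious_lines and its checks are assumptions made for this example, not an official validator, and it cannot detect a wrong capital letter, because /wp-content and /WP-content are both syntactically valid.

```python
def find_suspicious_lines(robots_txt: str) -> list[str]:
    """Return warnings for lines that look mistyped (illustrative checks only)."""
    known = {"user-agent", "disallow", "allow", "sitemap"}
    warnings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep or field not in known:
            warnings.append(f"line {number}: unknown directive")
        elif field in {"allow", "disallow"}:
            if " " in value:
                warnings.append(f"line {number}: space inside path")
            elif value and not value.startswith(("/", "*")):
                warnings.append(f"line {number}: path should start with /")
    return warnings


# Sample file with two deliberate mistakes (a spaced path and a typo).
sample = (
    "User-agent: *\n"
    "Disallow: / wp-admin /\n"
    "Disalow: /tmp/\n"
    "Allow: /wp-admin/admin-ajax.php\n"
)
for warning in find_suspicious_lines(sample):
    print(warning)
```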
The hash symbol (#)
This symbol is simply used to add comments explaining what the different lines mean.
For example: #blocking searches or #blocking trackbacks.
In this way you will keep control of what you want to indicate and the file will be more organized.
User-agent
Indicates which bots you want to target. Normally you want all bots to access your website, so by default it is:
User-agent: *
But if, for example, you only want the Google robot to have access, or you want to establish a specific rule for Google, you will have to add the line:
User-agent: Googlebot
Everything you add below will apply exclusively to the Googlebot.
Note: It is important that the groups for different user-agents are separated by a blank line, because if the User-agent lines are stacked together, the rules below them will apply to all of them.
If we set up several user-agent groups that match the same bot, the most specific (longest) one is the one that applies.
shall prevail over:
(allows tracking when not carrying
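The idea that the most specific user-agent group wins can be sketched in Python as follows. This is a simplified model, assuming a group applies when its name is the start of the crawler name; real crawlers may match names differently. The function pick_group, the groups dictionary and its paths are all hypothetical.

```python
def pick_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    """Return the rules of the most specific group that matches crawler."""
    crawler = crawler.lower()
    # Fall back to the catch-all group when nothing more specific matches.
    best_token, best_rules = "", groups.get("*", [])
    for token, rules in groups.items():
        name = token.lower()
        # Among the matching groups, the longest user-agent name wins.
        if name != "*" and crawler.startswith(name) and len(name) > len(best_token):
            best_token, best_rules = name, rules
    return best_rules


# Hypothetical groups, as they might be read from a robots.txt file.
groups = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /no-google/"],
    "Googlebot-Image": ["Disallow: /photos/"],
}
print(pick_group(groups, "Googlebot-Image"))
print(pick_group(groups, "Bingbot"))
```

In this model Googlebot-Image follows only its own group, while a bot with no group of its own falls back to the rules under *.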
The asterisk (*)
This is a wildcard symbol that represents any sequence of characters.
If, for example, we indicate /*.pdf we will be referring to all the URLs that contain .pdf.
It will be valid for both yourwebsite.com/document.pdf and yourwebsite.com/document.pdf?ver=1.1
The use of the asterisk is very important.
Suppose you want to prevent search engines from accessing parameterized product category URLs on your site.
You can do it like this:
Disallow: /products/t-shirts?
Disallow: /products/shirts?
Disallow: /products/coats?
Or make use of the * (best option):
Disallow: /products/*?
By using the * you will be telling search engines not to crawl any product URL with parameters.
The dollar symbol ($)
This symbol marks the end of a URL: if the URL has any character after the point where the $ is placed, the rule will not be applied.
If, for example, we indicate /*.pdf$ we will be referring to all the URLs that end with .pdf.
This includes yourwebsite.com/document.pdf but excludes yourwebsite.com/document.pdf?ver=1.1
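The behaviour of both wildcards can be checked with a short Python sketch that translates a robots.txt pattern into a regular expression. The function name rule_matches is an assumption of this example, and it models only the * and $ semantics described here, not a complete robots.txt parser.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path (query included)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each * stand for any characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        # A trailing $ means the URL must end exactly where the pattern ends.
        return re.fullmatch(regex, path) is not None
    # Otherwise the pattern only has to match the beginning of the URL.
    return re.match(regex, path) is not None


for pattern in ("/*.pdf", "/*.pdf$"):
    for path in ("/document.pdf", "/document.pdf?ver=1.1"):
        print(pattern, path, rule_matches(pattern, path))
```

Only the combination of /*.pdf$ with /document.pdf?ver=1.1 prints False, which is the exclusion described above.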
Use the wildcard “$” to mark the end of a URL.
For example, if you want to prevent search engines from accessing all the .pdf files on your site, your robots.txt file might look like this:
User-agent: *
Disallow: /*.pdf$
Noindex and nofollow
Note: Since September 1, 2019 it is not recommended to mark noindex in the robots.txt, although you can and should use it in the meta robots tag or the X-Robots-Tag HTTP header instead.
In the same way, you should not use nofollow in the robots.txt either.
Disallow prevents search engines from crawling a specific page, category, or structure.
In other words, disallow prohibits Google from entering it.
Note: Even if a page you block with disallow is also marked as noindex, bots can still index its URL, but not its content.
Let me explain: if you block a page with disallow, search engines cannot read it (including its noindex), so its internal content will not be shown in the SERPs, although they can still include the URL in the SERPs with a description indicating that it is blocked by robots.txt.
This can happen if they consider, for example, that the page has inbound links and is of quality.
What happens to the Linkjuice in a Disallow?
The linkjuice (the strength of a page) will not be passed on to other URLs if we block the page that links to them via robots.txt.
Example: Imagine that we set
Disallow: /service1/
Service1 receives links from the home page and, in turn, service1 links to service2.
Service1 will continue to receive its strength, but it will not pass it on to service2 because we have blocked that URL via robots.txt.
Blocking pages starting with …
Disallow: /testpage
This blocks all the URLs that start with /testpage but not those that have something in front of it; for that you would need to include a *.
That is, it would block all URLs that start with /testpage, such as yourwebsite.com/testpage/ or yourwebsite.com/testpage-image/contact.
But we would need a * in front (Disallow: /*testpage) if we want to block, for example, yourwebsite.com/example-testpage or yourwebsite.com/category/testpage.
If you want to block the testpage folder, you must place a slash at the end of the directive as follows:
Disallow: /testpage/
In this way we will block all the URLs inside said folder, such as:
yourwebsite.com/testpage/
yourwebsite.com/testpage/contact
But we would not block those URLs that do not contain exactly that folder, for example:
yourwebsite.com/testpage-image/contact
Imagine that you clone your website by creating a subfolder on the server called /cop.
If you put:
Disallow: /cop
not only will you be blocking the subfolder, you will also be blocking pages like /copy-of-backup/, /copper-pans or /copier-epson/
The solution is to block the entire folder by putting a / at the end.
That is to say:
Disallow: /cop/
And as always, if we want to block all the URLs that contain /testpage/ regardless of its position, we must use the following:
Disallow: /*/testpage/
And remembering the $, if what we want is to block all the URLs that end in testpage we should use:
Disallow: /*testpage$
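The prefix behaviour described in this section is easy to reproduce in Python, since a rule without wildcards is just a comparison against the start of the URL path. In this sketch the function blocked_paths and the paths in site_paths are hypothetical; only /cop and /copier-epson/ come from the example above.

```python
def blocked_paths(rule_path: str, paths: list[str]) -> list[str]:
    """Return the paths covered by a wildcard-free Disallow rule."""
    # Without wildcards, a rule blocks every path that starts with it.
    return [path for path in paths if path.startswith(rule_path)]


# Hypothetical site structure used only for illustration.
site_paths = ["/cop/", "/cop/index.html", "/copier-epson/", "/copper-pans/", "/contact/"]

print(blocked_paths("/cop", site_paths))   # also catches /copier-epson/ and /copper-pans/
print(blocked_paths("/cop/", site_paths))  # only the /cop/ folder and its contents
```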
The Allow directive is the opposite of Disallow and is used exclusively to allow access to specific parts previously blocked by disallow.
Example: it is normal to block the /wp-content/plugins/ folder since we do not want search engines to waste time there, but Google, for example, indicates that it should have access to the .css and .js files.
As these files exist in this folder, we must give permission to crawl them as follows:
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js
Imagine that you want to block the entire blog except one entry; you can apply the following:
Disallow: /blog
Allow: /blog/post-allowed
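How an Allow exception overrides a broader Disallow can be sketched in Python. This example assumes the precedence Google documents, where the longest matching rule wins and a tie goes to Allow; other crawlers may resolve conflicts differently. The function names and the wp_rules and blog_rules lists are illustrative.

```python
import re


def _matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern (with * and optional trailing $) against a path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    return re.match(regex, path) is not None


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; on equal length, Allow wins."""
    best_length, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern or not _matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length, allowed = len(pattern), is_allow
    return allowed


blog_rules = [("Disallow", "/blog"), ("Allow", "/blog/post-allowed")]
wp_rules = [("Disallow", "/wp-admin/"), ("Allow", "/wp-admin/admin-ajax.php")]

print(is_allowed(blog_rules, "/blog/post-allowed"))   # the exception stays crawlable
print(is_allowed(blog_rules, "/blog/another-post"))   # the rest of the blog is blocked
print(is_allowed(wp_rules, "/wp-admin/admin-ajax.php"))
```

The Allow rule wins in both cases because it is longer, and therefore more specific, than the Disallow rule that also matches.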
How to optimize the robots.txt to the maximum
There is no fixed rule, and you have to be careful when replicating the robots.txt of other websites as it can be counterproductive.
An example of a standard robots.txt with some rules could be the following:
# Block or allow access to attached content. (If the installation is in /public_html).
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /wp-admin/
# Prevent access to the different feeds generated by the page
Allow: /feed/$
Disallow: /feed
Disallow: /comments/feed
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
# Prevent URLs ending in /trackback/ that serve as Trackback URLs.
Disallow: /*/*/*/trackback/$
# Avoid blocking CSS and JS.
Allow: /*.js$
Allow: /*.css$
# Block all PDFs
Disallow: /*.pdf$
# Block URLs with parameters
Disallow: /*?
# List of bots you should allow.
Allow: /wp-content/uploads/
# List of blocked bots
User-agent: sogou spider
Disallow: /
# Disallow unnecessary pages
User-agent: *
Disallow: /thanks-for-subscribing
# We add an indication of the location of the sitemap
Sitemap: https://website/sitemap_index.xml
Note: You may want to disallow comments, tags, etc. Every website is different, but think about whether you want search engines to waste time crawling them.
Do you need a robots.txt file?
This is the question you can ask yourself.
As I mentioned, having a robots.txt is not essential for small sites, although my recommendation is to use one, since it can help your rankings.
A well-built robots.txt can help you with:
– Maintenance tasks, by letting you include a Disallow: /
– Prevention of server overload, since there may be fewer requests to unnecessary pages.
– Preventing Google from wasting your crawl budget, increasing the likelihood of longer and better access to relevant pages.
– Avoiding duplicate content. You can prevent pages such as checkout or cart from being crawled in an online store (Disallow: /checkout/ and Disallow: /cart/).
Keep in mind that a robots.txt applies only to the host where it is located; subdomains are independent.
In other words, if you have a subdomain, you have to create a specific robots.txt for that subdomain.
The robots.txt can help a lot with the crawling of your website, but you have to make sure that it works well.
A simple comma or wrong capital letter can do significant SEO damage.
Whether or not it is necessary on every site depends on the case. On small websites with simple architectures, the truth is that search engines crawl them without problems.
There are even important SEO websites that say not to use robots.txt since Google is smart enough to understand a website.
However, I always say that in SEO everything helps, no matter how small.
If you can make Google prioritize and understand your site better, saving it time, my recommendation is that you use a consistent robots.txt without overcomplicating it.
That is the end of this guide on robots.txt. I recommend that you work on your robots.txt and improve crawling!