Search engines can’t – and won’t – help you to expose your content if your site is not 100% accessible and understandable.
And when we are talking about accessibility, the very first thing to look at is always the robots.txt file.
First of all, let's see ...
What is the purpose of the robots.txt file?
When your website is indexed by the search engines, it’s basically crawled by robot programs called bots, crawlers or spiders (Googlebot, Bingbot, Yahoo Slurp, etc.) in order to find and categorize all the content on your site. The bots will automatically index whatever they can find and “read”.
If you have any sections or content pieces (for example, expired offers, duplicate content, non-public pages, etc) that you don’t want to get indexed, you’ll have to inform the crawlers about these “banned” areas. In order to do that, you are going to need a so-called robots.txt file.
So, what is the robots.txt file? To put it simply: it’s a plain text document placed in the root of your website that tells the search engine crawlers what they can and what they cannot crawl and index on your website. Additionally, if you want to save some bandwidth, you can use the robots.txt file to exclude JavaScript files, stylesheets or certain images from crawling.
When the spiders visit your site, the very first thing they do is check the existence and the content of your robots.txt file. If you have created a robots.txt file with your own rules, the crawlers will listen to your requests and won’t crawl the pages that you have disallowed. In theory, you could also use the robots meta-tag to keep the spiders away from certain files, pages, folders, etc., but not all search engines read meta-tags, so it’s always better to use the robots.txt file.
As I already said, the robots.txt file must be placed in the main root directory of your website. The spiders won’t search your site to find a document with that name. If they can’t find it in the main directory (www.yourdomain.com/robots.txt) they will simply assume that your site doesn’t have a robots.txt file, and as a result they will index everything along their way.
Also, you need to know that WordPress automatically creates a virtual robots.txt file, and even if you can't see it (for example, in an FTP application), the bots will find it.
And you can easily access it and view its content by simply entering the following in the address bar of your browser (don't forget to edit the yourdomain part!):
www.yourdomain.com/robots.txt
And of course, AIO SEO and Yoast SEO (or any reliable SEO plugin) will allow you to edit the content of the robots.txt file.
But here's the golden rule! If you don't have crawling errors, recent traffic issues, etc., it's always better not to alter the original content of the file. Especially if you are not 101% sure about what you are doing ...
With that said ...
The structure and the syntax of a robots.txt file are extremely simple. Basically, it’s a simple list containing pairs of user-agents (crawlers) and disallowed or allowed files or directories.
Here's a typical, "recommended" robots.txt file template (again, don't forget to edit the yourdomain part):
User-Agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/plugins/
Sitemap: https://yourdomain.com/sitemap.xml
In addition to the “User-agent:”, “Disallow:” and “Allow:” directives, you can include any comments you want by putting the “#” sign at the beginning of the given line. Technically speaking, the user-agent can be any party that requests web pages, including command line utilities, web browsers and, of course, search engine spiders. If the “User-agent:” directive is followed by the wildcard operator – “*” – the given rule will apply to all crawlers.
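For instance, a commented file that targets one specific crawler might look like this (a minimal sketch – the “/test-area/” path is just a hypothetical placeholder):
# This rule applies only to Google's crawler
User-agent: Googlebot
Disallow: /test-area/
# This rule applies to every other crawler (the "*" wildcard) and blocks nothing
User-agent: *
Disallow: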
Finally, let's see ...
The 8 most common robots.txt mistakes that can ruin your SEO efforts
As you have seen, the structure and the syntax of the robots.txt file are ridiculously simple, but you should always keep in mind that you are using a double-edged sword!
One single wrongly used wildcard operator can keep all the search engines away from your website, so pay close attention to every single character included in your robots.txt file, and don’t try to outsmart the search engines with uncommon directives or weird and shady practices learned from various forums, YouTube, etc. Believe me, I have seen many, many webmasters and site owners pulling their hair out after trying to enhance their robots.txt with some awesome new directives or smart-aleck practices.
In order to keep you away from some potential epic disasters, I have made a small collection of the most common robots.txt mistakes. If you want to do SEO successfully for a WordPress website, you’ll have to pay close attention to these issues, otherwise your entire website could simply vanish from the search results.
1. Using the robots.txt file inadequately
Your file MUST be called “robots.txt” all in lower-case, and you must use a new line for every single new instruction!
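A quick illustration, reusing the plugins directory from the template above:
# Wrong – two instructions crammed onto the same line
User-agent: * Disallow: /wp-content/plugins/
# Right – every instruction gets its own line
User-agent: *
Disallow: /wp-content/plugins/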
Also, a robots.txt file placed in a sub-directory will not work at all. It will be completely ignored. If you don’t have access to the root directory and you want to block some pages in a sub-directory – for example, your own individual directory on a membership site – you’ll have to use other means, such as robot meta-tags or an “.htaccess” file.
2. Targeting subdomains
Let’s assume that you have a website with multiple subdomains, for example www.yourdomain.com and blog.yourdomain.com.
Creating one “main” robots.txt file uploaded to the root folder of your website (www.yourdomain.com/robots.txt) in order to block the subdomains won’t work. A “Disallow: blog.yourdomain.com” directive included in the “main” robots.txt file will have no effect.
In plain English: in order to block some subdomains – and not others! – you are going to need different robots.txt files served from the different subdomains.
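As a sketch – assuming you want to block a hypothetical blog.yourdomain.com while keeping the main site open – the two files could look like this:
# Served from blog.yourdomain.com/robots.txt – blocks the whole blog subdomain
User-agent: *
Disallow: /
# Served from www.yourdomain.com/robots.txt – leaves the main site fully crawlable
User-agent: *
Disallow: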
3. Using incorrect type case in URL paths
The URL paths are case sensitive!
“Disallow: /Temp” will not block “/temp” or “/TEMP”.
If you have made the big mistake of using similar filenames or a confusing directory structure, you’ll have to block those pages or folders one by one, using a separate “Disallow:” line for each.
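For example (assuming all three variants actually exist on your server):
User-agent: *
Disallow: /Temp
Disallow: /temp
Disallow: /TEMP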
4. Forgetting the user-agent directive
It may seem ridiculous, but it happens very often. If there is no user-agent directive before the usual “Disallow:”, “Allow:”, etc. directives, nothing will actually happen!
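A quick before-and-after sketch (the “/private/” path is just a placeholder):
# Wrong – no user-agent, so the rule will be ignored
Disallow: /private/
# Right – the rule is explicitly tied to all user-agents
User-agent: *
Disallow: /private/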
5. Forgetting the slash character
Any URL path must start with a slash character! The “Disallow: any-page” directive won’t block anything. The correct syntax is this: “Disallow: /any-page”.
6. Using the robots.txt file to protect sensitive data
This is one of the biggest and most dangerous mistakes, which for obvious reasons will do much more harm than good. The only reliable way to protect your sensitive stuff is to use some sort of password-based security solution!
If you have any files or directories that must be kept protected and hidden from the public, do not ever just put them in your robots.txt file with some “Disallow:” directives!
Why? Because you are going to give hostile crawlers a precise road-map to find the folders and files that you don’t want them to find! More than that, your robots.txt is publicly accessible! Anybody can – and will – see the things you’ve said you don’t want indexed, simply by typing yourdomain.com/robots.txt into their browser!
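Just to show why this backfires, here is a hypothetical counter-example (do not do this):
# This line tells every visitor – including hostile bots – exactly where the sensitive files live
User-agent: *
Disallow: /customer-invoices/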
7. Trying to block hostile crawlers or user-agents
I just used the term “hostile crawlers”. As I said earlier, a user-agent can be any party that requests web pages, and sadly, that includes email extractors, data scrapers, web harvesters and the like.
Using robots.txt directives to block such hostile crawlers – for example, “User-agent: EmailSiphon” & “Disallow: /” – has become quite a common practice, but the truth is that it has no real effect, because compliance with the robots.txt file is strictly voluntary.
All those directives are simple guidelines, nothing more. In other words, the crawlers are under no obligation to follow your rules. The search engine spiders and other “polite” crawlers will obey, but the malicious, hostile crawlers will simply ignore them. If you really want to block some nasty crawlers, you are going to need a better solution (for example, IP-blocking).
8. Using competing directives
Sounds confusing? It’s pretty simple: as you have already seen in the template above, a given “User-agent:” directive can be followed by both a “Disallow:” and an “Allow:” directive. And it’s very logical, because the “Allow:” directive is used to specify an exception to a “Disallow:” rule. In the template we used the “Allow:” directive to unblock a certain file (admin-ajax.php) inside a blocked directory.
And what if a given URL matches both of those directives? Well, the truth is that not all spiders handle these competing directives exactly the same way. Google gives priority to the directive whose URL path is longer in terms of character count.
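A hypothetical illustration of that longest-path rule (the “some-plugin” path is just a placeholder):
User-agent: *
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/some-plugin/script.js
Both directives match the URL /wp-content/plugins/some-plugin/script.js, but the “Allow:” path is longer, so Google will crawl that file while the rest of the plugins directory stays blocked.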
Wrapping it up
I will say it again: the structure and the syntax of a robots.txt file are simple, but serious and dangerous mistakes can still be made very easily. If you want to do SEO successfully for a WordPress website, you should pay close attention to the above aspects, and additionally, I strongly recommend using a checker or a validator before you make any changes.
You can use ...
Google Search Console
https://www.google.com/webmasters/tools/robots-testing-tool
Or ...
SEOChat
http://tools.seochat.com/tools/robots-txt-validator#sthash.7...
And of course, if you want, you can use online tools even to create the robots.txt file. Here is a simple, clean solution provided by InternetMarketingNinjas:
https://www.internetmarketingninjas.com/seo-tools/robots-txt...
OK. Now that you already know what a robots.txt file is (and how easy it is to completely remove your website from the search results with one simple directive), we can move forward to discover another vital element: the sitemap.