8. Trying to block hostile crawlers or user-agents
I just used the term “hostile crawlers”. As I said earlier, a user-agent can be any party that requests web pages, and sadly, these can also include email extractors, data scrapers, web harvesters and the like.
Using robots.txt directives to block such hostile crawlers – for example, “User-agent: EmailSiphon” & “Disallow: /” – has become quite a common practice, but the truth is that it has no effect at all, because compliance with robots.txt is strictly voluntary. All those directives are simple guidelines, nothing more. In other words, the crawlers are under no obligation to follow your rules. The search engine spiders and other “polite” crawlers will obey them, but malicious, hostile crawlers will simply ignore them. If you really want to block some nasty crawlers, you are going to need a better solution – for example, IP-blocking at the server level, as sketched below.
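To illustrate, here is a minimal sketch of what IP-blocking could look like on an Apache server (the setup most WordPress sites run on) via the .htaccess file. The Apache 2.4 syntax shown here and the IP address are only placeholders for illustration – substitute the address of the crawler you have actually identified as abusive:

# .htaccess (Apache 2.4) – deny one abusive IP at the server level
<RequireAll>
    # Let everyone else in
    Require all granted
    # 203.0.113.42 is a placeholder documentation address – replace it with the real offender
    Require not ip 203.0.113.42
</RequireAll>

Unlike a robots.txt directive, this rule is enforced by the server itself, so the crawler has no say in whether it is obeyed.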
9. Using competing directives
Sounds confusing? It’s pretty simple: as you have already seen in a previous example, a given “User-agent:” directive can be followed by both a “Disallow:” and an “Allow:” directive. And it’s perfectly logical, because the “Allow:” directive is used to specify an exception to a “Disallow:” rule. In the previously mentioned (third) example we used the “Allow:” directive to unblock a certain page inside a blocked directory.
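As a quick reminder, such a pair of directives might look like this (the paths here are purely illustrative):

User-agent: *
Disallow: /private/
Allow: /private/read-me-first.html

Everything under /private/ is blocked, except the single page named in the “Allow:” line.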
But what if a given URL matches both of those directives? Well, the truth is that not all spiders handle such competing directives in exactly the same way. Google gives priority to the directive whose URL path is longer in terms of character count. Let’s see an example:
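Suppose the file contains these two rules (again, the paths are just illustrative):

User-agent: *
Disallow: /blog
Allow: /blog/robots-txt-guide

For the URL yourdomain.com/blog/robots-txt-guide both rules match, but Google applies the “Allow:” directive, because its path (/blog/robots-txt-guide, 22 characters) is longer than the Disallow path (/blog, 5 characters), so the page stays crawlable.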
And another one:
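This time the Disallow path is the longer one (illustrative paths once more):

User-agent: *
Allow: /downloads
Disallow: /downloads/internal-report.pdf

Here the “Disallow:” path is the longer of the two (/downloads/internal-report.pdf, 30 characters, versus /downloads, 10 characters), so that particular PDF is blocked, while the rest of the /downloads section remains crawlable.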
It’s really not that complicated, is it?
10. Matching the “$” sign
Let’s assume that you need to block any URL that contains the “$” character – for example: yourdomain.com/services/monthly-subscription?price=$20. Your first idea would probably be something like this: “Disallow: /*$”.
That would be an extremely bad decision, because the above rule will actually block everything on your site! As I said earlier, the “$” character acts as an end-of-string operator, so the above directive matches any URL whose path starts with a slash and is followed by zero or more characters before it ends. In other words, it is a rule that applies to every valid URL, so the directive will block literally everything.
Relax, the solution is very simple. All you have to do is use an additional asterisk after the dollar sign: “Disallow: /*$*”. This way the dollar sign is no longer at the end of the pattern, so it loses its special meaning.
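Side by side, the two variants look like this:

# Blocks every URL on the site – the $ acts as the end-of-URL anchor:
Disallow: /*$

# Blocks only URLs that actually contain a $ character:
Disallow: /*$*

With the second rule, only URLs such as /services/monthly-subscription?price=$20 are blocked, while the rest of the site stays crawlable.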
Wrapping it up
I will say it again: the structure and the syntax of a robots.txt file are simple, but serious and dangerous mistakes can still be made very easily.
If you want your WordPress website’s SEO to succeed, you should pay close attention to the above points, and additionally, I strongly recommend using a checker or a validator before you hit the upload button. There are many free robots.txt checkers & validators out there. For example:
http://tools.seochat.com/tools/robots-txt-validator
And that's it, my friends!
If you have any comments, further questions or update requests, please don't hesitate to react! Like, comment and share!