8. Trying to block hostile crawlers or user-agents

I just used the term “hostile crawlers”. As I said earlier, a user-agent can be any party that requests web pages, and sadly, these can include email extractors, data scrapers, web harvesters, and so on.

Using robots.txt directives to block such hostile crawlers – for example, “User-agent: EmailSiphon” & “Disallow: /” – has become quite common practice, but the truth is that it has no effect at all, because the robots.txt file is strictly voluntary. All those directives are simple guidelines, nothing more. In other words, the crawlers are under no obligation to follow your rules. Search engine spiders and other “polite” crawlers will obey, but malicious, hostile crawlers will simply ignore it. If you really want to block some nasty crawlers, you are going to need a better solution (for example, IP-blocking at the server level).
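The kind of rule in question looks like this:

User-agent: EmailSiphon
Disallow: /

A polite crawler that identifies itself as EmailSiphon would honour it, but a hostile one will simply never read the file, or will read it and deliberately ignore it.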

9. Using competing directives

Sounds confusing? It’s pretty simple: as you have already seen in a previous example, a given “User-agent:” directive can be followed by both a “Disallow:” and an “Allow:” directive. And that’s perfectly logical, because the “Allow:” directive is used to specify an exception to a “Disallow:” rule. In the previously mentioned (third) example we used the “Allow:” directive to unblock a certain page inside a blocked directory.

And what if a given URL matches both of those directives? Well, the truth is that not all spiders handle these competing directives in exactly the same way. Google gives priority to the directive whose URL path is longer in terms of character count. Let’s see an example:
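Suppose your robots.txt contains this pair of rules (the directory names are purely illustrative):

User-agent: *
Disallow: /services
Allow: /services/free-trial

A URL like yourdomain.com/services/free-trial matches both directives, but the “Allow:” path is 20 characters long while the “Disallow:” path is only 9, so Google follows the longer (more specific) “Allow:” rule and the page can be crawled.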

And another one:
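This time the “Disallow:” path is the longer one – again, the directory names are just illustrative:

User-agent: *
Allow: /blog
Disallow: /blog/private-notes

A URL like yourdomain.com/blog/private-notes matches both directives, but now the “Disallow:” path wins (19 characters versus 5), so Google keeps the page blocked.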

It’s really not that complicated, is it?

10. Matching the “$” sign

Let’s assume that you need to block any URL that contains the “$” character. For example: yourdomain.com/services/monthly-subscription?price=$20. Your first idea would probably be something like this: “Disallow: /*$”.

That would be an extremely bad decision, because the above rule will actually block everything on your site! As I said earlier, the “$” character is used as an end-of-string operator, so the directive above matches any URL whose path starts with a slash followed by zero or more characters. In other words, it is a rule that matches every valid URL, so the directive will block literally everything.

Relax, the solution is very simple. All you have to do is use an additional asterisk after the dollar sign: “Disallow: /*$*”. This way the dollar sign is no longer at the end of the pattern, so it loses its special meaning and is treated as a literal “$” character.
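To put the wrong and the right rule side by side:

Disallow: /*$
Disallow: /*$*

The first one blocks everything, because the trailing “$” simply marks the end of a pattern that already matches every URL. The second one blocks only the URLs that actually contain a “$” character, such as the subscription page from the example above.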

Wrapping it up

I will say it again: the structure and syntax of a robots.txt file are simple, but serious and potentially damaging mistakes can still be made very easily.


If you want to do successful SEO for a WordPress website, you should pay close attention to the above aspects, and additionally, I strongly recommend using a checker or validator before you hit the upload button. There are many free robots.txt checkers and validators out there. For example:


http://tools.seochat.com/tools/robots-txt-validator


And that's it, my friends!

If you have any comments, further questions or update requests, please don't hesitate to react! Like, comment and share!


