8. Trying to block hostile crawlers or user-agents
I just used the term “hostile crawlers”. As I said earlier, a user-agent can be any party that requests web pages, and sadly, these can also include email extractors, data scrapers, web harvesters and the like.
Using robots.txt directives to block such hostile crawlers – for example, “User-agent: EmailSiphon” & “Disallow: /” – has become quite a common practice, but the truth is that it has no effect at all, because compliance with robots.txt is strictly voluntary. All those directives are simple guidelines, nothing more. In other words, the crawlers are under no obligation to follow your rules. The search engine spiders and other “polite” crawlers will obey them, but malicious, hostile crawlers will simply ignore them. If you really want to block some nasty crawlers, you are going to need a better solution – for example, IP-blocking at the server level, as sketched below.
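To illustrate, here is a minimal sketch of what IP-blocking could look like on an Apache server (the setup most WordPress sites run on) via the .htaccess file. The Apache 2.4 syntax shown here and the IP address are only placeholders for illustration – substitute the address of the crawler you have actually identified as abusive:

# .htaccess (Apache 2.4) – deny one abusive IP at the server level
<RequireAll>
    # Let everyone else in
    Require all granted
    # 203.0.113.42 is a placeholder documentation address – replace it with the real offender
    Require not ip 203.0.113.42
</RequireAll>

Unlike a robots.txt directive, this rule is enforced by the server itself, so the crawler has no say in whether it is obeyed.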
9. Using competing directives
Sounds confusing? It’s pretty simple: as you have already seen in a previous example, a given “User-agent:” directive can be followed by both a “Disallow:” and an “Allow:” directive. And it’s perfectly logical, because the “Allow:” directive is used to specify an exception to a “Disallow:” rule. In the previously mentioned (third) example we used the “Allow:” directive to unblock a certain page inside a blocked directory.
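As a quick reminder, such a pair of directives might look like this (the paths here are purely illustrative):

User-agent: *
Disallow: /private/
Allow: /private/read-me-first.html

Everything under /private/ is blocked, except the single page named in the “Allow:” line.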
But what if a given URL matches both of those directives? Well, the truth is that not all spiders handle such competing directives in exactly the same way. Google gives priority to the directive whose URL path is longer in terms of character count. Let’s see an example:
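Suppose the file contains these two rules (again, the paths are just illustrative):

User-agent: *
Disallow: /blog
Allow: /blog/robots-txt-guide

For the URL yourdomain.com/blog/robots-txt-guide both rules match, but Google applies the “Allow:” directive, because its path (/blog/robots-txt-guide, 22 characters) is longer than the Disallow path (/blog, 5 characters), so the page stays crawlable.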
And another one:
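This time the Disallow path is the longer one (illustrative paths once more):

User-agent: *
Allow: /downloads
Disallow: /downloads/internal-report.pdf

Here the “Disallow:” path is the longer of the two (/downloads/internal-report.pdf, 30 characters, versus /downloads, 10 characters), so that particular PDF is blocked, while the rest of the /downloads section remains crawlable.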
It’s really not that complicated, is it?
10. Matching the “$” sign
Let’s assume that you need to block any URL that contains the “$” character – for example: yourdomain.com/services/monthly-subscription?price=$20. Your first idea would probably be something like this: “Disallow: /*$”.
That would be an extremely bad decision, because the above rule will actually block everything on your site! As I said earlier, the “$” character acts as an end-of-string operator, so the above directive matches any URL whose path starts with a slash and is followed by zero or more characters before it ends. In other words, it is a rule that applies to every valid URL, so the directive will block literally everything.
Relax, the solution is very simple. All you have to do is use an additional asterisk after the dollar sign: “Disallow: /*$*”. This way the dollar sign is no longer at the end of the pattern, so it loses its special meaning.
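Side by side, the two variants look like this:

# Blocks every URL on the site – the $ acts as the end-of-URL anchor:
Disallow: /*$

# Blocks only URLs that actually contain a $ character:
Disallow: /*$*

With the second rule, only URLs such as /services/monthly-subscription?price=$20 are blocked, while the rest of the site stays crawlable.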
Wrapping it up
I will say it again: the structure and the syntax of a robots.txt file are simple, but serious and dangerous mistakes can still be made very easily.
If you want your WordPress website’s SEO to succeed, you should pay close attention to the above points, and additionally, I strongly recommend using a checker or a validator before you hit the upload button. There are many free robots.txt checkers & validators out there. For example:
http://tools.seochat.com/tools/robots-txt-validator
And that's it, my friends!
If you have any comments, further questions or update requests, please don't hesitate to react! Like, comment and share!