3. Unintentionally blocking unrelated pages
Check out the following example:
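Imagine that a robots.txt contains something along these lines (a minimal illustration; the “User-agent: *” line and the exact rule are only examples):

    User-agent: *
    Disallow: /custom

Because robots.txt rules are matched as prefixes, this rule blocks not only “/custom” but also every URL that merely starts with it, such as “/customized-email-templates”. To block only the exact “/custom” path, you would end the rule with a “$” sign:

    Disallow: /custom$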
The “$” sign is a so-called end-of-string operator, which basically tells the spider that “this URL ends here”. As a result, the given directive will match “/custom”, but not “/customized-email-templates”.
The worst thing about this type of mistake is that it usually goes unnoticed for a very long time. Meanwhile, as far as the crawlers are concerned, the page with your customized email templates simply won’t exist anymore…
4. Using incorrect letter case in URL paths
URL paths are case-sensitive! “Disallow: /Temp” will not block “/temp” or “/TEMP”. If you have made the big mistake of using similar filenames or a confusing directory structure, you’ll have to block each of those pages or folders with its own “Disallow:” line.
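For example, assuming (purely for illustration) that the conflicting folders are literally named “/Temp”, “/temp” and “/TEMP”, the rules would look something like this:

    User-agent: *
    Disallow: /Temp
    Disallow: /temp
    Disallow: /TEMP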
5. Forgetting the user-agent directive
It may seem ridiculous, but it happens very often. If there is no “User-agent:” directive before the usual “Disallow:”, “Allow:”, etc. directives, nothing will actually happen!
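Every group of rules has to begin with a “User-agent:” line that tells the crawlers who the rules are meant for. A minimal, correct block looks something like this (the blocked path is just a placeholder):

    User-agent: *
    Disallow: /temp/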
6. Forgetting the slash character
Any URL path must start with a slash character! The “Disallow: any-page” directive won’t block anything. The correct syntax is this: “Disallow: /any-page”.
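In other words (using a placeholder page name):

    # Wrong: this rule won’t block anything
    Disallow: any-page
    # Right
    Disallow: /any-page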
7. Using the robots.txt file to protect sensitive data
This is one of the biggest and most dangerous mistakes, and for obvious reasons it will do much more harm than good. The only reliable way to protect your sensitive content is to use some sort of password-based security solution! If you have any files or directories that must be kept protected and hidden from the public, do not ever just list them in your robots.txt file with some “Disallow:” directives!
Why? Because you are going to give hostile crawlers a precise road-map to find the folders and files that you don’t want them to find! More than that, your robots.txt is publicly accessible! Anybody can – and will – see the things you’ve said you don’t want indexed, simply by typing yourdomain.com/robots.txt into their browser!
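As a rough illustration only: on an Apache server, that kind of password protection can be as simple as dropping an “.htaccess” file into the private directory (the realm name and file paths below are placeholders, and other web servers offer equivalent mechanisms):

    AuthType Basic
    AuthName "Private area"
    AuthUserFile /home/youruser/.htpasswd
    Require valid-user

The matching “.htpasswd” file is created with the “htpasswd” utility (for example, “htpasswd -c /home/youruser/.htpasswd yourname”), so the protected files are never advertised in a publicly readable file.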