Google’s Gary Illyes confirmed a common observation that robots.txt has limited control over unauthorized access by crawlers. Gary then offered an overview of access controls that all SEOs and website owners should know.
Microsoft Bing’s Fabrice Canel commented on Gary’s post, affirming that Bing encounters websites that try to hide sensitive areas with robots.txt, which has the inadvertent effect of exposing those sensitive URLs to hackers.
Canel commented:
“Indeed, we and other search engines frequently encounter issues with websites that directly expose private content and attempt to conceal the security problem using robots.txt.”
Common Argument About Robots.txt
Seems like any time the topic of robots.txt comes up, there’s always that one person who has to point out that it can’t block all crawlers.
Gary agreed with that point:
“‘robots.txt can’t prevent unauthorized access to content,’ a common argument popping up in discussions about robots.txt nowadays; yes, I paraphrased. This claim is true, however I don’t think anyone familiar with robots.txt has claimed otherwise.”
Next he took a deep dive into what blocking crawlers really means, framing it as a choice between solutions that keep access control on the server side and solutions that cede that control to the requestor. A request comes in (from a browser or a crawler), and the server can respond in multiple ways.
He listed examples of control:
- Robots.txt (leaves it up to the crawler to decide whether or not to crawl)
- Firewalls (a WAF, or web application firewall, controls access)
- Password protection
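The first option can be illustrated with a minimal robots.txt. Its directives are purely advisory: a well-behaved crawler honors them, while a hostile one can ignore them and fetch the URL anyway. The path below is a placeholder, and the example also shows Canel’s point, since the file is publicly readable and advertises the very URL it tries to conceal:

```
# robots.txt — advisory only; compliant crawlers honor it, hostile ones ignore it
User-agent: *
Disallow: /private/   # hypothetical path; listing it here also reveals that it exists
```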
Here are his remarks:
“If you need access authorization, you need something that authenticates the requestor and then controls access. Firewalls may do the authentication based on IP, your web server based on credentials handed to HTTP Auth or a certificate to its SSL/TLS client, or your CMS based on a username and a password, and then a 1P cookie.
There’s always some piece of information that the requestor passes to a network component that will allow that component to identify the requestor and control its access to a resource. robots.txt, or any other file hosting directives for that matter, hands the decision of accessing a resource to the requestor which may not be what you want. These files are more like those annoying lane control stanchions at airports that everyone wants to just barge through, but they don’t.
There’s a place for stanchions, but there’s also a place for blast doors and irises over your Stargate.
TL;DR: don’t think of robots.txt (or other files hosting directives) as a form of access authorization, use the proper tools for that for there are plenty.”
Related: 8 Common Robots.txt Issues And How To Fix Them
Use The Proper Tools To Control Bots
There are many ways to block scrapers, hacker bots, AI user agents, and search crawlers. Aside from blocking search crawlers, a firewall of some kind is a good solution because firewalls can block by behavior (such as crawl rate), IP address, user agent, and country, among other criteria. Typical solutions operate at the server level with something like Fail2Ban, in the cloud with something like Cloudflare WAF, or as a WordPress security plugin like Wordfence.
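As a sketch of the server-level approach, here is a hypothetical nginx configuration that refuses named user agents outright and throttles crawl rate per IP. The agent names and the rate are illustrative assumptions, not recommendations, and real deployments would tune both:

```nginx
# In the http context: classify user agents (names are placeholder examples)
map $http_user_agent $blocked_agent {
    default        0;
    ~*badbot       1;   # hypothetical scraper
    ~*evilcrawler  1;   # hypothetical scraper
}

# Throttle each client IP to ~2 requests/second (illustrative rate)
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 80;

    location / {
        if ($blocked_agent) { return 403; }   # a hard refusal, unlike robots.txt
        limit_req zone=perip burst=10;        # absorb short bursts, then reject
    }
}
```

Unlike a robots.txt directive, these rules are enforced by the server itself, so the requestor has no say in whether the resource is delivered.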
Read Gary Illyes’ post on LinkedIn:
robots.txt can’t prevent unauthorized access to content
Featured Image by Shutterstock/Ollyy