Google’s John Mueller: “It’s Impossible To Crawl The Whole Web”

Google’s John Mueller explains why it’s impossible to crawl and discover every URL on the web.

In response to a question about why SEO tools don’t show all backlinks, Google’s Search Advocate John Mueller says it’s impossible to crawl the whole web.

The statement came in a Reddit comment, in a thread started by a frustrated SEO professional.

The professional asks why the SEO tool they’re using isn’t finding all the links pointing to their site.

Which tool the person is using isn’t important. As we learn from Mueller, it’s not possible for any tool to discover 100% of a website’s inbound links.

Here’s why.

There’s No Way To Crawl The Web “Properly”

Mueller says there’s no objectively correct way to crawl the web because it contains an effectively infinite number of URLs.

No one has the resources to keep an endless number of URLs in a database, so web crawlers try to determine what’s worth crawling.

As Mueller explains, that inevitably leads to URLs getting crawled infrequently or not at all.

“There’s no objective way to crawl the web properly.

It’s theoretically impossible to crawl it all, since the number of actual URLs is effectively infinite. Since nobody can afford to keep an infinite number of URLs in a database, all web crawlers make assumptions, simplifications, and guesses about what is realistically worth crawling.

And even then, for practical purposes, you can’t crawl all of that all the time, the internet doesn’t have enough connectivity & bandwidth for that, and it costs a lot of money if you want to access a lot of pages regularly (for the crawler, and for the site’s owner).

Past that, some pages change quickly, others haven’t changed for 10 years – so crawlers try to save effort by focusing more on the pages that they expect to change, rather than those that they expect not to change.”
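To make that last point concrete, here’s a minimal sketch of how a crawler might spend a limited crawl budget, assuming it keeps a rough estimate of how often each page usually changes. The page records, scoring rule, and URLs are hypothetical, not how Google or any particular SEO tool actually schedules crawling.

```python
# Illustrative only: a toy crawl scheduler, not Google's or any SEO tool's real logic.
# It ranks a frontier of known URLs by how likely each page is to have changed since
# the last crawl, then spends a fixed crawl budget on the highest-ranked pages.

from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    days_since_last_crawl: int
    avg_days_between_changes: float  # assumed estimate built up from past crawls


def change_likelihood(page: PageRecord) -> float:
    """Rough score: pages overdue relative to their usual change cadence rank higher."""
    return page.days_since_last_crawl / max(page.avg_days_between_changes, 1.0)


def pick_urls_to_crawl(frontier: list[PageRecord], budget: int) -> list[str]:
    """Spend a limited crawl budget on the pages most likely to have changed."""
    ranked = sorted(frontier, key=change_likelihood, reverse=True)
    return [page.url for page in ranked[:budget]]


frontier = [
    PageRecord("https://example.com/news", days_since_last_crawl=1, avg_days_between_changes=0.5),
    PageRecord("https://example.com/about", days_since_last_crawl=30, avg_days_between_changes=365),
    PageRecord("https://example.com/blog", days_since_last_crawl=7, avg_days_between_changes=3),
]

print(pick_urls_to_crawl(frontier, budget=2))
# The frequently changing blog and news pages win the budget;
# the rarely changing "about" page waits for a future crawl.
```

Swap in a different scoring rule and a different set of pages gets crawled first, which is Mueller’s point: there is no single correct answer.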

How Web Crawlers Determine What’s Worth Crawling

Mueller goes on to explain how web crawlers, including search engines and SEO tools, figure out which URLs are worth crawling.

“And then, we touch on the part where crawlers try to figure out which pages are actually useful.

The web is filled with junk that nobody cares about, pages that have been spammed into uselessness. These pages may still regularly change, they may have reasonable URLs, but they’re just destined for the landfill, and any search engine that cares about their users will ignore them.

Sometimes it’s not just obvious junk either. More & more, sites are technically ok, but just don’t reach “the bar” from a quality point of view to merit being crawled more.”

Web Crawlers Work With A Limited Set Of URLs

Mueller concludes his response by saying all web crawlers work on a “simplified” set of URLs.

Since there’s no correct way to crawl the web, as mentioned previously, every SEO tool has its own way of deciding which URLs are worth crawling.

That’s why one tool may discover backlinks that another tool didn’t find.

“Therefore, all crawlers (including SEO tools) work on a very simplified set of URLs, they have to work out how often to crawl, which URLs to crawl more often, and which parts of the web to ignore. There are no fixed rules for any of this, so every tool will have to make their own decisions along the way. That’s why search engines have different content indexed, why SEO tools list different links, why any metrics built on top of these are so different.”
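To show why those decisions produce different backlink reports, here’s a toy example where two hypothetical crawlers apply different quality thresholds to the same small link graph. The source pages, quality scores, and thresholds are invented for illustration; no real tool’s crawl policy is this simple.

```python
# Illustrative only: two hypothetical crawlers apply different "worth crawling" rules
# to the same link graph, so each reports a different set of backlinks to the target.

TARGET = "https://example.com/"

# source page -> (assumed quality score, pages it links to)
link_graph = {
    "https://big-news-site.test/story":   (0.9, [TARGET]),
    "https://small-blog.test/post":       (0.4, [TARGET]),
    "https://spammy-directory.test/page": (0.1, [TARGET]),
}


def discover_backlinks(graph, target, min_quality):
    """Only source pages above this crawler's quality bar ever get crawled,
    so only their links to the target are discovered."""
    return {
        source for source, (quality, links) in graph.items()
        if quality >= min_quality and target in links
    }


# Crawler A crawls broadly; Crawler B sets a higher quality bar.
backlinks_tool_a = discover_backlinks(link_graph, TARGET, min_quality=0.3)
backlinks_tool_b = discover_backlinks(link_graph, TARGET, min_quality=0.8)

print(len(backlinks_tool_a))  # 2 backlinks found
print(len(backlinks_tool_b))  # 1 backlink found
```

Same web, same target page, different crawl policies: each tool ends up with a different backlink count, and neither is “wrong.”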


Source: Reddit

Featured Image: rangizzz/Shutterstock
