In June 2018, at the SMX Advanced conference in Seattle, I announced that over the next 18 months my team would focus on improving our crawler, Bingbot.
Let me take the opportunity of this post to share our progress and what we have learned on this journey.
Why Optimize Crawling?
First things first, let me explain why search engines must crawl the web and the challenges they are facing.
Bing’s crawler, Bingbot, is a key component of the Bing platform. Bingbot’s main functions are to:
- Download webpages to get the latest content and discover new URLs linked from pages we already know.
- Verify that web documents already indexed are still valid, not dead links, helping to keep the Bing index comprehensive and fresh to answer customer queries with relevant results.
For example, Bing customers searching for the latest space rocket launch can find relevant new webpages only seconds after the launch. To be able to link to these new URLs, we have to discover, select, crawl, process, and then index them.
To discover these new URLs, we have to regularly crawl existing known URLs and monitor them for links to new URLs.
Once a new URL is discovered, we have to crawl it to get its content.
We then have to keep crawling these newly indexed URLs regularly to check for potential content changes and to verify that the webpages are still valid, not dead links.
In other words, we crawl each URL in our system more than once.
Keeping Bing’s index fresh and comprehensive is a fascinating challenge for two reasons:
Large Scale
The world wide web is huge and keeps growing at a rapid pace. Every day, my team discovers more than 100 billion URLs we have never seen before, even after ignoring useless URL parameters.
While many of these new URLs are useless, some are great URLs with relevant content for our Bing customers.
Which URLs should be fetched or should not be fetched?
Diversity
Websites are:
- Built on diverse content management systems including custom solutions.
- Hosted on diverse web hosting companies and content delivery networks.
- Managed by diverse people having different goals related to search engines.
How should each case be handled?
We’ve heard occasional concerns from website owners that Bingbot doesn’t crawl their site frequently and fast enough.
We’ve also heard concerns that sometimes Bingbot crawls websites too often.
Crawling right is a fascinating engineering problem that hasn’t been fully solved yet, so we are focused on improving it and solving it globally.
What Are We Optimizing?
Before drilling into what my team is doing to improve our crawler, let me share the key metrics we are optimizing.
To satisfy the need for content freshness and comprehensiveness, my team at Bing must have an effective and efficient crawl scheduling policy that obeys websites’ download constraints. An efficient solution that can:
- Scale and handle the diversity of hundreds of millions of hosts and the billions of webpages that Bingbot crawls daily.
- Satisfy all actors – webmasters, websites, and content management systems – while handling site downtimes and ensuring that we aren’t crawling too frequently.
Our crawler’s performance can be measured via three core metrics:
Crawl Effectiveness
Every page in Bing’s index should be a fresh copy of its web version. Webpages change more often than most webmasters think:
- Prices of products sold may change daily.
- Weather pages generally change once a day for every city in the world.
- Copyright dates change every year.
- Interstitial ads may inject HTML within the page.
- “Time in Seattle” webpages change every second.
- Changes in schema content are not visible to a website visitor’s eyes!
Distinguishing meaningful content changes is not as easy as people think.
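To make this concrete, here is a minimal sketch (not Bing’s actual pipeline) of one naive way to tell meaningful changes from cosmetic ones: normalize away volatile fragments such as clock times, copyright years, and injected scripts before hashing the page. The normalization rules and function names are assumptions for illustration only.

```python
import hashlib
import re

def content_fingerprint(html: str) -> str:
    """Hash a page after stripping volatile fragments, so cosmetic changes
    (clocks, copyright years, injected ad markup) don't register as edits.
    The patterns below are illustrative, not exhaustive."""
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S | re.I)  # injected ad/tracking scripts
    text = re.sub(r"\b\d{1,2}:\d{2}(:\d{2})?\b", "", text)              # clock times
    text = re.sub(r"(?i)copyright\s+\d{4}", "copyright", text)          # copyright year
    text = re.sub(r"\s+", " ", text).strip()                            # whitespace noise
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_meaningful_change(old_html: str, new_html: str) -> bool:
    """True only when the normalized content differs."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```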
Crawl Efficiency
We want to crawl only URLs that are new or updated (fresh on-page content or useful outbound links).
Ideally, we crawl a new URL as soon as its content goes live, and we crawl it again only when the content of the webpage is updated or when it becomes a dead link or a redirect.
Unfortunately, we have limited to no signals about content changes on some sites. On these sites, we crawl blindly just to find out whether the content has changed.
Obey Website Politeness Constraints
We never crawl more often than webmasters want.
The problem is that website owners have different SEO needs and engage more or less with search engines.
While some site owners inform Bing about their daily crawl quotas via Bing Webmaster Tools, most sites do not, so the search engine is forced to guess the quota to allot.
Speaking with webmasters, we observed that their needs differ.
Some ask us to crawl all their pages daily to ensure we have the latest content all the time, while others request that we crawl only updated content.
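As a toy illustration of this constraint (not how Bingbot is implemented), a crawler can keep a per-site daily budget, using the webmaster-supplied quota when one exists and a conservative guess otherwise. The class, default value, and reset logic below are hypothetical.

```python
import datetime

DEFAULT_DAILY_QUOTA = 1000  # hypothetical conservative guess when no quota is declared

class HostCrawlBudget:
    """Tracks how many fetches remain for one site today."""

    def __init__(self, declared_quota=None):
        # Prefer the quota the site owner declared; otherwise fall back to a guess.
        self.daily_quota = declared_quota or DEFAULT_DAILY_QUOTA
        self.day = datetime.date.today()
        self.used = 0

    def allow_fetch(self) -> bool:
        today = datetime.date.today()
        if today != self.day:              # a new day resets the budget
            self.day, self.used = today, 0
        if self.used >= self.daily_quota:  # politeness: never exceed the quota
            return False
        self.used += 1
        return True
```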
How Are We Optimizing?
The challenge for Bingbot is that it can’t fetch webpages only once.
As I mentioned above, once a page is published, we have to fetch it regularly to discover whether the content has been updated and to verify that it is not a dead link.
Defining what and when to fetch next is the problem we are looking at optimizing with your help.
As computers make excellent and efficient servants, we are leveraging them to model what and when to crawl URLs.
But as we do not wish to rely fully on computers, webmasters and my team keep the final control over how many URLs per day we can crawl on a site.
Our default crawl policy is to be as polite as possible when crawling the web.
To optimize we focus our investment in two areas:
Identifying Patterns to Allow Bingbot to Reduce Crawl Frequency
On most sites, while new webpages may be posted daily and some pages are updated on a regular basis, most of the content is not edited for months or even years.
Sites increase in size with new and updated content, without substantially changing the content of the previous pages.
Better modeling and understanding of content changes per site is one of my team’s core goals. We’ve already improved on many sites and far more improvements are coming.
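One common way to model this (a sketch of the general idea, not Bing’s scheduling algorithm) is an adaptive revisit interval: back off when a page keeps coming back unchanged, and revisit sooner once it does change. The bounds and factors below are made-up values for illustration.

```python
MIN_INTERVAL_HOURS = 6        # hypothetical lower bound
MAX_INTERVAL_HOURS = 24 * 90  # hypothetical upper bound (90 days)

def next_crawl_interval(current_hours: float, changed: bool) -> float:
    """Adapt the revisit interval based on what the last fetch observed."""
    if changed:
        # Content changed: come back sooner to catch future edits quickly.
        return max(MIN_INTERVAL_HOURS, current_hours / 2)
    # Content unchanged: back off to avoid wasted fetches.
    return min(MAX_INTERVAL_HOURS, current_hours * 2)
```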
Leveraging Webmaster Hints
Even when we leverage sources such as feeds (Atom, RSS) and sitemaps to discover new and updated URLs, we still need to pull these sources often to discover new URLs – when often, nothing has changed.
Also, we just announced the ability to get webmasters’ content indexed fast by submitting up to 10,000 URLs per day to Bing.
This is a significant increase in the number of URLs a webmaster can submit per day to get their content crawled and indexed, and it is a powerful signal that lets us throttle crawling on sites that adopt it.
If you tell us about each change, you limit our need to crawl just to discover those changes, and you get your content indexed faster.
So we encourage everyone to integrate the Bing Webmaster APIs, preferably into your content management system, to tell us in real time about your content changes and avoid having crawlers waste fetches on content that didn’t change. Yoast has announced its support for this API.
Star Trek’s Spock said that one can begin to reshape a landscape with a single flower. I do believe that this URL submission API is the right step to trigger a reshaping of the crawl landscape, moving the industry ahead and saving the earth from crawl global warming.
You can test the Submit URL API in two easy steps:
Step 1: Get your Bing Webmaster tools API ID for your site.
Step 2: Submit new URLs for your site.
Here is an example using wget. Replace the apikey with your API ID, the siteUrl with your site’s URL, and the bing.com URLs with URLs from your site.
wget.exe "https://ssl.bing.com/webmaster/api.svc/pox/SubmitUrl?apikey=7737def21c404dcdaf23bea715e61436" --header="Content-Type: application/xml; charset=utf-8" --post-data="<SubmitUrl xmlns=\"http://schemas.datacontract.org/2004/07/Microsoft.Bing.Webmaster.Api\"><siteUrl>http://www.bing.com</siteUrl><url>http://www.bing.com/fun/?test=test</url></SubmitUrl>"
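If you prefer to script the call, here is an equivalent sketch in Python using the requests library; it reproduces the same endpoint, header, and XML body shown in the wget example above, with placeholder values for the API key and URLs.

```python
import requests

API_KEY = "YOUR_API_KEY"                        # from Bing Webmaster Tools (Step 1)
SITE_URL = "http://www.bing.com"                # replace with your verified site
NEW_URL = "http://www.bing.com/fun/?test=test"  # replace with a URL from your site

body = (
    '<SubmitUrl xmlns="http://schemas.datacontract.org/2004/07/Microsoft.Bing.Webmaster.Api">'
    f"<siteUrl>{SITE_URL}</siteUrl><url>{NEW_URL}</url></SubmitUrl>"
)

response = requests.post(
    f"https://ssl.bing.com/webmaster/api.svc/pox/SubmitUrl?apikey={API_KEY}",
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/xml; charset=utf-8"},
)
response.raise_for_status()  # raises if the submission was rejected
print("Submitted:", NEW_URL)
```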
For sites not yet able to adopt the Submit URL API, we will continue leveraging and improving crawl scheduling based on existing content signals to learn about content changes.
One best practice is to have a sitemap listing all the relevant URLs on your site and to refresh it at least once a day, as well as RSS feeds listing your new URLs and URLs with updated content.
We also recommend that you submit your sitemaps and RSS once in Bing Webmaster tools to make sure we are aware of them and to check your analytics as new URLs are discovered.
Once submitted, we will monitor them regularly (in most cases, at least once a day) moving forward.
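To illustrate how such hints can be consumed on the crawler side (a sketch, not Bingbot’s code), the snippet below reads a sitemap and keeps only the URLs whose lastmod is newer than the previous visit; the function name and date handling are simplified assumptions.

```python
import datetime
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def recently_updated_urls(sitemap_url: str, last_visit: datetime.datetime):
    """Return sitemap URLs whose <lastmod> is newer than the previous visit."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    updated = []
    for entry in root.iter(SITEMAP_NS + "url"):
        loc = entry.findtext(SITEMAP_NS + "loc")
        lastmod = entry.findtext(SITEMAP_NS + "lastmod")
        if loc and lastmod:
            # <lastmod> may be a date or a full W3C datetime; keep only the date part here.
            modified = datetime.datetime.fromisoformat(lastmod[:10])
            if modified > last_visit:
                updated.append(loc)
    return updated
```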
Conclusion
While we are making progress, our work on improving crawler efficiency is not yet done.
We still have plenty of opportunities ahead of us to continue improving our crawler’s efficiency and capabilities across the hundreds of different types of data used in our crawl scheduling algorithm.