Sites with hundreds of thousands, or even millions, of pages may never get all of their pages indexed. The reason is a concept sometimes referred to as “crawl equity”: because crawling a site with millions of pages consumes a significant amount of search engine bandwidth, only a portion of those pages is likely to be crawled and indexed.
Nowadays, search engine optimization is not just about slapping some keywords on a page and getting a bunch of inbound links. It has grown in complexity as the web expands, more information is stored online, and algorithms get more sophisticated. With this understanding, it is critical to pay just as much (if not more) attention to the back end as to the front end, at least with large sites. Richard Baxter of SEOgadget just put out a great post on the role of structured markup in the future of SEO. Just as Baxter believes (and I agree) that standards and consistency in uniform markup will play a larger role in a search engine’s ability to rank and display relevant content, I believe that optimizing crawl equity is a critical factor in the SEO process.
So what is the goal of crawl equity optimization? To enable search engines to spend less time crawling duplicate content or empty pages and more time crawling and indexing valuable content. Google Webmaster Help has posted some tips, but I’ll break it down here as well.
Burning Crawl Equity – Common Causes
The common causes that force engines to crawl URLs unnecessarily mainly come down to URL structure and infinite spaces. These create duplicate content and URLs that were never intended to be indexed in the first place, ultimately leading engines to exhaust bandwidth trying to crawl them all (a hypothetical illustration follows the list below).
Examples:
- Session IDs
- Sorting parameters
- Login pages
- Contact forms
- Pagination
- Calendars with a “next month” or “previous month” link
- Filtering search results
- Broken relative links
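To make this concrete, here is a hypothetical illustration (the domain, paths, and parameter names are invented) of how a single product page can balloon into many crawlable URLs once sorting parameters and session IDs enter the picture:

```
https://www.example.com/widgets/blue-widget
https://www.example.com/widgets/blue-widget?sort=price
https://www.example.com/widgets/blue-widget?sort=price&order=desc
https://www.example.com/widgets/blue-widget?sessionid=8f2a91c4
https://www.example.com/widgets/blue-widget?sessionid=8f2a91c4&sort=price
```

Every one of these URLs resolves to essentially the same content, but a crawler treats each distinct URL as another page to fetch, and session IDs in particular can multiply this indefinitely.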
Suggested Solutions
By addressing the common sources of wasted crawl equity, you increase the likelihood that the valuable content you actually intend to have indexed does get indexed.
- Remove user-specific details from URLs by storing that information in a cookie and 301 redirecting to the clean source URL.
- Block categories of dynamically generated URLs through robots.txt. Wildcard patterns can handle complex URL strings (see the sketch after this list).
- Nofollow calendar links
- Reduce duplicate content; utilize the rel=canonical tag (example after this list)
- Improve latency by reducing page load time, so crawlers can fetch more pages within the same crawl window
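As a rough sketch of the robots.txt suggestion above (the paths and parameter names are hypothetical, and the right patterns depend entirely on your own URL structure), a wildcard-based block might look like this; note that the major engines support * and $ wildcards in robots.txt rather than full regular expressions:

```
User-agent: *
# Block session-ID and sorting variants of otherwise crawlable pages
Disallow: /*sessionid=
Disallow: /*sort=
# Keep crawlers out of the infinite calendar space
Disallow: /calendar/
```

And the rel=canonical tag on a parameterized duplicate would point back to the preferred URL, for example:

```
<!-- In the <head> of /widgets/blue-widget?sort=price (hypothetical) -->
<link rel="canonical" href="https://www.example.com/widgets/blue-widget" />
```

Test patterns like these before deploying them; an overly broad Disallow rule can just as easily block content you do want crawled.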
Diagnostics
Now this is great information for all to know and address. But how do you know there is an issue in the first place? For starters, Webmaster Tools may show a warning report that Googlebot has found an extremely high number of URLs on your site.
Furthermore, an inurl: search operator coupled with a site: search will also do the trick to help assess the gravity of the situation. In other words, how many pages are contributing to index bloat as a result of filters?
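For example (example.com and the parameter names are placeholders), queries like these surface indexed URLs carrying a session or filter parameter, which you can then compare against the total site: result count:

```
site:example.com inurl:sessionid
site:example.com inurl:sort
```

The rough ratio of those counts to the plain site:example.com count gives a first approximation of how much of the index is taken up by parameterized duplicates.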
Finally, how do you track and trend the success of your efforts? Consider tracking the percentage of indexed pages out of the total pages intended to be indexed on the site over time. You may also track this at a more granular level by calculating the percentage of problematic pages and trending it over time as issues are addressed (a minimal calculation sketch follows).
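As a minimal sketch of that tracking (the counts below are invented; in practice they would come from Webmaster Tools data, site: query estimates, or your own sitemap and log files), the trend calculation itself is simple:

```python
# Trend the share of intended pages that are actually indexed over time.
snapshots = [
    # (date, pages intended for indexing, pages actually indexed)
    ("2012-01-01", 250_000, 140_000),
    ("2012-02-01", 250_000, 168_000),
    ("2012-03-01", 252_000, 199_000),
]

for date, intended, indexed in snapshots:
    index_rate = indexed / intended
    print(f"{date}: {index_rate:.1%} of intended pages indexed")
```

The same loop works for the granular version: swap in the count of known problematic URLs (for example, parameterized duplicates still showing up in the index) and watch that percentage fall as fixes roll out.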
Rachel Andersen works for the Portland-based SEM agency Anvil Media, Inc. She has expertise in all aspects of search engine marketing and specializes in SEO for large sites. Andersen has been responsible for the development and execution of dozens of search and social marketing campaigns during her time with Anvil.