Understanding TF*IDF: One of Google’s Earliest Ranking Factors

In this Marketing Nerds episode, Brent Csutoras sits down with Marcus Tandler of OnPage.org to talk about TF*IDF, one of Google's earliest ranking factors.

Visit our Marketing Nerds archive to listen to other Marketing Nerds podcasts!

In this week’s episode of Marketing Nerds, I am joined by Marcus Tandler, one of the most accomplished SEOs I know and also the co-founder of OnPage.org, a leading technical SEO tool. We talk about how Marcus got into SEO, share some funny stories, and then dive into what TF*IDF is, how to use it, and why it is so important in SEO today.

Here are a few transcribed excerpts from our discussion, but make sure to listen to the podcast to hear everything:

What is OnPage.org?

OnPage.org is website quality management software and also, right now, the leading technical SEO software. We’ve got about 180,000 users around the world using our software. We’re especially successful here in Germany, where over 50% of the top 100 websites are using our software, and we just started internationalizing a couple of months ago. It’s really about technical SEO. Basically, it’s crawler-based software, and we can crawl really deep.

What are the Key Elements of Page Optimization?

There’s so much that it’s not about individual things. It’s really about being a 100% perfect website. It’s not a technically driven approach anymore, like back in the day. The site has to be crawlable so everybody can get to the page smoothly. You have to have good content. It needs to be fast-loading, so not just a fast server, but optimized for speed from every angle. It also has to be omni-device friendly, so that it’s the same user experience on desktop, mobile, and tablet devices.

There are so many angles that go into this; it’s really about building a 100% perfect website. That’s what it’s about and this is what OnPage.org is helping you with and, as I said, over 180,000 users are, right now, trusting the innovation of our software.

What is TF*IDF and Why Should We Care About It?

Okay, so first of all, TF*IDF has been one of the most dominant topics in the German SEO scene for the past four years, and for one reason: because it works. It’s picking up a little in the US right now. It started with an article from Dr. Pete on the Moz blog, but he was doing a very complicated version with Excel, and it is not very usable. But I think the newest keyword tool in Moz actually offers a little bit of TF*IDF, and Searchmetrics is doing something similar. This has been one of our modules for four years now, so it’s actually a third-generation TF*IDF tool. As I said, it has been a very big topic in Germany for the past four years, and I think a lot of people are using our TF*IDF module for content optimization, so it’s a very popular tool here.

Basically, TF*IDF stands for Term Frequency with Inverse Document Frequency. There’s also a dampening factor in there. It’s a mathematical formula, which is from the ’80s, so it’s really one of the foundations of information retrieval. It’s something very, very old, which can be used to reverse engineer how search engines view content. TF*IDF has been one of the earliest ranking factors in most of the search engines. I think with Yandex it was the third ranking factor overall, and Google also leveraged TF*IDF to its advantage very early on, but it was not really about ranking back then. The main purpose of TF*IDF is, for example, the identification of homonyms, meaning words that have different meanings.
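As a rough illustration of the formula Marcus describes, here is a minimal Python sketch. It is one common textbook variant, not OnPage.org's or any search engine's actual implementation. The function name `tf_idf`, the toy corpus, and the choice of a logarithm as the "dampening factor" are all assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Score one term in one tokenized document against a list of tokenized documents."""
    counts = Counter(document)
    # Term frequency, dampened with a logarithm (assumed here as the
    # "dampening factor") so that repeating a word has diminishing returns.
    tf = math.log(1 + counts[term])
    # Document frequency: in how many corpus documents does the term appear?
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    # Inverse document frequency: terms found in fewer documents weigh more.
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Toy corpus, invented for illustration only.
corpus = [
    ["java", "code", "compiler"],
    ["java", "coffee", "beans"],
    ["java", "island", "bali"],
]
common = tf_idf("java", corpus[0], corpus)    # appears in every document
rare = tf_idf("coffee", corpus[1], corpus)    # appears in only one document
```

In this toy corpus, a word that appears in every document carries no weight, while a word that appears in only one document scores higher. That is the basic intuition behind the inverse document frequency part.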

Here, TF*IDF can help a lot because you look at the whole article to find out what this word could stand for. Take Java as the homonym. If I have an article about programming and C++, Java is probably meant as the programming language. If the article is talking about surfing, beaches, and Bali, it’s probably the island of Java, and if the article is about coffee and taste and beans, it’s probably java the coffee.
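The Java example can be sketched in a few lines of Python. This is a deliberately simplified illustration that counts overlapping context words rather than applying real TF*IDF weights, and it is not how any search engine actually does it. The sense vocabularies and the function name `guess_sense` are hypothetical.

```python
# Hypothetical context vocabularies for each meaning of "Java".
SENSES = {
    "programming language": {"programming", "c++", "code", "compiler"},
    "island": {"surfing", "beach", "bali", "indonesia"},
    "coffee": {"coffee", "taste", "beans", "roast"},
}

def guess_sense(article_words, senses=SENSES):
    """Pick the sense whose vocabulary shares the most words with the article."""
    words = {word.lower() for word in article_words}
    overlap = {sense: len(words & vocab) for sense, vocab in senses.items()}
    best = max(overlap, key=overlap.get)
    # With no overlapping context words there is nothing to decide on.
    return best if overlap[best] > 0 else None

print(guess_sense("An article about programming in C++ and Java".split()))
print(guess_sense("Surfing near a beach in Bali and Java".split()))
```

A weighted version would count rare, distinctive context words more than common ones, which is where the TF*IDF scores come in.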

This is how a search engine can determine what a homonym stands for in a particular context. It actually helps a lot with ranking as well, because there are so many search queries you just don’t see very often, or that you may never have seen before. I think these days 1/3 of all the search queries Google gets each day are ones it has never seen before, because there are just so many spoken search queries these days.

Also, current stuff, and in this case, you just don’t have user intent signals and a whole backlog of history to know what is the best result for the user in each case. You have to determine this on the fly and TF*IDF can help because, with search queries that Google doesn’t see very often, Google favors the most holistic content.

OnPage.org has a great article explaining TF*IDF.

How Can You Use TF*IDF to Improve Your SEO?

It’s like keyword research, a keyword-inspiration tool. That’s what it’s all about, right? You get keyword inspiration from terms and concepts that could be added to your page to enrich it and make it a more holistic result. It’s not really tactic driven: write this on your page, write that on the page, three times this term and five times that term. That’s not what it’s about.

It’s not about gaming search engines. It’s more about keyword inspiration. For example, if you do a TF*IDF analysis for smart phone in the US you get terms like iPhone, Android, Samsung, and stuff like this.

This is very easy, and this is also something where adding a word won’t get your ranking any higher, but we’re talking about the long tail stuff. If I do the same smart phone analysis in the English language, but for the target market Japan, it will show a high correlation for the term waterproof because, interestingly enough, I think 95% of all smart phones being sold in Asia are waterproof for whatever reason. Maybe they need to have it in the shower.

This is, I think, a prime example because Asian people are very keen on having waterproof phones, so it’s very important to add the term waterproof to your pages because there seem to be a lot of people searching for waterproof, in combination with smart phone. It’s really about not just the language, but also the target market.
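The keyword-inspiration idea above can be sketched as a simple gap check: take the terms that score highly for a topic in a given target market and list the ones your page does not mention yet. The term weights below are invented for illustration and do not come from any real analysis. The function name `missing_terms` is hypothetical.

```python
def missing_terms(weighted_terms, page_text, limit=5):
    """Return the highest-weighted terms that do not appear on the page."""
    # Naive whitespace tokenization; punctuation handling is left out for brevity.
    page_words = set(page_text.lower().split())
    candidates = [
        (term, weight)
        for term, weight in weighted_terms.items()
        if term.lower() not in page_words
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in candidates[:limit]]

# Invented weights for the same query in two target markets.
us_terms = {"iphone": 0.9, "android": 0.8, "samsung": 0.7}
japan_terms = {"waterproof": 0.9, "android": 0.8, "iphone": 0.6}

page = "Our smartphone guide compares Android models"
print(missing_terms(us_terms, page))
print(missing_terms(japan_terms, page))
```

The output is a list of ideas to consider, not a checklist of words to stuff into the page, which matches how Marcus describes using the tool.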

To Tool or Not to Tool?

Doing it by hand is very complicated. As I said, Dr. Pete had an article about it, doing this with Excel, and it was highly complicated. It definitely makes sense to have a tool for this because the most important thing is the corpus.

The corpus is where you’re actually extracting information from. Of course, you can just scrape Google results, but you will run into different problems, so it’s really about building a corpus for a specific language and a specific target market, and not just scraping Google.

Then there’s also extracting the information. You’re looking at all the pages that are ranking well for whatever you’re searching for and then you extract the unique terms, and then you have to weigh them because it’s not just, “Okay, these are all the terms that have been used, so just take whatever you need.” It’s really about weighing them. What seems to be really important? That is what it’s really about.
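The extract-then-weigh step can be sketched as follows, assuming you already have tokenized text from the top-ranking pages and from a reference corpus for your language and target market. This is a simplified illustration, not the method any particular tool uses. The function name `weigh_terms`, the smoothing in the IDF, and the sample data are assumptions.

```python
import math
from collections import Counter

def weigh_terms(top_pages, corpus):
    """Weigh terms from top-ranking pages against a reference corpus.

    Both arguments are lists of tokenized documents (lists of words).
    """
    # Count in how many corpus documents each term appears.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    # Count how often each term appears across all top-ranking pages.
    totals = Counter()
    for page in top_pages:
        totals.update(page)
    weights = {}
    for term, count in totals.items():
        avg_tf = count / len(top_pages)
        # Smoothed IDF (an assumption) so terms absent from the corpus do not divide by zero.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[term]))
        weights[term] = math.log(1 + avg_tf) * idf
    # Highest-weighted terms first.
    return dict(sorted(weights.items(), key=lambda item: item[1], reverse=True))

# Invented sample data for illustration only.
top_pages = [
    ["the", "smartphone", "waterproof"],
    ["the", "smartphone", "waterproof", "camera"],
]
corpus = [
    ["the", "weather"],
    ["the", "smartphone", "news"],
    ["the", "recipe"],
    ["the", "car"],
]
weights = weigh_terms(top_pages, corpus)
```

Here a filler word like "the" ends up with zero weight because it appears in every corpus document, while a term that is common on the top-ranking pages but rare in the corpus rises to the top. That is why the quality of the corpus matters so much.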

Prioritizing TF*IDF in SEO Efforts

I can just give you an example. In Germany here it is really on everybody’s mind. Really. Everybody in SEO, like probably 80% of the people in SEO, will have heard about TF*IDF.

It’s very important for large newspapers and small affiliates alike. I think everybody is using it to their advantage, and this is why I’m a little bit amazed that it really hasn’t picked up in the rest of the world. But it’s still, of course, a little bit complicated at first, and it also has sort of a voodoo thing to it.

People will just say, “Oh, no. It’s just a tactic from back in the day and this is not what it’s about these days,” but these are the people who just don’t understand it correctly and are not using it right because, as I said, it should only be used as a keyword inspiration tool, and this is what people are actually using it for here. This is where it really provides a lot of value.

Really, I have a lot of examples where you see content without any links getting promoted to the first page just by providing a more holistic result for a specific term. These days, SEO is really about understanding the search intent, understanding the user intent, to serve users the result that serves them best. It’s about which is the best result for the user, and this is what you’re basically trying to reverse engineer in order to create a better page for the user, so this is really not voodoo.

This is just another tool you can use for content optimization and it really works.

Think you have what it takes to be a Marketing Nerd? If so, message Kelsey Jones on Twitter, or email her at kelsey [at] searchenginejournal.com.

Image Credits

Featured Image: Image by Paulo Bobita
In-post Photo: Image by Marcus Tandler. Used with permission.

SEJ STAFF Brent Csutoras Managing Partner / Owner at Search Engine Journal