Inverse Document Frequency and the Importance of Uniqueness
Wednesday, May 13th, 2015. Posted by EricEnge
In my last column, I wrote about how to use term frequency analysis in evaluating your content vs. the competition’s. Term frequency (TF) is only one part of the TF-IDF approach to information retrieval. The other part is inverse document frequency (IDF), which is what I plan to discuss today.
Today’s post will use an explanation of how IDF works to show you the importance of creating content that has true uniqueness. There are reputation and visibility reasons for doing this, and it’s great for users, but there are also SEO benefits.
If you wonder why I am focusing on TF-IDF, consider these words from a Google article from August 2014: “This is the idea of the famous TF-IDF, long used to index web pages.” While the way Google applies these concepts is likely far more sophisticated than the simple TF-IDF models I am discussing, we can still learn a lot from understanding the basics of how they work.
What is inverse document frequency?
In simple terms, it’s a measure of the rareness of a term. Conceptually, we start by measuring document frequency. It’s easiest to illustrate with an example, as follows:
In this example, we see that the word “a” appears in every document in the document set. What this tells us is that it provides no value in telling the documents apart. It’s in everything.
Now look at the word “mobilegeddon.” It appears in 1,000 of the documents, or one thousandth of one percent of them. Clearly, this word does far more to differentiate the documents that contain it.
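To make the counting concrete, here is a minimal Python sketch of document frequency. The four-document set and the `document_frequency` helper are invented for illustration; they are not the data set described above, and a real search engine would tokenize text far more carefully than a simple whitespace split.

```python
def document_frequency(term, documents):
    """Count how many documents contain the term at least once."""
    term = term.lower()
    return sum(1 for doc in documents if term in doc.lower().split())


# Hypothetical four-document set, invented for illustration only.
docs = [
    "a guide to buying a boat",
    "mobilegeddon is a google mobile update",
    "a boat needs a trailer",
    "a recipe for a quick dinner",
]

for term in ("a", "boat", "mobilegeddon"):
    df = document_frequency(term, docs)
    print(f"{term}: appears in {df} of {len(docs)} documents")
```

In this toy set, “a” shows up in all four documents and so cannot tell them apart, while “mobilegeddon” shows up in only one.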
Document frequency measures commonness, and we prefer to measure rareness. The classic way to do this is with a formula that looks like this:
IDF(term) = log10(total number of documents / number of documents containing the term)
For each term we are looking at, we take the total number of documents in the document set and divide it by the number of documents containing our term. This gives us a measure of rareness. However, we don’t want the resulting calculation to say that the word “mobilegeddon” is 1,000 times more important in distinguishing a document than the word “boat,” as that is too large a scaling factor.
This is the reason we take the log base 10 of the result: to dampen that calculation. For those of you who are not mathematicians, you can loosely think of the log base 10 of a number as a count of its zeros. For example, the log base 10 of 1,000,000 is 6, and the log base 10 of 1,000 is 3. So instead of treating the word “mobilegeddon” as 1,000 times more important, this type of calculation puts its score only 3 points higher, which is more in line with what makes sense from a search engine perspective.
With this in mind, here are the IDF values for the terms we looked at before:
Now you can see that we are providing the highest score to the term that is the rarest.
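The calculation described above can be sketched in a few lines of Python. The count of 1,000 documents for “mobilegeddon” comes from the text, and the total of 100 million follows from 1,000 being one thousandth of one percent of the set. The count of 1,000,000 for “boat” is an assumption, inferred from the statement that the raw ratio would make “mobilegeddon” 1,000 times more important. The `idf` function name is my own.

```python
import math


def idf(total_docs, docs_with_term):
    """Classic IDF: log base 10 of (total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


# Total inferred from the text: 1,000 documents is 0.001% of the set.
TOTAL_DOCS = 100_000_000

# "a" appears in every document; the "boat" count is an assumption (see above).
doc_counts = {
    "a": 100_000_000,
    "boat": 1_000_000,
    "mobilegeddon": 1_000,
}

for term, count in doc_counts.items():
    raw_ratio = TOTAL_DOCS / count
    print(f"{term}: raw ratio = {raw_ratio:,.0f}, IDF = {idf(TOTAL_DOCS, count):.1f}")
```

Under these assumptions the IDF values come out to 0 for “a,” 2 for “boat,” and 5 for “mobilegeddon.” The raw ratios differ by a factor of 1,000 between the last two terms, while the logged scores differ by only 3.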
What does the concept of IDF teach us?
Think about IDF as a measure of uniqueness. It helps search engines identify what makes a given document special. That takes something much more sophisticated than counting how often you use a given search term (i.e., keyword density).
Think of it this way: If you are one of 6.78 million web sites that come up for the search query “super bowl 2015,” you are dealing with a crowded playing field. Your chances of ranking for this term based on the quality of your content are pretty much zero.
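To see how this differs from simply counting words, here is a rough sketch that multiplies each term’s count in a document by its IDF, which is the simple TF-IDF model this column has been discussing. The four-document corpus and the `tf_idf_scores` helper are hypothetical, and the sketch assumes the document being scored is itself part of the corpus, so every term appears in at least one document. It is not a description of how any search engine actually scores pages.

```python
import math
from collections import Counter


def tf_idf_scores(doc, corpus):
    """Score each term in doc as (count in doc) * log10(N / document frequency).

    Assumes doc is one of the documents in corpus, so document frequency
    is never zero.
    """
    tokenized = [d.lower().split() for d in corpus]
    counts = Counter(doc.lower().split())
    n_docs = len(tokenized)
    scores = {}
    for term, tf in counts.items():
        df = sum(1 for tokens in tokenized if term in tokens)
        scores[term] = tf * math.log10(n_docs / df)
    return scores


# Hypothetical corpus, invented for illustration only.
corpus = [
    "the super bowl is the big game",
    "the super bowl predictions and the odds",
    "the super bowl halftime show",
    "the game and the show",
]

target = corpus[1]
scores = tf_idf_scores(target, corpus)
most_frequent = Counter(target.split()).most_common(1)[0][0]

print("Most frequent term:", most_frequent)
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

In this toy corpus, “the” is the most frequent term in the target document but scores zero because it appears everywhere, while “predictions” and “odds,” which appear in no other document, score highest.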
Overall link authority and other signals will be the only way you can rank for a term that competitive. If you are a new site on the landscape, well, perhaps you should chase something else.
That leaves us with the question of what you should target. How about something unique? Even the addition of a simple word like “predictions”—changing our phrase to “super bowl 2015 predictions”—reduces this playing field to 17,800 results.
Clearly, this is dramatically less competitive already. Slicing into this further, the phrase “super bowl 2015 predictions and odds” returns only 26 pages in Google. See where this is going?
What IDF teaches us is the importance of uniqueness in the content we create. Yes, ranking for these narrower terms will not pay nearly as much as ranking for the big head term would, but if your business is a new entrant into a very crowded space, you are not going to rank for the big head term anyway.
If you can pick out a smaller number of terms with much less competition and create content around them, you can start to rank for these terms and get money flowing into your business. This works because using rarer combinations of terms makes your content more distinctive, which is what IDF teaches us to aim for.
Summary
People who do keyword analysis are often wired to pursue the major head terms directly, simply based on the available keyword search volume. The results of this approach can, in fact, be pretty dismal.
Understanding how inverse document frequency works helps us understand the importance of standing out. Creating content that brings unique angles to the table is often a very potent way to get your SEO strategy kick-started.
Of course, the reasons for creating content that is highly differentiated and unique go far beyond SEO. This is good for your users, and it’s good for your reputation, visibility, AND also your SEO.