A Document's similarity and difference





March 23 2019



In this post we explain how basic Natural Language Processing techniques can be used to create a similarity-difference vector so that we can compare documents to one another.





How do we compare the similarity of documents?

We compare the tf-idf vector of one document to another document's vector. What is tf-idf? It stands for term frequency-inverse document frequency. Still confused? Let’s break it down into simple terms.


The tf part, in relation to text-based documents, measures how often each word appears in a document. In the search engine world, this is used to pull common keywords and phrases from websites to help users find the information they’re searching for. tf can also be used to identify and weed out “stop words,” which are connecting words such as “a,” “the,” “an,” and “is.” In most cases, these are not the keywords you want to pull out, but rather the words you want the search engine to ignore when performing its search. There are some exceptions to this rule, which we will get into later.
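To make the tf idea concrete, here is a minimal Python sketch that counts how often each word appears in one document. The function name `term_frequency`, the sample sentence, and the choice to divide by the document length are our own assumptions for illustration; other tf variants (raw counts, for example) are just as common.

```python
import re
from collections import Counter

def term_frequency(text):
    """Return each word's share of the total word count in one document."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Made-up sample sentence, for illustration only.
doc = "The antenna sends the signal and the receiver decodes the signal"
tf = term_frequency(doc)

# Rank words from most to least frequent.
ranked = sorted(tf, key=tf.get, reverse=True)
print(ranked[:2])   # ['the', 'signal']
```

Notice that the stop word “the” comes out on top, which is exactly why frequency alone is not enough.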


In many cases, finding documents with commonly searched words is helpful, but not always.


That brings us to the idf part of the acronym. If tf is about finding similarities, idf is about finding the differences. While it’s important to know what’s similar in two documents, it’s sometimes equally important to learn what sets them apart.


To return once again to “stop words,” the idf part is important for offsetting these common, “throwaway” terms. They matter to the English language and appear more often than anything else in written text, yet they add little to the substance of a document. The inverse document frequency gives more weight to words that appear in few documents and less weight to words that appear in most of them, so even if the word “the” appears 100 times, its weight stays low and the rarer terms stand out.
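Here is a small sketch of how the two parts combine. It assumes one common textbook weighting, tf multiplied by log(N / df), where N is the number of documents and df is the number of documents that contain the word. The function names and the three sample sentences are hypothetical, and real tools often use smoothed variants of this formula.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Made-up documents, for illustration only.
documents = [
    "the sensor sends the reading over the radio link",
    "the valve opens when the pressure drops",
    "the radio link encrypts the reading",
]
vectors = tf_idf(documents)
print(vectors[0]["the"])                           # 0.0
print(vectors[0]["sensor"] > vectors[0]["radio"])  # True
```

Because “the” appears in every document, its idf is log(1) = 0 and its weight disappears no matter how often it is used. “sensor” appears in only one document, so it outweighs “radio,” which appears in two.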


Stop words explained

As promised, we will elaborate on the concept of “stop words.” While in many cases you want tf-idf to exclude stop words from the search, there are cases where you want them included. It’s all about context. There is also no universally accepted list of “stop words.” For example, some searches look for key phrases rather than keywords, so “stop words” are deliberately kept so that the sentence or phrase can be recognized as a whole. See the figure below.
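In code, the two situations might look like the sketch below. The `STOP_WORDS` set is a short hypothetical list chosen for this example (as noted, there is no universal one), and `contains_phrase` is our own helper name. Keyword extraction drops the stop words, while phrase matching keeps them so the phrase can be found word for word.

```python
import re

# Hypothetical stop list for illustration; there is no universal list.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and"}

def tokenize(text, remove_stop_words=True):
    words = re.findall(r"[a-z']+", text.lower())
    if remove_stop_words:
        return [w for w in words if w not in STOP_WORDS]
    return words

def contains_phrase(text, phrase):
    """Match a phrase word for word, keeping stop words in place."""
    words = tokenize(text, remove_stop_words=False)
    target = tokenize(phrase, remove_stop_words=False)
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

text = "The method of transmission is secure"
print(tokenize(text))                                    # ['method', 'transmission', 'secure']
print(contains_phrase(text, "method of transmission"))   # True
print(contains_phrase(text, "method transmission"))      # False
```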





How do you set this up?

In order to apply all the rules you’ve determined you need – such as what to look for and what to ignore – you need to prepare your documents to be processed.


This is all about machine learning. Before a program can learn anything from your documents, you have to teach it what to look for. Tokenizing is an important step in this preparation phase: it breaks the text into the words, or strings of words, that the analysis will count. A multi-word token is often more useful than the same words taken separately. Sure, you could have the program search for the words individually, but finding them together is a stronger signal. When tokenizing, being specific is better.


Example: “secure,” “wireless,” “transmission” vs. “secure wireless transmission”.
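One simple way to get multi-word tokens is to give the tokenizer a list of phrases to keep together. The sketch below assumes such a list exists; `KEY_PHRASES` and `tokenize_with_phrases` are hypothetical names, and in practice the phrases would be chosen to suit your document set.

```python
import re

# Hypothetical phrase list; in practice it is chosen for the document set.
KEY_PHRASES = [("secure", "wireless", "transmission")]

def tokenize_with_phrases(text, phrases=KEY_PHRASES):
    words = re.findall(r"[a-z']+", text.lower())
    # Try longer phrases first so the most specific match wins.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(words):
        for phrase in ordered:
            if tuple(words[i:i + len(phrase)]) == phrase:
                tokens.append(" ".join(phrase))   # one token for the whole phrase
                i += len(phrase)
                break
        else:
            # No phrase starts here, so keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens

tokens = tokenize_with_phrases("A secure wireless transmission needs a secure key")
print(tokens)
# ['a', 'secure wireless transmission', 'needs', 'a', 'secure', 'key']
```

The second “secure” stays a single-word token because it is not followed by the rest of the phrase.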


Stemming

Another important step in preparing the documents for tf-idf analysis is stemming. Most writers are trained to use a broad vocabulary to keep readers interested. This often means using variations of the same word. In fact, the word “variation” is itself a great example. Within one document you might find the words “variation,” “vary,” “varying,” “variety,” “varied,” “varies,” and so on. You could say all of these words stem from “vari.” Stemming teaches the tf-idf analysis to treat all of these words as the same term.
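To show the idea, here is a deliberately tiny stemmer. It is a toy: the suffix list is our own assumption, picked so that it fits the “vari” example, and it is not the algorithm any real stemming library uses. Established stemmers apply far more careful rules and may not group all six of these words together.

```python
# Toy suffix list chosen to fit the "vari" example; real stemmers use
# far more careful rules.
SUFFIXES = ("ation", "ety", "ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    # Treat a trailing "y" like "i" so "vary" and "varied" line up.
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

words = ["variation", "vary", "varying", "variety", "varied", "varies"]
stems = {toy_stem(w) for w in words}
print(stems)   # {'vari'}
```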


All together now

By isolating stop words, tokenizing words and key phrases, and stemming, and by configuring each step to suit the set of documents you wish to analyze, you make the tf-idf analysis more useful and effective. See the figure below.
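Put together, the preparation steps and the tf-idf calculation might be sketched like this. Everything configurable here (`STOP_WORDS`, `KEY_PHRASES`, `SUFFIXES`) is a hypothetical setting for the made-up sample documents, the stemmer is a crude suffix stripper, and the weighting assumes the plain tf times log(N / df) form.

```python
import math
import re
from collections import Counter

# Hypothetical settings; each would be tuned to the document set.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}
KEY_PHRASES = [("wireless", "transmission")]
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Crude suffix stripping, for illustration only."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    words = re.findall(r"[a-z']+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        for phrase in KEY_PHRASES:
            if tuple(words[i:i + len(phrase)]) == phrase:
                tokens.append(" ".join(phrase))   # keep the phrase as one token
                i += len(phrase)
                break
        else:
            if words[i] not in STOP_WORDS:        # drop stop words
                tokens.append(stem(words[i]))     # stem everything else
            i += 1
    return tokens

def tf_idf(documents):
    tokenized = [preprocess(doc) for doc in documents]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    return [
        {t: (c / len(tokens)) * math.log(len(tokenized) / doc_freq[t])
         for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

documents = [
    "The device encrypts the wireless transmission",
    "The device is encrypting a wired transmission",
    "The pump moves water",
]
vectors = tf_idf(documents)
for vector in vectors:
    print(vector)
```

In this sketch “encrypts” and “encrypting” end up as the same token, so the first two documents share it, while “wireless transmission” stays a single token that appears only in the first document.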





How do we use it?

XORGate Solutions uses tf-idf to mine documents for similar and different terms using specific parameters. This is useful for finding similar documents while also showing you the key differences: what makes them the same, but also what sets them apart. This is important if you are trying to investigate something you don’t believe has been done before, or if you feel you have made a significant improvement upon something that exists.
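As a rough illustration of that kind of comparison, the sketch below scores two tf-idf vectors with cosine similarity and lists the terms they share and the terms unique to each. Cosine similarity is a common choice that we are assuming here, not a description of the parameters XORGate Solutions actually uses, and the weights in `doc_a` and `doc_b` are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def compare(a, b):
    """Report how similar two documents are and which terms differ."""
    return {
        "similarity": cosine_similarity(a, b),
        "shared": sorted(set(a) & set(b)),
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
    }

# Hypothetical tf-idf weights, for illustration only.
doc_a = {"antenna": 0.30, "encrypt": 0.20, "mesh": 0.40}
doc_b = {"antenna": 0.30, "encrypt": 0.20, "satellite": 0.40}

result = compare(doc_a, doc_b)
print(result["shared"])   # ['antenna', 'encrypt']
print(result["only_a"])   # ['mesh']
print(result["only_b"])   # ['satellite']
print(round(result["similarity"], 2))
```

The shared terms explain why the two documents look alike, and the unique terms point to what sets each one apart.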