Use of the Vector Space Model (VSM) to calculate the similarity between 2 text documents

minoli
4 min read · Jul 16, 2021

The vector space model is an algebraic model for representing text documents as vectors. Its most common use is similarity calculation, and it has a wide range of applications in NLP (Natural Language Processing).

The vector space model is used in information retrieval, indexing, and information filtering. Its first use was in the SMART information retrieval system.

Let’s consider an example to understand the vector space model. Consider a total of 10 unique words (w1, w2, …, w10) across three articles (d1, d2, d3). The statistical word frequency table shows the frequency of each word in each article. Using any of the common vector space formulas, it is possible to calculate the similarity between two text documents.

The statistical word frequency table
Commonly used vector space formula

Let’s calculate the similarity between articles D1 and D2, taking cosine similarity as an example.

Cosine similarity between D1 and D2
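The cosine calculation above can be sketched in a few lines. Since the word frequency table appears only as an image, the two frequency vectors below are assumed values for illustration:

```python
import numpy as np

# Assumed word-frequency vectors for D1 and D2 over a 10-word vocabulary
# (the actual table is shown as an image above; these values are illustrative).
d1 = np.array([2, 1, 0, 3, 0, 1, 0, 0, 1, 2])
d2 = np.array([1, 1, 1, 2, 0, 0, 1, 0, 1, 1])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 4))  # → 0.8485
```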

There are two reasons this is called a vector space model:

  1. Each word can be treated as a dimension.
  2. The frequency of a word is regarded as its value along that dimension.

The words and their frequencies in each article constitute an n-dimensional space, so each document becomes a point (or vector) in that space. The similarity between two documents is the proximity of their two points. (Assuming an article has only two dimensions, the space can be drawn in a plane rectangular coordinate system.) When the number of words in a document is huge, computing the above formula becomes very expensive. To improve efficiency, a dimensionality reduction approach is required: the number of unique words must be reduced, for example by removing stop words.
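The dimensionality reduction by stop-word removal can be sketched as follows. The stop-word list here is a small assumed sample, not a standard list:

```python
# A minimal sketch of dimensionality reduction via stop-word removal.
# This stop-word list is a small assumed sample for illustration.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def vocabulary(doc, remove_stop_words=False):
    """Return the set of unique words (the dimensions) in a document."""
    words = doc.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return set(words)

doc = "the vector space model is a model of text in a vector space"
print(len(vocabulary(doc)))                          # full vocabulary: 9 dimensions
print(len(vocabulary(doc, remove_stop_words=True)))  # reduced: 4 dimensions
```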

Simple implementation steps of cosine similarity calculation based on VSM (Vector Space Model)

  1. Vector Space Model(VSM)
  2. TF-IDF(Term Frequency Inverse Document Frequency): Assign different weights for each word.

Assign an ‘importance’ weight to each word on the basis of how rarely it appears across documents. This inverse-document-frequency factor is called IDF, and multiplying it by the term frequency gives the TF-IDF weight.

  • The most common words → Least weight
  • The more common words → Less weight
  • Less common words → Greater weight

  3. Cosine similarity calculation

After the calculation, we obtain a similarity score. Manually select pairs of documents known to be similar, calculate their similarity, and use those scores to define a threshold. With the threshold, it is easy to decide whether two documents count as similar, and to what degree. Similarly, to compare an article against a class of articles we like, the class can be represented by the average (center) of its article vectors.
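Both ideas, thresholding and class centers, can be sketched briefly; the threshold value and the article vectors below are assumptions for illustration:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Assumed cut-off, in practice tuned on manually selected document pairs.
THRESHOLD = 0.8

# A class of liked articles is represented by the mean of its vectors.
liked_articles = np.array([[2, 0, 1], [3, 1, 1], [2, 1, 0]], dtype=float)
class_center = liked_articles.mean(axis=0)

# A new article is "similar to the class" if it clears the threshold.
new_article = np.array([2.0, 1.0, 1.0])
print(cosine(new_article, class_center) >= THRESHOLD)  # → True
```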

Disadvantages of the method:

  • The amount of calculation is too large
  • Adding new text requires recomputing the word weights
  • The relevance between the words is not considered

The cosine value indicates the degree of similarity between articles: the closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are.

Finding similar articles algorithm

  1. Use the TF-IDF algorithm to find the keywords of the two articles.
  2. Take several keywords from each article, merge them into a set, and calculate each article’s word frequency for the words in the set.
  3. Generate a word frequency vector for each of the two articles.
  4. Calculate the cosine similarity of the two vectors; the larger the value, the more similar the articles.

Python code snippet for finding the cosine similarity between 2 sentences using NumPy
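The following is a minimal sketch of such a snippet, following the merged-word-set and frequency-vector steps above with plain word counts (the function name and example sentences are assumptions):

```python
import numpy as np

def cosine_similarity(s1, s2):
    """Cosine similarity between two sentences via word-frequency vectors."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    vocab = sorted(set(w1) | set(w2))  # merged word set
    # Word-frequency vector for each sentence over the merged vocabulary.
    v1 = np.array([w1.count(w) for w in vocab], dtype=float)
    v2 = np.array([w2.count(w) for w in vocab], dtype=float)
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(round(cosine_similarity("the cat sat on the mat",
                              "the cat lay on the mat"), 4))  # → 0.875
```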
