Semantic Similarity Approaches


Semantic similarity measures how alike two sentences or documents are. It provides an answer to the question "How much does term X have to do with term Y?" as a value between -1 and 1 or between 0 and 1, where an exact match scores 1.

Semantic similarity is an important concept in Natural Language Processing, and many NLP applications rely on it to achieve various tasks. Information retrieval, text classification, plagiarism detection, and question answering are some examples of these applications. There are different types of semantic similarity approaches, and this article groups them under three topics.

  1. Corpus-based approaches
  2. Knowledge-based approaches
  3. String-based approaches

Corpus-Based Approaches

This approach finds the similarity of words based on a statistical analysis of a corpus. A large corpus is required, since valuable information is extracted by analyzing it: the more word co-occurrences it contains, the more accurately the similarity between words can be estimated. A large number of the proposed approaches to word similarity are corpus-based. Two types of analysis are considered under this approach.

  • Normal Statistical Analysis: Latent Semantic Analysis (LSA) is an example of this analysis. In LSA, each word is represented by a vector based on statistical computations. To construct these vectors, a large text is analyzed and a word matrix is built, in which words are represented as rows and paragraphs as columns. Singular Value Decomposition (SVD) is then applied to reduce dimensionality. Based on the reduced word vectors, the similarity of words is calculated using cosine similarity (see the LSA sketch after this list).
  • Deep Learning: Word embedding is an example of this analysis. In word embedding, a very large corpus is used for training, and word representations are generated based on the co-occurrence of words in the corpus. A deep learning model is trained to predict a word given its surrounding words, and from this model a vector representation for each word can be learned. Finally, cosine similarity between word vectors is used to measure word similarity. Several word embedding models are used to find semantic similarity: Word2Vec by Google, GloVe by Stanford, and fastText by Facebook are popular examples (see the Word2Vec sketch below).
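
To make the LSA pipeline concrete, here is a minimal sketch using scikit-learn (an assumption on my part; any linear-algebra stack would do). The tiny corpus and the choice of two SVD components are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a dog sat on the log",
    "stock markets fell sharply today",
]

# Word matrix: words as rows, documents (standing in for paragraphs) as columns.
vec = CountVectorizer()
X = vec.fit_transform(docs).T          # shape: (vocabulary, documents)

# Reduce dimensionality with SVD, as LSA prescribes.
svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(X)    # one dense vector per word

# Cosine similarity between the reduced word vectors.
vocab = list(vec.get_feature_names_out())
i, j = vocab.index("cat"), vocab.index("dog")
sim = cosine_similarity(word_vectors[i:i + 1], word_vectors[j:j + 1])[0, 0]
print(f"LSA similarity(cat, dog) = {sim:.2f}")
```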
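
A Word2Vec sketch follows the same pattern; it assumes gensim 4.x is installed, and the toy sentences and hyperparameters are placeholders, since a useful model needs a far larger training corpus.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Train a tiny model; the default CBOW objective predicts a word
# from its surrounding words, as described above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)

# Cosine similarity between the learned word vectors.
print(model.wv.similarity("cat", "dog"))
```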

Knowledge-Based Approaches

Knowledge-based methods depend on a handcrafted semantic network of words, in which the meanings of words are encoded. The most popular semantic network for many applications is WordNet, developed by Princeton University. These approaches measure the semantic similarity of concepts using knowledge graphs, where a knowledge graph is defined as a directed graph.

[Figure: directed graph representation of a knowledge graph]

The similarity is measured by considering the distance between concepts in the knowledge graph: the shorter the path from one concept to another, the more similar the concepts are.
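
For instance, NLTK's WordNet interface exposes exactly this path-based measure (assuming nltk is installed and the wordnet corpus has been downloaded via nltk.download("wordnet")).

```python
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# path_similarity scores in (0, 1]: the shorter the path between
# two concepts in the WordNet graph, the higher the score.
print(dog.path_similarity(cat))  # related animals -> higher score
print(dog.path_similarity(car))  # unrelated concepts -> lower score
```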

String-Based Approaches

These approaches measure the likeness of sequences of text, where a sentence is treated as a sequence of characters or words. The similarity depends on a comparison between two such sequences. Three main methods can be considered under this approach.

  • Jaccard Similarity: Measures the similarity of two finite sets as the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
  • Levenshtein Distance: The distance between two strings is the minimal number of basic operations (insert, delete, or replace) needed to convert one string into the other.
  • n-gram: Compares 'n' consecutive words or sounds drawn from a given sequence of text or speech; the overlap between the two n-gram sets indicates similarity (a sketch of all three measures follows this list).
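
The sketch below implements all three measures in plain Python; the helper names and the sample strings are my own illustrations.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the word sets of two sentences."""
    s1, s2 = set(a.split()), set(b.split())
    return len(s1 & s2) / len(s1 | s2)


def levenshtein(a: str, b: str) -> int:
    """Minimal number of insert/delete/replace edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete
                            curr[j - 1] + 1,            # insert
                            prev[j - 1] + (ca != cb)))  # replace
        prev = curr
    return prev[-1]


def ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over the character n-grams of two strings."""
    g1 = {a[i:i + n] for i in range(len(a) - n + 1)}
    g2 = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(g1 & g2) / len(g1 | g2)


print(jaccard("the cat sat", "the cat ran"))  # 0.5
print(levenshtein("kitten", "sitting"))       # 3
print(ngram_overlap("night", "nacht"))        # ~0.14 bigram overlap
```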
