Cosine similarity between two sentences
The cosine similarity between two sentences is the dot product of their vector representations divided by the product of the vectors' magnitudes. There are various ways to represent sentences or paragraphs as vectors.
cosine_similarity(A, B) = (A • B) / (|A| * |B|)
where:
- A • B is the dot product of vectors A and B,
- |A| is the magnitude (or norm) of vector A,
- |B| is the magnitude (or norm) of vector B.
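The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `cosine_similarity` simply mirrors the formula, and the choice to raise an error for a zero vector (where the cosine is undefined) is an assumption made here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # A . B: sum of the element-wise products
    dot = sum(x * y for x, y in zip(a, b))
    # |A| and |B|: Euclidean norms
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))  # same direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal: 0.0
```

A value of 1 means the vectors point in the same direction, and 0 means they are orthogonal. For vectors of non-negative counts, the result always lies between 0 and 1.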
Here are two very short texts to compare:
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:
me Julie loves Linda than more likes Jane
Now we count the number of times each of these words appears in each text:
word    text 1    text 2
me      2         2
Jane    0         1
Julie   1         1
Linda   1         0
likes   0         1
loves   2         1
more    1         1
than    1         1
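The counting step can be sketched in Python as follows. It assumes the simplest possible tokenization (splitting on whitespace, case-sensitive), which is enough for these two texts; the variable names are illustrative. The vocabulary is sorted only to fix a consistent order. That order differs from the table's, which is harmless because cosine similarity does not depend on the order of the dimensions as long as both vectors use the same one.

```python
from collections import Counter

text_1 = "Julie loves me more than Linda loves me"
text_2 = "Jane likes me more than Julie loves me"

# Count how often each word occurs in each text (whitespace tokenization).
counts_1 = Counter(text_1.split())
counts_2 = Counter(text_2.split())

# Shared vocabulary: every word that appears in either text, in a fixed order.
vocabulary = sorted(set(counts_1) | set(counts_2))

# One count per vocabulary word; a Counter returns 0 for a missing word.
vector_1 = [counts_1[word] for word in vocabulary]
vector_2 = [counts_2[word] for word in vocabulary]

for word, c1, c2 in zip(vocabulary, vector_1, vector_2):
    print(word, c1, c2)
```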
We are not interested in the words themselves though. We are interested only in those two vertical vectors of counts. For instance, there are two instances of ‘me’ in each text. We are going to decide how close these two texts are to each other by calculating one function of those two vectors, namely the cosine of the angle between them.
The two vectors, read down the two count columns in the row order of the table, are:
a: [2, 0, 1, 1, 0, 2, 1, 1]
b: [2, 1, 1, 0, 1, 1, 1, 1]
The cosine of the angle between them is about 0.822.
These vectors are 8-dimensional. A virtue of cosine similarity is that it turns a comparison in a space too high-dimensional to visualize into a single angle, which is easy to picture. In this case the angle is about 35 degrees, which is some ‘distance’ from zero, the angle of perfect agreement.
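The arithmetic behind these two numbers can be checked with a short Python sketch. The vectors below are the counts read off the table row by row (me, Jane, Julie, Linda, likes, loves, more, than); the variable names are illustrative.

```python
import math

# Word counts for each text, in the row order of the table.
a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]

dot = sum(x * y for x, y in zip(a, b))     # 9
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(12)
norm_b = math.sqrt(sum(y * y for y in b))  # sqrt(10)

cosine = dot / (norm_a * norm_b)           # 9 / sqrt(120)
angle_degrees = math.degrees(math.acos(cosine))

print(round(cosine, 3))         # 0.822
print(round(angle_degrees, 1))  # roughly 35 degrees
```

Reordering the words changes neither the dot product nor the norms, provided both vectors are reordered in the same way, so any consistent word order gives the same cosine.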
An example of cosine similarity using JavaScript is described in String Similarity Comparison in JS with Examples.
Source Code: