String Similarity Comparison in JS with Examples

Suman Kunwar
6 min readMar 2, 2019

--

In this article, we are going to discuss different types of algorithms used in string matching and determine how similar they are with various examples.

Fuzzy string matching is a type of search that matches the result even when the user enters the misspell words or enter any partial word of the search. It is also known as approximate string matching

“In computer science, fuzzy string matching is the technique of finding strings that match a pattern approximately (rather than exactly)”

Levenshtein Algorithm

The Levenshtein distance is the minimum number of single-character edits required to change one word into the other, so the result is a positive integer, sensitive to string length. Which make it more difficult to draw a pattern.

For example,

  • The Levenshtein distance between “foo” and “bar” is 3
  • The Levenshtein distance between “beauties” and “beautiful” is also 3
  • For us, humans, the “beauties”/”beautiful” pair is much more similar than the “foo”/”bar” pair. But the Levenshtein distance is the same.

Example of levenshtein.js:

JS Performance: https://jsperf.com/levenshtein-distances/1

Trigram Comparison

A trigram algorithm is a case of n-gram, a contiguous sequence of n (three, in this case) items from a given sample. In our case, an application name is a sample and a character is an item.
So The sequence “martha” has 4 trigrams { mar art rth tha }.

We can use the Trigram method to compare two strings.

Taking for example “martha” and the same word with a typo, “marhta”, and we can compute their trigrams:

Trigrams “martha”: { mar art rth tha }

Trigrams “marhta”: { mar arh rht hta }

To measure similarity we divide the number of matching trigrams in both strings: 1 { mar } by the number of unique trigrams: 7 { mar art rth tha arh rht hta }

The result is 1/7 = 14%

To balance the disadvantage of the outer characters (somewhat to reinforce the similarity of strings starting and ending with the same trigrams), we pad the string with blanks on either side resulting in these cases into three more trigrams “_ma”, “ha_“ and “ta_”.

Trigrams “ martha ”: { _ma mar art rth tha ha_ }

Trigrams “ marhta ”: { _ma mar arh rht hta ta_ }

Having done that, the number of matching trigrams is up to: 2 { _ma mar }
The number of all unique trigrams: 9 { _ma mar art rth tha arh rht hta ha_ }

The result is now 2/9 = 22%

Example of Trigram.js:

JS Performance: https://jsperf.com/trigram

Cosine Similarity

Cosine similarity between two sentences can be found as a dot product of their vector representation. There are various ways to represent sentences/paragraphs as vectors.

similarity= cos(a,b)= dotproduct(a,b) / ( norm(a) * norm(b) )= a.b / ||a|| * ||b||

Here are two very short texts to compare:

  1. Julie loves me more than Linda loves me
  2. Jane likes me more than Julie loves me

We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:

me Julie loves Linda than more likes Jane

Now we count the number of times each of these words appears in each text:

 me     2   2
Jane 0 1
Julie 1 1
Linda 1 0
likes 0 1
loves 2 1
more 1 1
than 1 1

We are not interested in the words themselves though. We are interested only in those two vertical vectors of counts. For instance, there are two instances of ‘me’ in each text. We are going to decide how close these two texts are to each other by calculating one function of those two vectors, namely the cosine of the angle between them.

The two vectors are, again:

a: [2, 1, 0, 2, 0, 1, 1, 1]b: [2, 1, 1, 1, 1, 0, 1, 1]

The cosine of the angle between them is about 0.822.

These vectors are 8-dimensional. A virtue of using cosine similarity is clearly that it converts a question that is beyond human ability to visualize to one that can be. In this case, you can think of this as an angle of about 35 degrees which is some ‘distance’ from zero or perfect agreement.

Example of Cosine-Similarity.js:

JS Performance: https://jsperf.com/consine-similiarity

Jaro-Winkler Algorithm

“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.

Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro-Winkler distance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length”

Source: Wikipedia.

Giving “more important” to words with identical prefixes made the Jaro-Winkler distance seem very interesting for our use case.

Starting from the beginning with the Jaro distance formula, here how it works. Don’t panic, we go step by step:

The Jaro Distance between two sequences s1 and s2 is defined by:

Jaro distance formula

dj is the Jaro distance
m is the number of matching characters (characters that appear in s1 and in s2)
t is half the number of transpositions (compare the i-th character of s1 and the i-th character of s2 divided by 2)
|s1| is the length of the first string
|s2| is the length of the second string

With an example. Let’s take “martha” and “marhta”.

m = 6
t = 2/2 =1 (2 couples of non matching characters, the 4-th and 5-th) { t/h ; h/t }
|s1| = 6
|s2| = 6

Just by replacing numbers is the formula, we get:

dj = (⅓) ( 6/6 + 6/6 + (6–1)/6) = ⅓ 17/6 = 0,944Jaro distance = 94,4%

Now we know what is the Jaro distance, let’s jump to the Jaro-Winkler distance.

The Jaro-Winkler similarity uses a prefix scale p which gives more favorable ratings to strings that match from the beginning for a set prefix length l.

p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler’s work is p=0.1.

l is the length of the common prefix at the start of the string (up to a maximum of 4 characters).

Jaro-Winkler distance formula

So back to the “martha”/ “marhta” example, let’s take a prefix length of l = 3(which refers to “mar”). We get to:

dw = 0,944 + ( (0,1*3)(1–0,944)) = 0,944 + 0,3*0,056 = 0,961Jaro-Winkler distance = 96,1%

Using the JaroWinkler formula we go from the Jaro distance at 94% similarity to 96%.

Example of Jaro-wrinker.js

JS Performance: https://jsperf.com/jaro-winker

Sources:

Source Code:

--

--

Suman Kunwar
Suman Kunwar

Written by Suman Kunwar

Innovating Sustainability | Researcher | Author of Learn JavaScript : Beginners Edition

Responses (2)