Have you wondered how Google approximates your search queries? How can their algorithm find relatively similar search queries without 100% matching the exact string you inputted? Although Google’s search algorithm involves more than a SQL Fuzzy Match, this matching technique is an integral part of making your search experience seamless and relevant.
Besides relevancy, other problems include scaling the fuzzy matching algorithms within large databases in SQL. A simple, less accurate fuzzy matching algorithm can be easier to develop in the short run but might be inefficient or hard to maintain in the long run. Below, we look at fuzzy matching in SQL and the basic concepts we can learn in order to generate a better approach to meet our needs.
Fuzzy matching, which can also more aptly be called Approximate String Matching, is a method that enhances the ability to provide predictive search functions, detect misspellings, and return approximate values.
Fuzzy matching is a practical application of “fuzzy logic.” Essentially, while most algorithms stem from a binary perspective (i.e., having 1 or 0 as return values), fuzzy logic returns numerical values that can determine “truthiness” or “falseness” (i.e., “degrees of truth”).
What this logic implies is that the in-betweens are taken into consideration. Applying the logic to fuzzy matching, the results are not only “match” and “non-match” (binary outputs) but can also determine the similarity score of two strings (relevance).
The approximate, “in-between” values do not have to 100% match the string input but must meet a certain threshold to be considered “similar enough.”
Fuzzy matching in SQL has many benefits outside of matching and search prediction. Just a few of those applications can be found in the following areas:
Finding exact matches is an easy task in SQL, but the same thing cannot be said for generating approximate matches.
While fuzzy matching may seem simple, there are many methods you can employ in SQL to do a fuzzy match, making it daunting to design the best course of action. While one approach may be objectively better than others for a particular situation, there is no one correct way to fuzzy match in SQL.
Here are the fuzzy matching methods we will take a look at:
While more complex fuzzy matching methods exist, sometimes simpler ones work wonders. This fuzzy matching technique is relatively quick, easy to implement, and easy to understand. Moreover, this fuzzy matching method is easy to use for beginners!
To start, we will use the LIKE
operator in SQL for this method.
Problem Set: Given a table of pupils, determine which pupils have a name similar to Smith
Select StudentName, LEN(StudentName) - LEN('Smith') from Students
WHERE StudentName like '%Smith%'
Order By LEN(StudentName) - LEN('Smith')
Example output should look like the following:
Peterson Smith | 9 |
Smithson Ford | 8 |
Smith Jay | 4 |
Smith | 0 |
Let’s break the code down. The LEN function returns the length of a string; through this, we can generate a pseudo-similarity algorithm that bases its output score on the differences in string length.
Using the WHERE
clause, we can select only those strings in which Smith
is placed on the original string’s left or right-hand side. This restriction is determined by the (%) set at the start and end of Smith
.
The Levenshtein distance, also known as the edit distance, is an algorithm that counts the differences between two strings based on their mismatches. For example, given the word ‘Hello’ and ‘Mellow’, you will need to change ’M’ to ‘H’ and remove ‘W’, which means the two words have an edit distance of two.
This algorithm is not necessarily easy to implement but is still commonly deployed in SQL and other coding languages. Learning this approach will give you flexibility across a range of situations, and is one of the best methods to learn in constructing a fuzzy match.
Let’s look at the code:
DELIMITER $$
CREATE FUNCTION levenshtein( str1 VARCHAR(255), str2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE str1_len, str2_len, i, j, c, c_temp, cost INT;
DECLARE str1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET str1_len = CHAR_LENGTH(str1), str2_len = CHAR_LENGTH(str2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF str1 = str2 THEN
RETURN 0;
ELSEIF str1_len = 0 THEN
RETURN str2_len;
ELSEIF str2_len = 0 THEN
RETURN str1_len;
ELSE
WHILE j <= str2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= str1_len DO
SET str1_char = SUBSTRING(str1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= str2_len DO
SET c = c + 1;
IF str1_char = SUBSTRING(str2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;
Implementation from Artful Software
This implementation of the Levenshtein algorithm may be pretty intimidating to look at due to the nested ‘WHILE’ loops, but let’s take a quick tour of the code:
In this algorithm, we utilize hex conversions to generate our distance return value, denoted by the variable c. Let’s shift our focus from the rest of the code and focus on the nested ‘WHILE’ loops, as this is where the magic happens.
Going off with this block:
WHILE i <= str1_len DO
SET str1_char = SUBSTRING(str1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= str2_len DO
SET c = c + 1;
IF str1_char = SUBSTRING(str2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
We can see that a temporary variable, str1_char, and str2_char, has been made to store the current characters to be compared. We set str1_char to the current character we are pointing at in the string. For example, where i=2, in the word HELLO,
we are pointing to the character E.
We set a similar variable to our second character, denoted by str2_char. We then compare if these two are the same. If they are the same, the cost variable will be set to zero; else, we will set it to one. The cost variable will be tallied to generate the distance between the strings.
The process will then be done iteratively until we are done comparing the two strings.
Soundex is a phonetic algorithm that both generates and indexes strings based on their sound. Soundex can help match strings despite the differences in spelling, and it mainly concerns itself with consonants.
What makes Soundex the most popular algorithm of its kind is its general standardization and, nowadays, wide-reaching implementation in SQL database systems.
To implement Soundex, we use the following code:
Select StudentName,
DIFFERENCE (Soundex(StudentName), Soundex('Smith'))
from Students
WHERE StudentName like %Smith%
Order By DIFFERENCE (Soundex(StudentName), Soundex('Smith')) DESC
Let’s break down the code. We generate the differences in Soundex values between two strings using the DIFFERENCE
(Soundex(str_x, str_y)) function to create a numerical value that can help differentiate or contrast these two strings phonetically.
Of course, we will only include those records with %Smith% in the actual string. However, one can change this method dramatically to become more lenient.
For example, instead of requiring the presence of the string Smith
inside the word, it can also be set to return a match where the Soundex values between two strings must not be of this degree.
Nevertheless, the approach above also works and is an easy and straightforward way to make a Fuzzy Matching algorithm using SOUNDEX.
Fuzzy matching is essential for most workloads, but as your databases grow, simple fuzzy matching algorithms may take minutes to hours before they can process your query. In this section, we will discuss the details of an advanced fuzzy matching algorithm: Term Frequency, Inverse Document Frequency (or tf-idf for short).
Before we proceed with the main algorithm, let us understand the core concepts of fuzzy matching large datasets:
The first concept we need to learn is simplification or deduplication, which means we will need to append, simplify, extend, or compress words. For example, First St. will turn into First Street, while Los-Reyes will turn into Los Reyes. We will need to do this to simplify our algorithm’s workload and still deliver accurate and helpful results.
We also need to conduct record linkage, wherein we will link two or more elements with no unique identities but which are similar nonetheless. For example, we might want to link “Burma” and “Myanmar” despite the two words having a poor Levenshtein distance score.
Basic Term Frequency, Inverse Document Frequency Introduction
The tf-idf algorithm is used in Natural Language Processing fields (NLP) and involves dividing the text into n-grams (or chunks) and creating numerical values based on their rarity. To keep things clean, common words such as “are,” “is,” “and,” “I,” and “you” are filtered out.
By applying the algorithm, we can compare close matches throughout the text that appear less common. For example, uncommon keywords are valued more due to their rarity of occurrence, thus creating a better, more accurate algorithm.
The question below is a possible SQL interview question that you may encounter about Fuzzy Matching:
Let’s say there are two tables, students and tests.
The test table does not have a student ID. However, it has both the first and last name, as well as the date of birth, which we should consider as possibly inaccurate since these details are entered manually.
What process would you use to determine which student in the student rosters took which exam?
Notes: You can assume that you can utilize a human to support the review of your matches and that you need to evaluate thousands of rows of data for this project.
Input:
students
table
Columns | Type |
---|---|
id |
INTEGER |
firstname |
VARCHAR |
lastname |
VARCHAR |
date_of_birth |
DATETIME |
Output:
tests
table
Columns | Type |
---|---|
firstname |
VARCHAR |
lastname |
VARCHAR |
date_of_birth |
DATETIME |
test_score |
INTEGER |
test_date |
DATETIME |
Most popular answer from Interview Query:
I would start with the assumption that making a mistake on more than one column is unlikely. I would select a candidates table where each test has several rows, each consisting of students with at least two matching columns. To select from the candidates, I would use a dictionary of common typographical mistakes for the names and dates. If all three match, I would select that option. Otherwise, go with a measure of the difference between the values.
If two identically named students are transposed, then the only recourse would be manual selection.
Another user-submitted answer showcased the following:
Other than exact matches for firstname
+ lastname
+ dateofbirth
, I think you’d want to look for matches to 2⁄3 and have a human review the third field to see if it approximately matches.
For example, for Jon Doe born 5/7/1990, you might match Jonathan Doe 5/7/1990 or Jon Doe 6/7/1990.
I’d expect an exact match on 3⁄3 fields to net ~80% of matches, then need 2⁄3 matching to get another ~15%.
Finally, if it’s only thousands of rows, you should be able to hand-match the remaining ~5%. I’d sort by last name since it’s more likely to be unique than first name and is less susceptible to a fat-finger mistake than DOB. (Even if you slightly misspell a last name, it’s going to look similar enough that a human can spot a match.)
Using the second solution, let us build a pseudo code SQL query to accomplish precisely that.
firstname
, secondname
, and date_of_birth
columns from the Students_table
, and store this concatenation into a new column. For ease of reference, we can call this column compare_A
. By aggregating our relevant data, we can easily compare each student’s respective information.Tests_table
: concatenate the firstname
, secondname
, and date_of_birth
columns. For ease of reference, we can call this concatenated column ‘compare_B’compare_A
to all the elements of compare_B
using the Levenshtein Distance algorithm.compare_B
is longer, use that element’s length).