Insights

Encora shares perspectives on emerging insights, challenges, and trends across the industries in which we work

Finding Duplicate Addresses Using the Levenshtein Distance Metric in SQL

Carlos Blanco | March 05, 2021

When the LIKE operator in SQL is insufficient or lacks the ability to find strings that are similar but not exactly alike, the Levenshtein distance algorithm is useful.

The Levenshtein distance metric measures the difference between two strings. That is the minimum number of single-character edits that are required to change one string into another other. Single-character edits can be insertions, deletions, and substitutions. For example, the difference between “books” and “back” is three.

  • books -> baoks (substitution “o” for “a”)
  • baoks -> backs (substitution “o” for “c”)
  • backs -> back (deletion of “s”)

Phil Factor writes about the edit distance and gives an implementation in MS SQL Server’s T-SQL for this algorithm in his post String Comparisons in SQL: Edit Distance and the Levenshtein algorithm.

Other DBMS already come with an implementation. For example, the LEVENSHTEIN function in PostgreSQL, the EDIT_DISTANCE_SIMILARITY function in Oracle, and the EDITDIST3 in SQLite.

The Levenshtein distance is useful when trying to identify a string like 931 Main St is the “same” as 931 Main Street. This is a common issue in systems that work with client information such as CRMs. In this scenario, calculating the Levenshtein distance and then transforming it into a ratio based on the length of the largest string can give the percentage of similarity between the two strings. It is up to the individual to decide what percentage is sufficient to determine that two addresses are similar enough to be considered the same.

A ratio of less than 50% difference has yielded good results in the past.

Obviously, this technique is not only for addresses. It can be used to find similar strings as needed. Spell-checkers use the Levenshtein distance to find words to suggest when a person makes a typo. Ispell suggests words with an edit distance of 1, for example.

An Example in PostgreSQL

Here is a table with two text columns. Each column holds an address.

 


CREATE TABLE Addresses(
	LINE1 text,
    LINE2 text
);

 

These are the sample records. Most of the addresses in each row are the same for a human, but they’re not equal to the LIKE operator.

 


INSERT INTO Addresses (LINE1, LINE2)
VALUES ('9128 LEEWARD CIR, INDIANAPOLIS, IN', '9128 Leeward Circle, Indianapolis, IN');

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('101 OCEAN LANE DRIVE, KEY BISCAYNE, FL','101 Ocean Lane Drive Unit 1010, Key Biscayne, FL');

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('9301 EVERGREEN DRIVE, PARMA, OH', '9301 Evergreen Dr, Parma, OH');

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('1817 BERTRAND DR, LAFAYETTE, LA', '2924 Polo Ridge Ct, Charlotte, NC');

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('201 E 87TH ST, NEW YORK, NY', '201 E 87th Street 3E, New York, NY');

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('799 CARRIGAN AVE, OVIEDO, FL', '799 Carrigan Avenue, Oviedo, FL');

INSERT INTO Addresses (LINE1, LINE2
VALUES ('4014 CADDIE DRIVE, ACWORTH, GA', '10617 WEYBRIDGE DR, TAMPA, FL');

 

Here the LEVENSHTEIN function can be used to calculate a ratio of difference and determine which addresses are equal.

 


SELECT
	Line1,
	Line2,
	LEVENSHTEIN(UPPER(line1),UPPER(line2)) as distance,
	LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) AS ratio
FROM Addresses

 

#

line1

line2

distance

ratio

1

9128 LEEWARD CIR, INDIANAPOLIS, IN

9128 Leeward Circle, Indianapolis, IN

3

0.08…

2

101 OCEAN LANE DRIVE, KEY BISCAYNE, FL

101 Ocean Lane Drive Unit 1010, Key Biscayne, FL

10

0.20…

3

9301 EVERGREEN DRIVE, PARMA, OH

9301 Evergreen Dr, Parma, OH

3

0.09…

4

1817 BERTRAND DR, LAFAYETTE, LA

2924 Polo Ridge Ct, Charlotte, NC

23

0.69…

5

201 E 87TH ST, NEW YORK, NY

201 E 87th Street 3E, New York, NY

7

0.20…

6

799 CARRIGAN AVE, OVIEDO, FL

799 Carrigan Avenue, Oviedo, FL

3

0.09…

7

4014 CADDIE DRIVE, ACWORTH, GA

10617 WEYBRIDGE DR, TAMPA, FL

22

0.73…

 

Finally, a where clause that checks for a ratio of sameness less than 0.5 to identify the addresses that are the same can be used.


SELECT
	Line1,
	Line2,
	LEVENSHTEIN(UPPER(line1),UPPER(line2)) as distance,
	LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) AS ratio
FROM Addresses
WHERE LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) < .5

 

This code filters out the two where columns. 1 and 2 are not the same (too far apart, according to the algorithm). Above, are the rows with a ration higher than 0.5 (i.e., rows 4 and 7).

 

#

line1

line2

distance

ratio

1

9128 LEEWARD CIR, INDIANAPOLIS, IN

9128 Leeward Circle, Indianapolis, IN

3

0.08…

2

101 OCEAN LANE DRIVE, KEY BISCAYNE, FL

101 Ocean Lane Drive Unit 1010, Key Biscayne, FL

10

0.20…

3

9301 EVERGREEN DRIVE, PARMA, OH

9301 Evergreen Dr, Parma, OH

3

0.09…

5

201 E 87TH ST, NEW YORK, NY

201 E 87th Street 3E, New York, NY

7

0.20…

6

799 CARRIGAN AVE, OVIEDO, FL

799 Carrigan Avenue, Oviedo, FL

3

0.09…

 

See the example above in this SqlFiddle.

The utility of the Levenshtein distance algorithm is apparent in the above examples as a meaningful tool to discern similarities in data. 

If you’d like to learn more about our services at Encora, 

Contact Us

 

Key Takeaways 

  • When the LIKE operator in SQL is insufficient or lacks the ability to find strings that are similar but not exactly alike, the Levenshtein distance algorithm is useful.
  • The Levenshtein distance metric measures the difference between two strings. That is the minimum number of single-character edits that are required to change one string into another other.
  • The Levenshtein distance is useful when trying to identify a string like 931 Main St is the “same” as 931 Main Street. This is a common issue in systems that work with client information such as CRMs.
  • Calculating the Levenshtein distance and then transforming it into a ratio based on the length of the largest string can give the percentage of similarity between the two strings. This allows for the identification of duplicates.

 

About Encora

At Encora, we create competitive advantages for client partners through accelerated technology innovation. We would be delighted to take you further along your journey.

 

Why Encora 

  • We’re global: With 20+ offices and innovation labs in 12 countries, Encora is globally dispersed. Since we operate in different time zones, there is always someone ready to assist you.
  • We’re full-service: Technology innovation encompasses a huge array of topics, skills, and strategies. We offer a full-service concept, each component custom tailored as per your needs, to present a complete solution. 
  • We’re dedicated: Our ever-growing team of over 4,000 skilled and dedicated programmers, developers, engineers and strategic thinkers is the reason Encora has become an award-winning tech company with an enviable reputation.
  • We’re experienced: We partner primarily with fast-growing tech companies who are driving innovation and growth within their industries. Our unique delivery model, agile methodology, and consistent unmatched quality have contributed to our steady growth.

Insight Content

Share this Post

Featured Insights