Have you ever had a requirement to match similar strings? Or partial strings? Maybe matching strings like Steve, or Steven, or Stephan. The developer automatically starts thinking of the LIKE clause with their favorite database environment. The inner engineer automatically writes a query in their head that would be similar to:

[sql]SELECT * FROM table1 WHERE FirstName LIKE ‘Ste%'[/sql]

You would get the records you desire but you will also get other strings like Stewart, Stephanie, Stella, Sterling, etc. This might be your end goal to find all strings that are comparable.

There is also a technique commonly used within a database called SOUNDEX(). An example would be:

[sql]SELECT SOUNDEX(FirstName) FROM table1[/sql]

You might get results like C640, A365, T460, R543. This could result with beneficial information that an engineer could do something with it, but explaining what A365 means to a non-engineer might become cumbersome in its explanation. For a description of SOUNDEX(), you might look here: http://en.wikipedia.org/wiki/Soundex.

I was recently asked to figure out if there were any ‘like’ records that matched the search criteria (remember the automatic inner query from above). But you would have to shorten the text being searched or provide a more generic query like:

[sql]SELECT * FROM table1 WHERE FirstName LIKE ‘%Ste%'[/sql]

While this query will indeed get you the Steve or Steven you are looking for, it will not get you data that may have been fat-fingered on data entry, and the record actually has a stored value of Setven or Steev.

Within Oracle, there are some existing packages/functions to help with the problems of mistyped data. Oracle provides a package UTL_MATCH with 4 functions:
EDIT_DISTANCE, EDIT_DISTANCE_SIMILARITY, JARO_WINKLER, JARO_WINKLER_SIMILARITY.

EDIT_DISTANCE: Returns the number of changes required to turn the source string into the destination string.
EDIT_DISTANCE_SIMILARITY: Returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match.
JARO_WINKLER: Determines how closely the two strings agree with each other and tries to take account the possibility of a data error.
JARO_WINKLER_SIMILARITY: returns an integer between 0 and 100, where 0 indicates no similarity at all and 100 indicates a perfect match but tries to take into account of possible data entry errors.

Using the code below, you can create a simple example on how the above functions work:

[sql]with myTable as
(
select ‘Steve’ FirstName from dual
union all select ‘Steven’ FirstName from dual
union all select ‘Stephan’ FirstName from dual
union all select ‘Stewart’ FirstName from dual
union all select ‘Stephanie’ FirstName from dual
union all select ‘Stella’ FirstName from dual
union all select ‘Sterling’ FirstName from dual
union all select ‘Setven’ FirstName from dual
union all select ‘Steev’ FirstName from dual
)
select
FirstName, ‘Steve’ SearchString,
utl_match.edit_distance(FirstName, ‘Steve’) ED,
utl_match.edit_distance_similarity(FirstName, ‘Steve’) EDS,
utl_match.jaro_winkler(FirstName, ‘Steve’) JW,
utl_match.jaro_winkler_similarity(FirstName, ‘Steve’) JWS
from myTable[/sql]

Results:

FirstName SearchString ED EDS JW JWS
Steve Steve 0 100 1.0 100
Steven Steve 1 84 0.9666666666666667 96
Stephan Steve 4 43 0.7733333333333333 77
Stewart Steve 4 43 0.7733333333333333 77
Stephanie Steve 5 45 0.7511111111111111 75
Sterling Steve 5 38 0.7608333333333334 76
Setven Steve 3 50 0.8900000000000001 89
Steev Steve 2 60 0.9533333333333333 95

You also would be able to make business decisions on how accurate you want the selection of records by adding a where clause:

[sql]select
FirstName, ‘Steve’ SearchString,
utl_match.edit_distance(FirstName, ‘Steve’) EDIT_DISTANCE,
utl_match.edit_distance_similarity(FirstName, ‘Steve’) EDIT_DISTANCE_SIMILARITY,
utl_match.jaro_winkler(FirstName, ‘Steve’) JARO_WINKLER,
utl_match.jaro_winkler_similarity(FirstName, ‘Steve’) JARO_WINKLER_SIMILARITY
from myTable
where utl_match.EDIT_DISTANCE_SIMILARITY(FirstName, ‘Steve’) >= 50[/sql]

Results:

FirstName SearchString ED EDS JW JWS
Steve Steve 0 100 1.0 100
Steven Steve 1 84 0.9666666666666667 96
Setven Steve 3 50 0.8900000000000001 89
Steev Steve 2 60 0.9533333333333333 95

Maybe the like clause is useful when knowing your data is accurate, but the above alternative approach might be able to help find data that has been entered incorrectly.

Like this post? Share it!