Fuzzy logic needed in name matching

michael · August 19, 2018, 6:46am

This question really has nothing to do with Panorama other than it’s what I’ll use to solve the problem. I have a database containing personal names, some pairs of which are identical matches, some that almost match and some that are nowhere near a match. The pairs are initially identified by the value in another field - each pair has a common value in field X but that commonality does not reliably predict a common identity.

I want to delete all bar one of each set of genuinely matching names. The vast majority are Spanish names which is vaguely relevant because of the way in which Spanish names (especially of married women) are constructed.

In the almost-matching category, I have for instance, these two:

Herrera Sequeira Vega
Sequeira Vega Herrera

or these:

Alicia Del Carmen Hernandez De Barakat
Alicia Del Carmen Hernandez Jimenez Barakat

so I need some fuzzy logic to measure their degree of commonality. My thoughts to date are:

(a) Measure the extent to which the leading characters (including spaces) match and look for an n% match where a suitably high value of n is as yet unknown (as is which of the two letter counts it would be calculated upon). That would score zero on the first example and 67% on the second.

(b) Remove the spaces, sort the characters and look for an n% match across all characters where a suitably high value of n is as yet unknown. That would score 100% on the first example and 82% on the second.

Any other ideas?

admin · August 19, 2018, 8:09am

Perhaps the arrayboth( and arraydifference( functions could be helpful, using spaces as the array separator?

gary · August 19, 2018, 5:38pm

Here is something I threw together to get a rough (fuzzy?) idea of the percent of elements that are common to both names.

local x,y,z,w
x="Alicia Del Carmen Hernandez De Barakat"
y="Alicia Del Carmen Hernandez Jimenez Barakat"
w=arraysize(arraystrip(arraymerge(x,y," "," ")," ")," ")
    // this gets the combined size of both name arrays
z=arraystrip(arraydifference(arraysort(upper(y)," "),arraysort(upper(x)," ")," ")+
    " "+arraydifference(arraysort(upper(x)," "),arraysort(upper(y)," ")," ")," ")
    // this gets an array of the elements that are different between the names
w=100-round((extract(z," ",-1)/w)*100,0.01)
    // this calculates the percentage of matching name elements
message w+"% matching elements."

If used as a formula as with a formulafill it would be written as:

100-round((extract(arraystrip(arraydifference(arraysort(upper(y)," "),
arraysort(upper(x)," ")," ")+" "+arraydifference(arraysort(upper(x)," "),
arraysort(upper(y)," ")," ")," ")," ",-1)/arraysize(arraystrip(arraymerge(
x,y," "," ")," ")," "))*100,0.01)

…where x is the first name and y is the second name to match. Even though the extract( function is now listed as obsolete, It was the simplest solution to determine if the arraydifference( function returned nothing instead of an empty one element array.

gary · August 20, 2018, 12:29am

I wanted to make note that I used two combined arraydifference( functions with the firstarray and secondarray parameters reversed in the second function. That allows me to end up with not only the elements that were in the first array but not in the second array but also those that were in the second array but not in the first array. Possibly a new statement and function called ArrayDifferenceBoth and arraydifferenceboth(.

In the above example:

x="Alicia Del Carmen Hernandez De Barakat"
y="Alicia Del Carmen Hernandez Jimenez Barakat"

The result for arraydifference(arraysort(upper(y)," "),arraysort(upper(x)," ")," ") would result in the element “De” while the result for arraydifference(arraysort(upper(x)," "),arraysort(upper(y)," ") results with the element “Jimenez”. So combining these two gives all the elements that are not common to both arrays.

michael · August 20, 2018, 12:50am

The obvious first step (a variant on Jim’s approach), which came to me at 0200 this morning, is to sort the names as a space-delimited array - quite a few of them will then be perfect matches. Problem with that is that the two (sometimes three) names come from a variety of data entry sources and I will have say, Rebeca in one name and Rebecca in the other or Hernandez and Hernadnez.

The collection of first letters might provide a good first-level test - it would almost always rule out non-matching pairs.

Then, maybe Gary’s idea - I haven’t actually got my head around that one yet but it certainly looks promising. Thank you to both of you.

All of this leads neatly into the next problem which is: What is the most efficient way to compare each little group of two or three records within a 50,000-record database and take whatever action is needed? I’ll put that up as a separate post rather than leave it buried in here because it’s a generic problem whose solution might be of use to other users.

michael · August 20, 2018, 6:30am

I worked out how to do it, adapting code that Dave Thompson wrote for me a little over seven years ago.