This question really has nothing to do with Panorama other than it’s what I’ll use to solve the problem. I have a database containing personal names in pairs: some pairs match exactly, some almost match and some are nowhere near a match. The pairs are initially identified by the value in another field - each pair has a common value in field X but that commonality does not reliably predict a common identity.
I want to delete all bar one of each set of genuinely matching names. The vast majority are Spanish names, which is vaguely relevant because of the way in which Spanish names (especially those of married women) are constructed.
In the almost-matching category I have, for instance, these two pairs:
Herrera Sequeira Vega
Sequeira Vega Herrera
Alicia Del Carmen Hernandez De Barakat
Alicia Del Carmen Hernandez Jimenez Barakat
so I need some fuzzy logic to measure their degree of commonality. My thoughts to date are:
(a) Measure the extent to which the leading characters (including spaces) match and look for an n% match, where a suitably high value of n is as yet unknown (as is which of the two names' character counts the percentage would be calculated on). That would score zero on the first example and 67% on the second.
(b) Remove the spaces, sort the characters and look for an n% match across all characters where a suitably high value of n is as yet unknown. That would score 100% on the first example and 82% on the second.
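To make (a) and (b) concrete, here is a rough sketch in Python rather than Panorama, purely as an illustration. The function names are made up, and two choices are assumptions on my part: both scores are divided by the longer of the two names, and (b) ignores upper/lower case. It is one possible reading of each idea, not a settled method.

```python
from collections import Counter


def prefix_score(a: str, b: str) -> float:
    """Idea (a): fraction of leading characters (spaces included) that match.

    Assumption: the count of matching leading characters is divided by the
    length of the longer name.
    """
    if not a or not b:
        return 0.0
    matched = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break  # stop at the first position where the names differ
        matched += 1
    return matched / max(len(a), len(b))


def letter_score(a: str, b: str) -> float:
    """Idea (b): drop the spaces, ignore order, compare the letters used.

    Counting each letter gives the same pairing as sorting both strings and
    matching off equal characters. Assumptions: case is ignored, and the
    number of shared letters is divided by the longer letter count.
    """
    count_a = Counter(a.replace(" ", "").lower())
    count_b = Counter(b.replace(" ", "").lower())
    longest = max(sum(count_a.values()), sum(count_b.values()))
    if longest == 0:
        return 0.0
    shared = sum((count_a & count_b).values())  # multiset intersection
    return shared / longest


if __name__ == "__main__":
    pairs = [
        ("Herrera Sequeira Vega", "Sequeira Vega Herrera"),
        ("Alicia Del Carmen Hernandez De Barakat",
         "Alicia Del Carmen Hernandez Jimenez Barakat"),
    ]
    for first, second in pairs:
        print(f"{prefix_score(first, second):.2f}",
              f"{letter_score(first, second):.2f}")
```

With these assumptions the first pair scores 0 on (a) and 1.0 on (b). The second pair comes out at roughly 0.65 and 0.84, close to but not the same as the figures above, because the result shifts with the choice of denominator and with exactly how the sorted letters are compared.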
Any other ideas?