Recipe to remove duplicates

tomkerswill · September 19, 2022, 4:46pm

Hi

I’ve used the Select Duplicates menu option to see that I have some duplicate records. I’d like to delete all but one of each. It looks like this isn’t possible by default.

I can see from the forum that it’s been discussed before; and there are hints that an easy way would be to sequence by group number; but I can’t quite figure out the best way to do this:

Would it be possible to have some pointers?

designer · September 19, 2022, 5:29pm

Long, long ago, before there was dirt, there was UnPropagate. In order for Select Duplicates to work, the records had to be sorted on the field you were checking for duplicates.

As I’m remembering it - you didn’t even need SelectDuplicates. You’d sort up on the field to cluster duplicate records together. Then you’d unpropagate which would empty that field for any records that had duplicate content. Then you’d select on that field not equal to empty, and finally RemoveUnselected. NOTE - this ASSUMES legitimate records would always have content in that field.

So six statements:
SelectAll
Field YourField
SortUp
Unpropagate
Select (YourField <> “”)
RemoveUnselected

Select Duplicates, on the other hand, selects those duplicate records so you can manually review them and choose which ones to delete. For example, if you have another field that contains a “date added”, you’d want to keep the most recent and you could see which one it was. Of course, in that case, you could also sortwithin on that entry date field after you’ve sorted on YourField - putting the most recent record on top of its cluster.

tomkerswill · September 19, 2022, 6:07pm

Thanks @designer! That works perfectly… And somehow, I managed to miss that there’s a specific help page about duplicate removal (I think I possibly searched for “duplicates” rather than duplicate):

There’s more info about dealing with existing empty fields; but for me, this wasn’t necessary as I had complete duplictes (even with primary key duplicated), so it was dead easy to do your steps above.

Thanks once again,

Tom

gary · September 19, 2022, 7:48pm

Here is some code I was playing with that will remove duplicate records that have the contents of all the fields matching exactly. I have included an undo capability as well as using nohow to speed things along.

startdatabasechange "ALLRECORDS","Remove Duplicates"
let startCount=info("records")
sortup
firstrecord
noshow
    loop
        let x=exportline()
        downrecord
        loop
            if exportline()=x
                deleterecord
                stoploopif info("eof")
            endif 
            stoploopif exportline()≠x
        endloop
        stoploopif info("eof")
    endloop
    firstrecord
    showpage
endnoshow
nsnotify str(startCount-info("records"))+
    pluralpattern(startCount-info("records")," record~ have been removed!")

By replacing the occurrences of exportline( with a formula you could narrow the search for duplicates to only specific fields (i.e. exportcell(«FirstName»)+tab()+exportcell(«LastName»)). Seems to work in my limited testing but you might want to try it on a copy even though there is the undo feature.

tomkerswill · September 19, 2022, 8:28pm

Ah, that’s excellent - will be really great in my use-case, where I accidentally ran code that re-imported identical records.

I’ve also fixed my original code to use startdatabasechange - as if I’d done that before the error happened, that would have been an easy undo for me!

Tom

admin · September 19, 2022, 9:28pm

This code is not going to reliably remove all duplicate records. It will only remove duplicate records if they are adjacent. The single sortup statement at the top is not sufficient to make sure that all duplicate records are adjacent to each other (and what field is that supposed to sort? it’s not specified).

If you want to remove duplicates based on ALL fields being duplicate, that could be done using the findbelow statement with exportline(). It would be slow-ish, but would work 100%. I’ve got a lot on my plate so I’ll leave this as a hint without actually writing the code. Note - with this technique you wouldn’t need to sort at all.

designer · September 19, 2022, 9:49pm

But one thing you might want to add is a SelectAll at the beginning. I added that to my original post.

Oh, I see, I guess that “ALLRECORDS” handles it.

admin · September 20, 2022, 12:33am

You mean in the startdatabasechange statement? No, that doesn’t cause all records to be selected, and yes, adding selectall is also a good idea. The ALLRECORDS parameter indicates what the Undo will apply to, in this case to changes to all records.