Removing Duplicate Records

gaseous1 · November 15, 2017, 6:27am

I’m having problems removing duplicate records from a database. The DB has about 220,000 records and many records are duplicates. There may be up to 10 or more “identical” records of which I would like to keep 1.

When using selectduplicates in a procedure (after sorting appropriately), I get consistent but incorrect results. I’m expecting about 40,000 - 50,000 unique records, but only end up with several hundred (based on a “select reverse” of the found set from the selectduplicate function).

Any idea what is going on? What, if any, are the limitations of selectduplicates?

admin · November 15, 2017, 7:05am

With the limited information you have provided, I can’t say whether you are using it incorrectly or if there is a bug. When you say “many records are duplicates”, what does that mean? Is there a particular field that contains the duplicate values, or multiple fields together? Please tell us more about your data, and provide the actual code you are using.

gaseous1 · November 15, 2017, 4:23pm

Sorry to have explained this poorly.

I’ve created a file using only the invoice identifiers/numbers in the original data set (this is the information I am using to determine whether a record is duplicate or not) and no other data. I’d be happy to provide that file (about 6 Mb). It exhibits the same behavior.

Thus, we can focus on only the invoice identifiers. This column contains multiple identical records, with a total of nearly 220,000 records. Of that, there are nearly 20,000 unique records (I evaluated the data outside of Panorama). When using selectduplicates in a procedure, Panorama only identifies less than 400 unique records.

The procedure I am using to select the duplicates follows:

field «INVOICE_ID»
sortup

selectduplicates “”

I have a subset of the data in a Panorama file which I can send you. It has the procedure code in it and, on my system demonstrates the behavior I am trying to relate.

dave · November 15, 2017, 4:52pm

If I am reading this correctly, if you have 100 copies of a particular record, you want to count that as 1 unique record. Select duplicates, followed by selectreverse, isn’t going to include any record that has a duplicate, so it would be counting those 100 records as 0 unique records.

To get one copy of each ID, you can do this,

field «INVOICE_ID»
sortup
unpropagate
select «» ≠ ""

gaseous1 · November 15, 2017, 11:03pm

Dave, Thank you for your response. It works well.

One of the advantages of selectduplicates is that it can use data from multiple fields to check for “duplicates.” And yet, it seems to not select the correct data based on the description provided in the help file. There are workarounds, but I’m hoping that this can be fixed, if indeed it is broken.

Eric

dave · November 16, 2017, 2:42am

I don’t think it’s broken.

If this is what you wanted to do, selectduplicates simply wasn’t the command for the job. Let’s say one of your fields has these values, which are already sorted.

A
A
B
C
C
C
D
D
E
F

SelectDuplicates is going to select

A
A
C
C
C
D
D

If you follow that with selectreverse, the selection will be
B
E
F

There were 2 A’s. Each is a duplicate of the other, and both are selected. Likewise, all 3 C’s were selected, and both D’s. When you selectreverse, none of those records are selected anymore, only those without duplicates. There is nothing in the documentation that says it will skip one and select the rest.

gaseous1 · November 16, 2017, 3:15am

Dave,

You clarified this very nicely with your description. Obviously, I was too deep in the forest to see the big picture and confused to boot.

Regards,

Eric

cmcp · November 16, 2017, 7:59pm

Another method is to groupup; then go to outlinelevel 1 and remove unselected, leaving just one instance of each field value.

gaseous1 · November 29, 2017, 6:51pm

Thank you!