Simple Record Merge

Squirrel · February 3, 2024, 12:11pm

I imported various versions of the same database across time and need to merge (not delete) duplicate records. In some cases there are more fields that exist in a record than another.

Unpropagate does not seem to do this as it deletes duplicates. The merge in the menu only wishes to merge two columns and not two records. I spent quite a bit of time reading other people’s merge topics and none of the solutions seem to do a simple merge of two or more similar but not completely duplicate records.

I have used sort and select duplicates but other than cutting and pasting thousands of records by hand, I can’t seem to find an appropriate solution.

JamesCook · February 3, 2024, 1:54pm

That makes it appear that they’re not duplicate records but apparently have some identifier that may occur in multiple records.

If that’s the case, I’m guessing too that your need to merge them is because the “same” record from one past database to another has different information in it. That’s a tough one to handle especially if a lot of records are involved. There is no automated way for Panorama to figure out how to combine them for you.

You might start by grouping on whatever field has the identifier. Then in fields that appear to have no differences, run a Maximum. Even on text fields that will give each summary record a value from within the group. Fields that have obvious differences within any one group will require manual intervention to place what you decide is the desired result into the summary record.

Once you’ve got all fields handled, use Analyze > Remove Detail to delete the original records and retain just the summaries. The last record will need to be deleted too since it was a grand summary of all the rest. That leaves you with a new set of data records bearing the result of your effort. It’s the closest thing I can think of to merge your records.

designer · February 3, 2024, 6:25pm

With two or more records having somewhat the same data and/or fields, the challenge is the specify an “algorithm” defining the criteria for change - which data gets kept and which gets overwritten. I suppose it starts with defining how you determine the records are “the same”.

If you are trying to do this “in place” it can be more complicated than employing a third database. Here’s a skeleton …
Open Database A (DBA) and Database B (DBB),
Select the records in DBA and DBB you want to combine.
Loop down DBA, finding the appropriate match in DBB
Create a third record (in a variable or clipboard) of the fields and data from the DBA and DBB records. I’ll assume there is only a 1 to 1 match but its not much more difficult to add more “matching” records in the build
Add that built record to a third (new) database
Mark the records from DBA and DBB (put a character you can select on in a spare field) so you know they’ve been processed.
When done matching records in DBA with DBB, Select the processed records and delete them

At this point, there are no other records in DBA with matches in DBB. You have updated records in the third database and you have the records that were updated removed from DBA and DBB. Combine them as you will.

Doing this in a loop can be way slow. combining records with a relational join turns hours of processing time into seconds. Yes, seconds! But - that assumes valid data in all fields. You have two records with the same field and one of them has data you want to keep rather than the other - that makes it difficult. Because Panorama (or any other software merge app), needs to know your criteria for making the swap.

Without knowing your criteria I can’t be more explicit.

Squirrel · February 3, 2024, 9:05pm

All columns are the same across all datasets. Some have data missing in some columns but have the same data in others. There are 2 or more duplicates based on name and proper name (this is a locations database). I sorted to identify duplicates using the fields name, proper_name, latitude, longitude. I was able to deduplicate many records in this manner using another tool but I have other data that exists in one field but not in the other of these records. Duplication now is being established by the same name and proper_name

Squirrel · February 3, 2024, 9:15pm

Thanks for explaining this to me. This seems like a good solution but perhaps too complicated for my task?

Eventually this data needs to be exported as a csv or Excel file. I don’t need this to be relational. The end result is to run it through outside visualizations and mapping.

I have deduplicated entries based on name, proper_name, latitude & longitude (it’s a locations db). However, some of the other 20-40 fields (depending on which dataset I use as a base) are empty in some records and filled in others.

The merge criteria I am looking for says “choose the entry with the most data and fill in the blanks with other designated duplicates based on name (or more than one criteria like name and proper_name)”.

I hope this explains my situation a bit more clearly.

Squirrel · February 3, 2024, 10:00pm

I may be using the wrong vocabulary - instead of merge, I think join is the right way to go here.

designer · February 4, 2024, 12:29am

Yes - join the party. I do understand your criteria - hold the record that has the most fields filled in - then see if any remaining empty fields in that record have content in their matching records. If so, fill the empty field with the content from the matching record.

When I used a join, it was simply to fill a empty field in DBA with content from the field in matching records in DBB. Looping through 20,000 records in DBA looking for the match in 20,000 records in DBB took HOURS. When I used the Join instead, it took the time of a slow eye blink. It was so much fun, I did it over and over a few times.

It was like when I finally learned to roll my kayak in the pool. Once my body (muscle memory) realized how easy it was, I would just roll and roll. The instructor said the class was over and as I paddled to the edge of the pool - Oops over I’d go again - just for the fun of it.

Panorama is like that.

admin · February 5, 2024, 2:11am

There’s not any tool in Panorama that will do this out of the box. I doubt if any other database software exists that would do this either.

You could develop your own custom code using Panorama’s programming language to do this. It won’t be trivial to do though. You’ll need to come up with an algorithm for determining which entry has the most data for each record. Depending on the source of the original data there may be many special cases that have to be handled. Whether this is worth the effort depends on how many records there are, how many different special cases there are, and how comfortable you feel with programming.

Squirrel · February 10, 2024, 6:12am

I love this exuberant description!

Squirrel · February 10, 2024, 6:14am

I was able to solve this by using another data program, exporting and re-importing back to PanoramaX. If I had the time to learn a programming language (I am not a coder) I would gladly spend time on this as I run into this issue much more frequently now.