Integrity Checks: Server corruption

CooperT · September 2, 2021, 4:50pm

Yesterday, for the second time, a user has reported repeated crashing with shared database BP. When I opened BP, I immediately received the alert reporting corrupt data. (This same thing happened six weeks ago.) The crashing began when running a procedure that transfers data from a standalone DB to BP. (There was some uncertainty on the user’s part of what she did immediately before the first in the series of crashes.). Admittedly, I need more detailed information about the steps preceding the first crash and subsequent crashes. (According to the Help Page on Integrity Checks, the DB might have some corrupt data from some unknown amount of time before crashing starts, so that makes it difficult to consider when the corruption occurred.)

Based on what I saw, it appears that the data became corrupted on her computer and was saved to the server and then transferred to me. If I could use the Integrity check feature to avoid that, I would certainly do it. Any ideas on how to avoid spreading the corruption via the server? (I see that at least one other person (panaca on 4/8) has reported what was believe to be corrupted server data; but that was before integrity checks.)

Just to complicate matters, I obtained a copy of her DB and dbinfo(“datacorruption”) is empty, meaning it found no data structure corruption.

The user has Time Machine running, and we restored a copy of the DB from that, but since the server was corrupted, we had to detach and discard the server copy (is that true?) Then I was unsure of how to identify the latest uncorrupt copy from Time Machine. We used the one the the end of the previous day and that worked to get a stable uncorrupted copy, but a significant amount of work was lost as a result. Any ideas about how to identify the latest uncorrupt copy of the DB?

The corruption message gives the record and cell number of the corrupt data. Is there anything to be done with that information? Is the record number the same as the serverrecordid? Could it just be deleted?

Finally, if corruption is a rare event caused by factors unrelated to PanX procedures, why would two cases of corruption occur within six weeks on the same computer performing similar operations? The coincidence would be far less likely: if the chance of this external corruption is x percent, then the probability of two them is x squared if they are independent events. The alternative hypothesis is that something has run amok during a PanX procedure leading to the corruption. This hypothesis does not conclude that the problem lies in Panorama; it could just as well lie in hardware or Apple code.

admin · September 2, 2021, 7:30pm

If integrity checks are enabled, the structure of the data in memory is checked each time the data is saved to disk, before the data is saved to disk. If corruption in memory is found, Panorama will not save the data to memory. The intent is that it’s better to lose the most recent changes rather than permanently corrupting the data on disk. So for example if your RAM loses a bit, that would never get permanently saved.

On the flip side, Panorama also checks the structure of the data each time a database is loaded from disk. If it finds corruption, it refuses to open the database. Here the intent is that if for example your disk is starting to fail, Panorama will immediately detect that so that you don’t start working with already corrupted data.

The corruption check only happens when the data is saved to disk. So if you went five minutes between saves, for example, the problem could have occurred anywhere in that 5 minute period.

before crashing starts

If the corruption causes a crash, by definition that means the data wasn’t saved. If the corruption was in memory only, then you would think that when Panorama was relaunched, all would be well again since the copy on disk wouldn’t be touched. You wouldn’t expect the problem to repeat.

I assume this user has the initials LC, they had a dozen crashes yesterday in a 75 minute period. Of those, they had 6 crashes when sorting over a 40 minute period. Certainly corrupted data could cause the sort operation to crash. In fact sort is a bit of a stress test since it will access every record in the database. But again, if data became corrupted in memory which caused sort to crash, I would expect that to repeat since the corruption would all be flushed away with the crash.

I’m wondering about your procedure that “transfers data from a standalone DB”. How is that transfer being accomplished? Perhaps there is corruption in the standalone database that is getting transferred, or perhaps somehow the transfer process is causing corruption? That shouldn’t be the case, but obviously something unexpected is happening here.

That is actually exactly how it is supposed to work. The system is designed to prevent corrupted data in memory from ever being saved to disk.’

The corruption message gives the record and cell number of the corrupt data.

Ok, so when and where did this corruption message appear (since we know it did not appear on your computer when you tried to open a copy of this database)?

Is the record number the same as the serverrecordid?

No, it isn’t. It’s simply the count of records from the top of the database, the first record is 1, the second record is 2, etc. I think if it said there was corruption in record 384, you could find that record with this code:

selectall
find seq()=384

At that point, you might be able to delete the record, depending on the nature of the corruption. It’s also possible that the find statement might cause Panorama to crash. Also, note that the corruption scan stops when it first finds corruption. So there’s no assurance that other records past that point are all valid, in other words, it’s possible that there could be additional corrupted records.

Serious question – how did you determine the server copy was corrupted? If you open a database on your local computer you’ll get an alert, but obviously there is no such alert on the server. If I did it correctly then the server should report the error back to the client which should tell you about it, but I am not confident that this actually happens. I’ve made a note to investigate this further.

But also, normally even if the server did somehow ingest some corrupted data into memory, it should never save that corrupted data to disk if the integrity check feature hasn’t been disabled for that database, which I highly doubt you would have done (and actually, if you had, you wouldn’t have seen the corrupted data error message).

No, since according to my understanding there should not be any corrupted copies of the database saved to disk. Preventing saved copies of corrupted data is the major goal of the database integrity feature. “What gets corrupted in memory stays in memory.”

I’m afraid I don’t have a ready answer to wrap this up for you, but I’ve made some notes of areas for me to investigate.

CooperT · September 2, 2021, 8:12pm

I see that I did not explain this clearly in the first message. When I learned that the user was experiencing crashes, I opened the DB on my computer, and immediately received the corrupt data message. I assumed that corrupt data had just been transferred to me from the server.

CooperT · September 2, 2021, 8:45pm

The standalone database is used for data entry. It has an identical field structure. The code, called from the standalone DB (BPBillEntry), which transfers the data from the standalone DB to the main shared DB, is

opensecret “BP”
setactivedatabase “BP”
importdatabase “BPBillEntry.pandb”,“MatchingFields”,“Yes” //This statement transfers the data //[–Import Data–]
console “file imported into BP, then continued.”
setactivedatabase “BPBillEntry”

The data in BPBillEntry is then deleted, and there are some additional changes in BP.

admin · September 3, 2021, 5:20am

I think that may not be correct. The sequence of operation is:

Panorama loads the local copy of the database from your local drive into RAM.
Panorama checks the integrity of your data.
Panorama synchronizes with the server.

Note that the integrity check happens before the synchronization. There is no integrity check after the synchronization (perhaps there should be, but that is a separate issue, and there is a kind of integrity check on all data transferred to or from the server). If the error occurs when you open the file, that means that the disk file got corrupted. That would seem to mean that some other program modified the file on the disk, because Panorama does an integrity check immediately before saving. If the check fails, Panorama refuses to write the data.

When I learned that the user was experiencing crashes

A crash could possibly be caused by corrupt data in memory, but since the program has crashed that corrupt data would not ever be saved to the disk. The next time the database is opened, a fresh uncorrupted version of the data is brought into memory.

I can understand why you would be suspicious, but with all the facts in hand I think it is a coincidence that LC was experiencing crashes while you had corrupted data. Especially since her database did not contain any corruption, I don’t think any corruption was transferred from her machine to the server and on to other computers. Since you didn’t mention it, I assume that no other users reported corrupted data?

That said, if I ever see a corrupted data message when opening a file, I will freak out, because I would assume that meant that there was something wrong with the filesystem on my disk. That used to be not such an uncommon thing, but I don’t think I’ve seen that happen in 10 or 15 years, maybe 20. For example it could mean that the filesystem thinks a single block on disk belongs to two or more files, which should never happen. Basically that error means that between the time when Panorama saved the file and when it opened it, something changed the contents of the file. By definition Panorama didn’t have the file open, so it didn’t make the change, so what did? Either the filesystem is damaged or the drive is not working. Either way, it’s a major problem. Most programs don’t check, so you might never know. So now that Panorama does check it could be a “canary in the coal mine” for this type of problem.

which transfers the data from the standalone DB to the main shared DB

The importdatabase statement does not check the integrity before transferring data. Since you are then deleting the data, possibly before saving it, it might never be reported in the standalone database. It doesn’t sound like this actually happened in this case, but I have already made a note to add integrity checking to the importdatabase statement so that this scenario becomes impossible.

CooperT · September 4, 2021, 2:21pm

I am trying to formulate a plan for what to do if this issue arises again. In studying the Database Integrity Checks help page, Jim wrote “It’s probably a good idea to carefully check the exported data for damage – if the database structure is damaged, there is a high probability that some of the data itself is also damaged.”

What can someone do to check the exported data for damage? In a very small database, one could scan the data, but the database that was a problem for me has over 70,000 rows, so scanning would be difficult. Perhaps I can use the information in the Corrupted Data message, giving a row and cell with a problem.

admin · September 4, 2021, 5:20pm

I think visually scanning is the only technique available.

Each database record starts and ends with the length of the record. When scanning for corruption, Panorama checks that these lengths match. It also checks that it can scan from top to bottom and that the sum of the lengths for each record adds up to the total size of each record.

Within a record each cell also starts with the length of the cell. So Panorama also checks to make sure that the sum of the length of the cells in a record matches up to the total length of the record. If a record is 125 bytes long but a cell in the record is 5000 bytes long, that’s a sure sign of corruption.

Note that all of this checking only involves the lengths. Panorama can check the lengths because there is redundant information, the lengths have to add up. There is no such check possible for the data itself. If a letter “a” changes to the letter “b”, Panorama can’t check that – it would have to be done by a human that knows what the data should look like. For example if somehow the name “Dan” got changed to “Dbn”, Panorama can’t detect that.

All is not lost though. In addition to checking for corruption, Panorama also saves a cryptographic hash each time the database is saved. When the file is re-opened, it generates the hash again from the newly loaded data, and checks it against the hash that was saved. If the hashes don’t match, then something changed. In that case you won’t get an error message about a particular record or cell, because all Panorama knows is that something is different, not the location. But this check will detect changes in the data itself, not just in the lengths, so it will know that “Dbn” is wrong (assuming the data wasn’t actually typed that way).

In general, the best plan if one of these errors occurs would be to go back to a backup. As you say, with 70,000 records there is no way to reliably scan the file manually. The point of the damage checks is to detect damage as soon as possible, so that you can STOP right there and hopefully have a good backup that is only minutes old at that point. Before, Panorama would silently continue and you might not realize that there was damage for hours, days, weeks or months, at which point it might be far too late to find a good backup.

In this case I think you could have just downloaded a fresh copy from the server. I’ve reviewed LCs crash logs, and I don’t think corruption came from her computer.

RAmeeti · September 4, 2021, 6:34pm

The nuance of what does ‘corruption’ mean varies the answer to this important question. Over the years I’ve determined the location of ‘corruption’ in many ways but what seemed to be the quickest way when I received errors what to do a sort of every field. Inevitably, that would most often fail on the field and record where the corrupted data lay. I would typically find extremely odd characters is a field. Removing that record was often the next challenge as moving to that record would cause Panorama to Unexpectedly Quit so it was always a matter of removing a record without touching that record. This would typcially be accomplished by using the ability to remove (delete) unselected records.

Recently though my ‘corruption’ with Panorama X has been to find odd data getting written to my db via some sort of improper write on Panorama’s back end. I’ve had a single field in a single record be filled with a sequence of numbers that I later found to be the ServerRecordID for the then selected records. This was certainly nothing that I or my procedure could have created.
I’ve also found fields within a record filled with data from another portion of that record. ie Data from fields 17-20 dropped into fields 1-4 when the only thing that my procedure was doing was reading data from field 5 for use with some type of reporting. While I was not writing any data, data was getting written. Fortunately in my case, the fields were always text fields so that while the database was not ‘corrupted’, the data was. If those fields that had gotten written to were numeric fields and text got written to them, then we would have db corruption as Panorama does not like text in numeric fields. Those types of corruption are akin to the first type of corruption mentioned earlier.

I do not think any of Panorama’s current checks for ‘corruption’ or its Data Integrity checks will find any problem with my recent finding of data getting written to text fields as only I know that a first name field which should contain something like “Tom” should not have “12,13,16,23” in it. Fortunately these errors have most often been immediately obvious. Unfortunately not always.

CooperT · September 5, 2021, 2:22pm

Unfortunately, I didn’t think of this possibility on the two occasions I had a corruption message. I will certainly try this first. I asked the user to call me a sooner and not go through dozens of crashes. Overall, I need to be more careful in collecting info around this type of event. Hopefully, next time, if there is one, I will do a better job given all that I have learned through these exchanges.