Reading the Sharing Logs

I’m trying to help out on the forensics of a large data loss on a shared file. I’ve found what appears to be a significant clue in the sharing logs but can’t find anything definitive in how to read the logs. (There is an explanation in Help of the Log fields for Web Serving)

My take is that the “ts:” represents the number of records in the database when opened followed by the number after synching. My description below is based on that interpretation.

The highlighted record in the screenshot below fits the time when the problem began. It opened and synched with very different numbers. Subsequently other databases opening with the previous 795 records were all being dropped. Sessions 55, 7, and 8 were all the same user. Sessions 54, 5, 9, and 10 were by another user who experienced crashes on every attempt to open the file after the big change. Session 3 was a routine, automated opening and synch by Panorama. It was not unexpected that Sesson 3 opened with far fewer records, but it should have bumped up to the 795. Instead, it crashed. All of those with the 446 crashed.

What conditions would allow a database that was clearly “off”, to open and synchronize in this manner?

A bit earlier, the last synch by the user in Session 55 was:

Session 50 *** synchronized database: Distribution (ts:521-736 changed:95 deleted:11)

The numbers fit right in. The 644-840 do not. How the 446 seems to have worked its way in is also puzzling.

I’m trying to help out on the forensics of a large data loss on a shared file. I’ve found what appears to be a significant clue in the sharing logs

Let me start by saying that the sharing logs weren’t designed to be of any use for this sort of forensic purpose. I’m not sure what your goal is, but if you are somehow thinking that you can recover lost data from information in the logs, that isn’t something I would spend any time on myself.

These logs were primarily of use to me in debugging this code five years or so ago. But even five years ago when this was all fresh in my mind, I would mostly look at one of these synchronized log entries immediately after synchronization, to see if it matched what seemed to be happening. I don’t think I could ever have looked at a historical series of log entries and reverse engineered what had happened.

can’t find anything definitive in how to read the logs.

I’m pretty sure there is no written information about this, and as I said above, these logs weren’t really intended to contain enough information for the kind of forensic deep dive you are envisioning.

That said, I do have a vague recollection that I might have discussed these log entries when doing the video classes. I think this would most likely be in the Panorama X Server Deep Dive Part 4 video.

My take is that the “ts:” represents the number of records in the database

These numbers have nothing to do with the number of records. “ts” is short for “time stamp”. Panorama X time stamps aren’t actually based on clock time, but they do keep track of the order of events. So an event that happened at ts = 7 definitely occurred after an event at ts = 6. Every time any modification is made to a shared database, Panorama increments the internal time stamp.

If you take a new database and just add records to it, no other modifications, then the time stamp will correspond to the number of records. But usually other modifications are also happening - editing cells, deleting records, etc. So the time stamp can and does increase even if the number of records doesn’t go up - or even if it goes down.

Also, the time stamp will reset when a critical new generation is done. If your log isn’t recording new generations, you won’t see that in the log, but it could appear that time stamps are out of order.

followed by the number after synching

There are two numbers after “ts:” separated by a dash. Looking at the source code, I believe the first number is the time stamp when this client last requested a sync, in other words the client is telling the server “I am already synchronized up to this point”.

The second number is the time stamp the server has for this database. So the server will supply all of the changes between the client’s time stamp and the server’s. After the synchronization both the client and the server will have the same time stamp.

The highlighted record in the screenshot below fits the time when the problem began.

Ignore the first number, it’s the second number that is interesting. It went from 795 to 840 to 446. This number should NEVER go down, only up. The only exception would be if there was a new sharing generation. I don’t know if your logs are set up to record new sharing generations. I think you have to have Database Management checked under logging options. Based on the logs I would think there would have to have been TWO new sharing generations, one between 2:39 PM yesterday and 8:30 AM today, and another between 8:34 AM and 9:23 AM.

It’s interesting that the session numbers go from 55 to 3. Perhaps this means that the server restarted? Though if so, what happened to sessions 1, 2, 4 and 6? Hmm - this log is from THIS MORNING, so perhaps someone remembers whether the server restarted. Though restarting the server should not cause a synchronization problem, unless someone manually fiddled with the server files. Also, hopefully someone remembers whether any new sharing generations were done in the past 24 hours.

What are the settings for Auto Save Timeout and Save Transaction journal for this server? These control whether or not the server data is protected if the server crashes. Though they would have to be really wacky to cause log entries shown, I don’t think it would even be possible. But I have to ask.

I wasn‘t looking into the logs with any expectation of data recovery. The purpose was to see if I could determine when an error occurred, what user was involved, and any other clues that could help prevent whatever from happening again.

In the past, the logs have provided a lot of useful information in that regard.

I did look for any New Generations, and the last was some days earlier.

There was a restart when the web interface didn’t appear to be working right. I wasn’t involved and had no chance to review the exact status of Panorama, Panorama Server, or Abyss. That’s why the session numbers were reset. I’m very sure the individual involved would not have done anything with the server files, but I will check. On one side of the restart or the other, the crashing of this file began.

Not every user opens every file so those other sessions were using different files than the one in question. I imported the logs into a database and selected only the entries involving the file that was crashing and missing a couple of days worth of entries.

Auto Save is at 1. Save Transaction Journal was unchecked. I took the liberty to check it just now since that may have avoided the loss - assuming I understand its function.

All that said, we still don’t know what caused the issue and the missing data.

I imported the logs into a database and selected only the entries involving the file

Aha, it hadn’t occurred to me that only a subset of log entries were shown, but it makes perfect sense.

when the web interface

I guess we have to consider the possibility that a web procedure caused this. I don’t know of any way for a web procedure to cause a problem like this, in fact if I wanted to write a web procedure to cause such a problem I would be hard pressed to do so (actually come to think of it I could, but only with undocumented features that are used internally). But there isn’t any real way to put complete guardrails around web procedure code - that is code that is running directly on the server. Obviously I have no idea what your code does or how it is written, so I can’t really provide any direct advice on this. But you probably have an idea if there is any web procedure code that does anything unusual. You might want to consider adding logging to any web procedure code that does something complicated. Then if this happens again, you would have more detailed information to review.

The database with the lost days of records is shared but has no interface with the web at all. A couple of small, unshared databases handle all web interactions and are entirely separate, other than being handled by the same Panorama Server. I can’t see any reason that web actions would be involved in the sharing error here unless there’s some strange bug at work.

I continue to see the logs as a very useful tool in following up on issues and errors, and many times in the past they have served that purpose very well.

Per your observations, the 795 to 840 to 446 noted during synchronizing show where the loss likely occurred. We still don’t know how or why. Apparently, Session 55 or 3 was where it occurred. But would it likely be the local copy, or the server copy that made the 466 become the “official” ts? What could allow it to go from 840 to 446? That seems to be the main issue.

One more question about what the logs show…The highlighted record in my screenshot shows 520 changes. Are those changes that the synchronization made to the local copy? Similarly, would deletions then represent the number of records deleted by the synch?