We had a rather bad error recently and I'm thinking about how to avoid
it in the future. I'm sharing it and my thoughts here to see what
helpful input others might have. :)
We got a memory error in the File storage _finish method, which is
called to complete the second phase of two-phase commit. This was
when updating the tid cache (oid2tids), which was a dictionary that
had grown rather big. (We have many millions of objects in our
databases.) This occurred after the data had been written to disk.
There were a number of bad outcomes of this:
- The data were written to disk, but invalidations weren't sent to
clients. Because the file storage was still functional, subsequent
reads of these objects would return the data written. This meant the
clients' view of the database was inconsistent.
- The internal FileStorage meta data was partially updated. In
particular, the object index was updated, but the last transaction
- The FileStorage continued to function. Subsequent commits had the
same outcome, causing more damage. Fortunately, this damage was
limited by a ClientStorage bug (see below).
- When this error occurred, the client involved was unable to commit
additional transactions due to a ClientStorage bug. ClientStorage
tpc_finish doesn't handle server errors properly. It always considers
a transaction finished at the end of tpc_finish. As a result, it
ignored the subsequent tpc_abort call and never sent a tpc_abort call
to the server. Subsequent tpc_begin calls from the client were
rejected because of the outstanding transaction for the client.
Despite the fact that this limited the damage of the other errors,
this bug needs to be fixed.
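To make the ClientStorage failure mode concrete, here is a toy model of it. This is only a sketch: ToyServer and ToyClient are hypothetical stand-ins, not the real ClientStorage or storage server code, and the `buggy` flag exists purely to contrast the two behaviours.

```python
class ToyServer:
    """Hypothetical server that allows one outstanding transaction per client."""

    def __init__(self, fail_finish=False):
        self.outstanding = False
        self.fail_finish = fail_finish

    def tpc_begin(self):
        if self.outstanding:
            raise RuntimeError("transaction already outstanding for this client")
        self.outstanding = True

    def tpc_finish(self):
        if self.fail_finish:
            raise MemoryError("simulated server-side failure")
        self.outstanding = False

    def tpc_abort(self):
        self.outstanding = False


class ToyClient:
    """Hypothetical client; `buggy=True` mimics the behaviour described above."""

    def __init__(self, server, buggy):
        self.server = server
        self.buggy = buggy
        self.in_transaction = False

    def tpc_begin(self):
        self.server.tpc_begin()
        self.in_transaction = True

    def tpc_finish(self):
        try:
            self.server.tpc_finish()
        except Exception:
            if self.buggy:
                # Bug: the transaction is treated as finished even though
                # the server reported an error.
                self.in_transaction = False
            raise
        self.in_transaction = False

    def tpc_abort(self):
        if not self.in_transaction:
            return  # the client believes there is nothing to abort
        self.server.tpc_abort()
        self.in_transaction = False
```

With `buggy=True`, a failed tpc_finish followed by tpc_abort never reaches the server, so the next tpc_begin is rejected. With `buggy=False`, the abort is forwarded and the client can begin a new transaction.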
The database inconsistencies resulting from these failures have caused
us a fair bit of pain.
I'm taking a number of steps to avoid this failure in the future:
1. I've removed the tid cache and the save-index-after-many-writes
features because they were both likely sources of errors in _finish.
They were also both problematic in other ways. The tid cache consumed
too much memory and the code to save the index after many writes had a
flawed algorithm for deciding how often to write that caused it to
never provide any benefit. Both of these features have potential
benefits if done well some day.
2. We (ZC) are moving to 64-bit OSs. I've resisted this for a while
due to the extra memory overhead of 64-bit pointers in Python
programs, but I've finally (too late) come around to realizing that
the benefit far outweighs the cost. (In this case, the process was
around 900MB in size. It was probably trying to malloc a few hundred
MB. The malloc failed despite the fact that there was more than 2GB
of available process address space and system memory.)
3. I plan to add code to FileStorage's _finish that will, if there's
a. Log a critical message.
b. Try to roll back the disk commit.
c. Close the file storage, causing subsequent reads and writes to
4. I plan to fix the client storage bug.
I can see 3c being controversial. :) In particular, it means that your
application will be effectively down without human intervention.
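A minimal sketch of what steps 3a to 3c could look like, using a toy storage backed by an in-memory buffer. Everything here is an assumption for illustration: ToyStorage, StorageClosedError, and _update_metadata are hypothetical names, not FileStorage's API, and the sketch assumes that a closed storage refuses all further reads and writes.

```python
import io
import logging

logger = logging.getLogger("toy.storage")


class StorageClosedError(Exception):
    """Raised by any operation on a closed storage (hypothetical)."""


class ToyStorage:
    """Hypothetical stand-in for a file-backed storage."""

    def __init__(self):
        self._file = io.BytesIO()  # stands in for the data file
        self._index = {}           # oid -> position of the record
        self._closed = False

    def _update_metadata(self, oid, pos):
        # In-memory bookkeeping that runs after the data is written.
        self._index[oid] = pos

    def load(self, oid):
        if self._closed:
            raise StorageClosedError("storage is closed")
        self._file.seek(self._index[oid])
        size = int.from_bytes(self._file.read(4), "big")
        return self._file.read(size)

    def finish(self, oid, data):
        if self._closed:
            raise StorageClosedError("storage is closed")
        start = self._file.seek(0, io.SEEK_END)
        self._file.write(len(data).to_bytes(4, "big") + data)
        try:
            self._update_metadata(oid, start)
        except Exception:
            # (a) Log a critical message.
            logger.critical("failure after data was written", exc_info=True)
            # (b) Try to roll back the write by truncating the file.
            try:
                self._file.truncate(start)
            except Exception:
                logger.critical("could not roll back the write", exc_info=True)
            # (c) Close the storage so later operations are refused.
            self._closed = True
            raise
```

The rollback is itself wrapped in a try block because it can fail too; the storage is closed either way, so a partially updated storage never keeps serving requests.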
I considered some other ideas:
- Try to get FileStorage to repair its meta data. This is certainly
theoretically doable. For example, it could re-build its in-memory
index. At this point, that's the only thing in question. OTOH,
updating it is the only thing left to fail at this point. If updating
it fails, it seems likely that rebuilding it will fail as well.
- Have a storage server restart when a tpc_finish call fails. This
would work fine for FileStorage, but might be the wrong thing to do
for another storage. The server can't know.
OTOH, if there is a failure at a higher level, the server might
want to restart. In particular, if the call to tpc_finish on the
underlying storage has succeeded, but invalidations haven't been sent,
a storage server restart seems appropriate.
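The distinction between the two kinds of failure can be sketched as follows. This is an illustration only: finish_and_invalidate, RestartRequested, and the two callables are hypothetical names, not part of any existing storage server.

```python
class RestartRequested(Exception):
    """Hypothetical signal that the server process should be restarted."""


def finish_and_invalidate(storage_finish, send_invalidations):
    # A failure inside the storage's own finish step propagates unchanged:
    # the server cannot know what the right reaction is for that storage.
    tid = storage_finish()
    try:
        send_invalidations(tid)
    except Exception as exc:
        # The storage committed, but clients were not told about it,
        # so their caches may now be stale. Ask for a restart.
        raise RestartRequested(
            "commit succeeded but invalidations were not sent"
        ) from exc
    return tid
```

Only the second kind of failure is converted into a restart request; what actually restarts the server is left to whatever code catches RestartRequested.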
The good news is that after doing 1, I think the chance of a failure
in _finish is vastly reduced. I think that, in practice, the steps in
3, especially 3c, will never be necessary. Still, I think it's
prudent to take (tested) steps to handle even this unlikely case.
Comments are welcome.
ZODB-Dev mailing list - ZODB-Dev@zope.org