We had a rather bad error recently and I'm thinking about how to avoid it in the future. I'm sharing it and my thoughts here to see what helpful input others might have. :)
We got a memory error in the FileStorage _finish method, which is called to complete the second phase of two-phase commit. The error occurred while updating the tid cache (oid2tids), a dictionary that had grown rather big. (We have many millions of objects in our databases.) It occurred after the data had been written to disk.

There were a number of bad outcomes:

- The data were written to disk, but invalidations weren't sent to clients. Because the file storage was still functional, subsequent reads of these objects returned the data written. This meant the clients' view of the database was inconsistent.

- The internal FileStorage metadata was partially updated. In particular, the object index was updated, but the last transaction wasn't.

- The FileStorage continued to function. Subsequent commits had the same outcome, causing more damage. Fortunately, this damage was limited by a ClientStorage bug (see below).

- When this error occurred, the client involved was unable to commit additional transactions, due to a ClientStorage bug: ClientStorage tpc_finish doesn't handle server errors properly. It always considers a transaction finished at the end of tpc_finish. As a result, it ignored the subsequent tpc_abort call and never sent a tpc_abort call to the server. Subsequent tpc_begin calls from the client were then rejected because of the outstanding transaction for that client. Although this bug limited the damage from the other errors, it needs to be fixed.

The database inconsistencies resulting from these failures have caused us a fair bit of pain. I'm taking a number of steps to avoid this failure in the future:

1. I've removed the tid cache and the save-index-after-many-writes features, because they were both likely sources of errors in _finish. They were also both problematic in other ways: the tid cache consumed too much memory, and the code to save the index after many writes had a flawed algorithm for deciding how often to write, which caused it to never provide any benefit. Both of these features have potential benefits if done well some day.

2. We (ZC) are moving to 64-bit OSs. I've resisted this for a while due to the extra memory overhead of 64-bit pointers in Python programs, but I've finally (too late) come around to realizing that the benefit far outweighs the cost. (In this case, the process was around 900MB in size. It was probably trying to malloc a few hundred MB. The malloc failed despite the fact that there was more than 2GB of available process address space and system memory.)

3. I plan to add code to FileStorage's _finish that will, if there's an error:

   a. Log a critical message.

   b. Try to roll back the disk commit.

   c. Close the file storage, causing subsequent reads and writes to fail.

   (A sketch of what I have in mind is below.)

4. I plan to fix the ClientStorage bug. (Also sketched below.)
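For anyone who doesn't have the storage API paged in, a commit looks roughly like this from the storage's point of view (schematic; argument and locking details elided):

    # First phase: the data is written, but not yet visible.
    storage.tpc_begin(txn)
    storage.store(oid, serial, data, '', txn)   # one call per object
    storage.tpc_vote(txn)

    # Second phase: make the transaction permanent and visible.
    # For FileStorage, tpc_finish calls _finish, which updates the
    # in-memory index and last-transaction metadata.  The memory
    # error above happened in there, after the data had already
    # hit the disk.
    storage.tpc_finish(txn)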
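Here's the sort of guard I have in mind for 3. This is only a sketch: _update_index and _rollback_disk_commit are hypothetical stand-ins for the real metadata-update and rollback work, not actual FileStorage internals.

    import logging

    logger = logging.getLogger("FileStorage")

    def _finish(self, tid, u, d, e):
        try:
            # Hypothetical stand-in for the index/metadata updates.
            self._update_index(tid)
        except Exception:
            # 3a. Log a critical message, with the traceback.
            logger.critical(
                "tpc_finish failed after data was written to disk",
                exc_info=True)
            try:
                # 3b. Try to roll back the disk commit, e.g. by
                # truncating the file back to its pre-transaction
                # length.  (Hypothetical helper.)
                self._rollback_disk_commit()
            finally:
                # 3c. Close the storage so subsequent reads and
                # writes fail loudly instead of serving an
                # inconsistent view of the database.
                self.close()
            raise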
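And for 4, the shape of the ClientStorage fix, again as a sketch (names like _server and _update_cache are schematic, and locking details are omitted). The point is that the transaction must only be considered finished after the server call succeeds:

    def tpc_finish(self, txn):
        if txn is not self._transaction:
            return
        # If the server call raises, fall through *without* clearing
        # the local transaction state, so that the subsequent
        # tpc_abort is actually sent to the server.
        tid = self._server.tpc_finish(id(txn))

        # Only on success: update the cache and consider the
        # transaction finished.
        self._update_cache(tid)
        self._transaction = None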
I can see 3c being controversial. :) In particular, it means that your application will be effectively down without human intervention. I considered some other ideas:

- Try to get FileStorage to repair its metadata. This is certainly theoretically doable. For example, it could rebuild its in-memory index. At this point, that's the only thing in question. OTOH, updating the index is also the only thing left to fail at this point; if updating it fails, it seems likely that rebuilding it will fail as well.

- Have the storage server restart when a tpc_finish call fails. This would work fine for FileStorage, but might be the wrong thing to do for another storage. The server can't know.

OTOH, if there is a failure at a higher level, the server might want to restart. In particular, if the call to tpc_finish on the underlying storage has succeeded, but invalidations haven't been sent, a storage server restart seems appropriate.

The good news is that after doing 1, I think the chance of a failure in _finish is vastly reduced. I think that, in practice, the steps in 3, especially 3c, will never be necessary. Still, I think it's prudent to take (tested) steps to handle even this unlikely case.

Comments are welcome.

Jim

--
Jim Fulton
Zope Corporation