[Sidnei da Silva]
>> Every now and then I face a corruption of the persistent zeo cache, but
>> this is the first time I get this variant.
What other variants do you see?
>> The cause is very likely to be a forced shutdown of the box this zope
>> instance was running on, but I thought it would be nice to report the
>> traceback anyway.
Yes it is! Thank you. It would be better to open a bug report ;-).
>> Here's the traceback::
>>
>>   File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py", line 314, in __init__
>>   File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 112, in
>>     self.fc.scan(self.install)
>>   File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 835, in scan
>>     install(self.f, ent)
>>   File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 121, in
>>     o = Object.fromFile(f, ent.key, skip_data=True)
>>   File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 630, in
>>     raise ValueError("corrupted record, oid")
>> ValueError: corrupted record, oid
>> I have a copy of the zeo cache file if anyone is interested.
Attaching a compressed copy to the bug report would be best (if it's too big
for that, or it's proprietary, let me know how to get it and I'll put it on
an internal ZC machine). Can't tell in advance whether that will reveal
something useful, though (see below).
>> What is bad about this problem is that it prevented Zope from starting
>> and there is no obvious clue that removing the persistent zeo cache
>> would cure it, though that's what anyone who has a clue about what he's
>> doing would do *wink*.
> It sounds like there should be logic in that code to abandon the cache if
> a problem is found, much as we abandon file-storage index files if
> anything seems suspicious.
That's an excellent idea, and should be doable with finite effort.
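A sketch of what that fallback could look like (the `open_cache` and `scan` names here are hypothetical, not ZEO's actual API): if scanning the persistent cache file raises, abandon the file and start with an empty cache instead of refusing to start.

```python
import os

def open_cache(path, scan):
    """Open a persistent cache file, discarding it if it looks corrupt.

    `scan` is a hypothetical callable that reads and validates the file,
    raising ValueError on corruption (as ZEO's cache scan does).
    """
    if os.path.exists(path):
        try:
            return scan(path)
        except ValueError:
            # Corrupt cache: abandon it rather than refuse to start,
            # much as suspicious file-storage index files are abandoned.
            os.remove(path)
    # No (usable) cache file: fall through to a fresh, empty cache.
    return None
```

The point of the design is that a persistent cache is only an optimization, so losing it should cost a warm-up period, never a failed startup.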
> It seems as though persistent caches haven't been a very successful
> feature. Perhaps we should abandon them.
They do seem to be implicated in more than their share of problems, both
before and after MVCC.
The post-MVCC ZEO persistent cache _intends_ to call flush() after each file
change. If it's missing one of those, and depending on what "forced
shutdown" means exactly, that could be a systematic cause for corruption.
It doesn't call fsync() unless it's explicitly closed cleanly, but it's
unclear what good fsync() actually does across platforms when flush() is
called routinely and the power stays on.
Those were intended to be reliability improvements over the pre-MVCC file
cache (which never called flush() or fsync()).
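As a minimal illustration of the flush()/fsync() distinction above (plain Python, not ZEO code): flush() moves data from the process's own buffer into OS buffers, so it survives the process dying; only fsync() asks the kernel to push the bytes to the platter, which is what matters when the power goes out.

```python
import os
import tempfile

# Create a scratch file to write a fake cache record into.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "wb")
f.write(b"cached record")
f.flush()               # visible to other processes; survives "kill -9"
os.fsync(f.fileno())    # survives power loss (modulo disk write caches)

# A second handle sees the flushed data even before f is closed.
with open(path, "rb") as g:
    data = g.read()
f.close()
os.remove(path)
```

Whether fsync() buys anything beyond that in practice depends on the platform and on whether the disk honors the request, which is exactly the uncertainty noted above.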
"kill -9" can do damage regardless, though.
Alas, if the cause is one of those, it's doubtful that analyzing the
corrupted file could prove it.
It's generally true that our file formats aren't designed to detect
corruption (e.g., we don't include checksums of any sort), so it's not clear
how well we can detect corruption either. The only post-MVCC ZEO file cache
gimmick in this direction is storing 8 redundant bytes per record (the oid
for the record is stored near the start of the record, and again at the
end). That's a lot better than nothing, and a mismatch in these redundant
oids is precisely what caused Sidnei's traceback.
The only other kinds of corruption routinely detected are weak checks, such
as hitting EOF when trying to read a record's version or data.
There is one other strategy, but it only kicks in if you're not running
Python with -O (it should probably be run unconditionally at startup, and
changed to stop using "assert" statements). That strategy checks that the
info read from the cache file is mutually consistent in various ways;
Sidnei's run didn't get that far.
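The reason -O matters: Python compiles out "assert" statements under -O, so consistency checks written that way silently disappear in optimized runs. A sketch of the suggested change (hypothetical function, not the actual cache code):

```python
def check_consistent(entries):
    """Verify that cache entries are mutually consistent.

    Written as an assert, this check would vanish under "python -O":
        assert len(set(k for k, _ in entries)) == len(entries)
    An explicit raise runs unconditionally, regardless of -O.
    """
    keys = [key for key, _ in entries]
    if len(set(keys)) != len(keys):
        raise ValueError("cache file is internally inconsistent")
```

Running the check unconditionally at startup costs one pass over the index but turns silent corruption into a detectable (and recoverable) error.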
For more information about ZODB, see the ZODB Wiki:
ZODB-Dev mailing list - ZODB-Dev@zope.org