Re: [ZODB-Dev] Persistent ZEO Cache corruption?

2006-01-12 Thread Jim Fulton

Sidnei da Silva wrote:

Every now and then I face a corruption of the persistent zeo cache,
but this is the first time I get this variant.

The cause is very likely to be a forced shutdown of the box this zope
instance was running on, but I thought it would be nice to report the
issue.

Here's the traceback::

File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py", line 314, in __init__
  self._cache.open()
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 112, in open
  self.fc.scan(self.install)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 835, in scan
  install(self.f, ent)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 121, in install
  o = Object.fromFile(f, ent.key, skip_data=True)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 630, in fromFile
  raise ValueError("corrupted record", oid)
ValueError: ('corrupted record', oid)

I have a copy of the zeo cache file if anyone is interested.

What is bad about this problem is that it prevented Zope from starting
and there is no obvious clue that removing the persistent zeo cache
would cure it, though that's what anyone that has a clue about what
he's doing would do *wink*.


It sounds like there should be logic in that code to abandon the
cache if a problem is found, much as we abandon file-storage index files
if anything seems suspicious.
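A sketch of that abandon-on-corruption logic might look like the following (all names here are hypothetical illustrations, not the actual ClientStorage code; `open_fn` stands in for whatever routine scans and loads the cache file):

```python
import logging
import os

logger = logging.getLogger("ZEO.cache")

def open_cache(path, open_fn):
    """Try to open a persistent cache file; on any sign of corruption,
    discard the file and start over with a fresh, empty cache, the way
    file-storage index files are abandoned when they look suspicious.
    """
    try:
        return open_fn(path)
    except (ValueError, EOFError, OSError):
        logger.exception("Persistent cache %s looks corrupt; discarding it",
                         path)
        try:
            os.remove(path)  # abandon the damaged file
        except OSError:
            pass             # already gone, or unremovable; proceed anyway
        return open_fn(path)  # open_fn creates a fresh cache if none exists
```

With something like this in place, a corrupted cache would cost a cold start rather than preventing Zope from starting at all.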

It seems as though persistent caches haven't been a very successful feature.
Perhaps we should abandon them.

Jim

--
Jim Fulton   mailto:[EMAIL PROTECTED]   Python Powered!
CTO  (540) 361-1714http://www.python.org
Zope Corporation http://www.zope.com   http://www.zope.org
___
For more information about ZODB, see the ZODB Wiki:
http://www.zope.org/Wikis/ZODB/

ZODB-Dev mailing list  -  ZODB-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zodb-dev


RE: [ZODB-Dev] Persistent ZEO Cache corruption?

2006-01-12 Thread Tim Peters
[Sidnei da Silva]
 Every now and then I face a corruption of the persistent zeo cache, but
 this is the first time I get this variant.

What other variants do you see?

 The cause is very likely to be a forced shutdown of the box this zope
 instance was running on, but I thought it would be nice to report the
 issue.

Yes it is!  Thank you.  It would be better to open a bug report ;-).

 Here's the traceback::

 File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py", line 314, in __init__
   self._cache.open()
 File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 112, in open
   self.fc.scan(self.install)
 File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 835, in scan
   install(self.f, ent)
 File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 121, in install
   o = Object.fromFile(f, ent.key, skip_data=True)
 File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 630, in fromFile
   raise ValueError("corrupted record", oid)
 ValueError: ('corrupted record', oid)

 I have a copy of the zeo cache file if anyone is interested.

Attaching a compressed copy to the bug report would be best (if it's too big
for that, or it's proprietary, let me know how to get it and I'll put it on
an internal ZC machine).  Can't tell in advance whether that will reveal
something useful, though (see below).

 What is bad about this problem is that it prevented Zope from starting
 and there is no obvious clue that removing the persistent zeo cache
 would cure it, though that's what anyone that has a clue about what he's
 doing would do *wink*.

[Jim Fulton]
 It sounds like there should be logic in that code to abandon the cache if
 a problem is found, much as we abandon file-storage index files if
 anything seems suspicious.

That's an excellent idea, and should be doable with finite effort.

 It seems as though persistent caches haven't been a very successful
 feature. Perhaps we should abandon them.

They do seem to be implicated in more than their share of problems, both
before and after MVCC.

The post-MVCC ZEO persistent cache _intends_ to call flush() after each file
change.  If it's missing one of those, and depending on what "forced
shutdown" means exactly, that could be a systematic cause for corruption.
It doesn't call fsync() unless it's explicitly closed cleanly, but it's
unclear what good fsync() actually does across platforms when flush() is
called routinely and the power stays on.
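The flush()/fsync() distinction can be sketched with a hypothetical helper (the real cache.py logic differs): flush() only pushes Python's userspace buffer into the OS page cache, so a killed process can't lose the data, while fsync() additionally asks the OS to push it to the platter, which is what matters when the power goes out.

```python
import os

def append_record(f, data, durable=False):
    """Append one cache record.  flush() guards against process death
    (e.g. kill -9 after the write); only an explicit fsync() guards
    against power loss, at a significant speed cost.
    (Illustrative helper, not the actual ZEO cache code.)
    """
    f.write(data)
    f.flush()                 # userspace buffer -> OS page cache
    if durable:
        os.fsync(f.fileno())  # OS page cache -> disk
```

Calling this with `durable=True` on every change would be the belt-and-suspenders variant; whether fsync() buys anything in practice varies by platform and disk, as noted above.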

Those were intended to be reliability improvements over the pre-MVCC file
cache (which never called flush() or fsync()).

kill -9 can do damage regardless, though.

Alas, if the cause is one of those, it's doubtful that analyzing the
corrupted file could prove it.

It's generally true that our file formats aren't designed to detect
corruption (e.g., we don't include checksums of any sort), so it's not clear
how well we can detect corruption either.  The only post-MVCC ZEO file cache
gimmick in this direction is storing 8 redundant bytes per record (the oid
for the record is stored near the start of the record, and again at the
end).  That's a lot better than nothing, and a mismatch in these redundant
oids is precisely what caused Sidnei's traceback.
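That redundant-oid check could be sketched as follows (a minimal illustration with an invented record layout, not ZEO's actual on-disk format): the oid is stored at both ends of the record, and a mismatch raises the same "corrupted record" ValueError that appears in Sidnei's traceback.

```python
def check_record(buf, oid):
    """Validate a hypothetical cache record laid out as:
    8-byte oid, payload, then the same 8-byte oid repeated at the end.
    A mismatch in either copy means the record was damaged.
    """
    head, tail = buf[:8], buf[-8:]
    if head != oid or tail != oid:
        raise ValueError("corrupted record", oid)
    return buf[8:-8]  # the payload between the two oid copies
```

A torn or misaligned write is very likely to clobber at least one of the two copies, which is why this cheap check catches so much despite not being a real checksum.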

The only other kinds of corruption routinely detected are weak checks:  we
hit EOF when trying to read a record's version, or when trying to read a
record's pickle.

There is one other strategy, but it only kicks in if you're not running
Python with -O (but should probably be run unconditionally at startup, and
changed to stop using assert statements):

    if __debug__:
        self._verify_filemap()

That checks that the info read from the cache file is mutually consistent in
various ways; Sidnei's run didn't get that far.
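Running that consistency check unconditionally, with explicit raises instead of `assert` statements (which `python -O` strips out), might look roughly like this (a hypothetical stand-in for _verify_filemap; the real checks are richer):

```python
def verify_filemap(filemap, file_size):
    """Check that cache-file entries tile the file with no gaps or
    overlaps.  Raising ValueError rather than using `assert` keeps
    the check alive even under `python -O`.
    """
    pos = 0
    for offset, length in sorted(filemap.items()):
        if offset != pos:
            raise ValueError("gap or overlap at offset", pos)
        pos = offset + length
    if pos != file_size:
        raise ValueError("file size mismatch", pos, file_size)
```

Combined with the abandon-the-cache-on-error idea, a failure here at startup would simply cost a cold cache instead of an unstartable Zope.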



RE: [ZODB-Dev] Persistent ZEO Cache corruption?

2006-01-12 Thread Tim Peters
[Tim Peters]
 ...
 It doesn't call fsync() unless it's explicitly closed cleanly, but it's
 unclear what good fsync() actually does across platforms when flush() is
 called routinely and the power stays on.

LOL.  Immediately upon sending that msg, there was a power outage here,
which left behind corrupted files in at least 3 (non-ZODB) applications I
had running at the time.  One of them offered to recover from its last
auto-backup, and the other two went insane.  It would certainly be nice if
persistent ZEO cache files got out of the "went insane" category :-(.



Re: [ZODB-Dev] Persistent ZEO Cache corruption?

2006-01-12 Thread Dennis Allison

I have seen similar problems bringing up a system which has been
brought down with the usual

zeoctl stop
zopectl start 

mantra.  The zope/zeo connection is via a network connection.
Next time it happens I'll file a bug report.


On Thu, 12 Jan 2006, Sidnei da Silva wrote:

 On Thu, Jan 12, 2006 at 10:17:54AM -0500, Tim Peters wrote:
 | [Sidnei da Silva]
 |  Every now and then I face a corruption of the persistent zeo cache, but
 |  this is the first time I get this variant.
 | 
 | What other variants do you see?
 
 Can't remember right now, it was quite some time ago and involved
 making changes to one zeo client while the other one was down using
 'zopectl debug'. I've seen it about 6 times in different environments, so
 it should be reproducible.
 
 |  The cause is very likely to be a forced shutdown of the box this zope
 |  instance was running on, but I thought it would be nice to report the
 |  issue.
 | 
 | Yes it is!  Thank you.  It would be better to open a bug report ;-).
 
 Sure will.
 
 |  Here's the traceback::
 | 
 |  File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py", line 314, in __init__
 |self._cache.open()
 |  File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 112, in open
 |self.fc.scan(self.install)
 |  File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 835, in scan
 |install(self.f, ent)
 |  File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 121, in install
 |o = Object.fromFile(f, ent.key, skip_data=True)
 |  File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line 630, in fromFile
 |raise ValueError("corrupted record", oid)
 |  ValueError: ('corrupted record', oid)
 | 
 |  I have a copy of the zeo cache file if anyone is interested.
 | 
 | Attaching a compressed copy to the bug report would be best (if it's too big
 | for that, or it's proprietary, let me know how to get it and I'll put it on
 | an internal ZC machine).  Can't tell in advance whether that will reveal
 | something useful, though (see below).
 
 I don't think there's anything sensitive in there, maybe my blog
 password in the worst case *wink*. Here are the files (zeo1-1.zec is
 probably the one you're after):
 
 http://awkly.org/files/zeo-cache.tar.bz2
 
 |  It seems as though persistent caches haven't been a very successful
 |  feature. Perhaps we should abandon them.
 | 
 | They do seem to be implicated in more than their share of problems, both
 | before and after MVCC.
 | 
 | The post-MVCC ZEO persistent cache _intends_ to call flush() after each file
 | change.  If it's missing one of those, and depending on what "forced
 | shutdown" means exactly, that could be a systematic cause for corruption.
 | It doesn't call fsync() unless it's explicitly closed cleanly, but it's
 | unclear what good fsync() actually does across platforms when flush() is
 | called routinely and the power stays on.
 
 Oh, I really meant to say "accidental shutdown". Though I wasn't around
 when the box restarted, it looks like it was a power failure.
 
 
