Hi All,
One of my customers has a large (21GB) production zodb which they back
up onto a contingency server using repozo and rsync. The process is
roughly as follows:
1. pack the production database to 3 days once a day.
2. create a full backup with repozo and rsync this to the contingency
storage server once day.
3. on the contingency server, use repozo to turn the backup back into a
zodb and run fstest to make sure it's sane.
4. every 2 hrs throughout the day, create an incremental backup with
repozo and rsync the increments to the contingency storage server.
5. once the increments have arrived on the contingency storage server,
use repozo there to turn the full backup and the increments back into
a zodb and run fstest to make it
The process has worked fine for several months now. However, yesterday,
after repozo had combined the full backup with two increments, fstest.py
spat out the following:
Data.fs.restore
time-stamp reduction at 3102046:
0x03689aa5aa40e444 = 0x03689aea40d19233
First thought was that the times on the production storage server had
screwed up, but a lot of digging has failed to show any evidence of
this, so it looks like the above error has occurred without the usual
time slip ups.
One of the guys here applied the attached patch to fstest.py to get more
info on what was happening. I've attached a section of the output of
running this over the database.
The interesting thing is that it looks like the transactions where the
time appears to go backwards are duplicates of earlier transactions:
position in file tid time from tid
310253762330x03689abb582f1311 2006-10-03 04:43:20.668098
310253765080x03689abdbbe5f000 2006-10-03 04:45:44.038639
310253767830x03689abddbe6be55 2006-10-03 04:45:51.539377
...lots of transactions...
310256469130x03689abb582f1311 2006-10-03 04:43:20.668098
310256471880x03689abdbbe5f000 2006-10-03 04:45:44.038639
310256474630x03689abddbe6be55 2006-10-03 04:45:51.539377
Would this seem to be an accurate reading of the attached log?
If so:
1. Could repozo have a bug that resulted in this?
2. If repozo has no bug, should it have checking that makes sure it
doesn't build insane .fs files, or is fstest the way to go for that?
3. If repozo is not to blame, what could be?
cheers,
Chris
--
Simplistix - Content Management, Zope Python Consulting
- http://www.simplistix.co.uk
--- fstest.py.original 2006-08-29 11:33:50.0 +0100
+++ fstest.py2006-10-04 13:57:37.0 +0100
@@ -1,2 +1,2 @@
##
#
@@ -40,3 +40,4 @@
import string
import struct
import sys
+from ZODB.utils import p64, readable_tid_repr as rtr
class FormatError(ValueError):
There is a problem with the format of the FileStorage.
+def __repr__(self):
+return self.msg
class Status:
checkpoint = 'c'
@@ -80,1 +83,1 @@
return l
def check(path):
file = open(path, 'rb')
file.seek(0, 2)
file_size = file.tell()
if file_size == 0:
-raise FormatError(empty file)
+print FormatError(empty file)
file.seek(0)
if file.read(4) != packed_version:
-raise FormatError(invalid file header)
+print FormatError(invalid file header)
pos = 4L
tid = '\000' * 8 # lowest possible tid to start
i = 0
while pos:
_pos = pos
pos, tid = check_trec(path, file, pos, tid, file_size)
if tid is not None:
-chatter(%10d: transaction tid %s #%d \n %
-(_pos, hexify(tid), i))
+chatter(%10d: transaction tid %s tid_as_date %s #%d \n %
+(_pos, hexify(tid), rtr(p64(long(hexify(tid), 16))), i))
i = i + 1
+if i 100:
+break
def check_trec(path, file, pos, ltid, file_size):
@@ -114,4 +123,4 @@
if not h:
return None, None
if len(h) != TREC_HDR_LEN:
-raise FormatError(%s truncated at %s % (path, pos))
+print FormatError(%s truncated at %s % (path, pos))
tid, stl, status, ul, dl, el = struct.unpack(8s8scHHH, h)
tmeta_len = TREC_HDR_LEN + ul + dl + el
if tid = ltid:
-raise FormatError(%s time-stamp reduction at %s: %s = %s %
+print FormatError(%s time-stamp reduction at %s: %s = %s %
(path, pos, hexify(tid), hexify(ltid)))
ltid = tid
tl = U64(stl) # transaction record length - 8
if pos + tl + 8 file_size:
-raise FormatError(%s truncated possibly because of
+print FormatError(%s truncated possibly because of
damaged records at %s % (path, pos))
if status == Status.checkpoint:
-raise FormatError(%s checkpoint flag was not cleared at %s
+print FormatError(%s checkpoint flag was not cleared at %s