Hi all,
A while back I noticed that my ZooKeeper cluster got into a state where
I would get a "node exists" error back when creating a sequential znode
-- see the thread starting at
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E
for more details. The summary is that at the time, my application had a
bug that could have been improperly bringing new nodes into a cluster.
However, I've seen this a couple more times since fixing that original
bug. I don't yet know how to reproduce it, but I am going to keep
trying. In one case, we restarted a node (in a one-node cluster), and
when it came back up we could no longer create sequential nodes under a
certain parent node; each create failed with a node exists (-110) error
code. The highest-numbered child it saw on restart was
/zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of
120804); however, a stat on the parent node showed that the cversion was
only 120710:
[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
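To make the mismatch concrete, here is a minimal sketch of how I understand
sequential names to be generated. It assumes the server appends the parent's
cversion to the requested path as a zero-padded 10-digit suffix. The function
names are hypothetical, and reusing the prefix of the highest child for the
next create is only an illustration; this is not ZooKeeper code.

```python
def next_sequential_name(prefix: str, parent_cversion: int) -> str:
    """Assumed naming rule: prefix plus the parent's cversion, zero-padded to 10 digits."""
    return f"{prefix}{parent_cversion:010d}"


def sequence_of(child_name: str) -> int:
    """Extract the trailing 10-digit sequence number from a child name."""
    return int(child_name[-10:])


# Values taken from the stat output and the child reported in this message.
parent_cversion = 120710
highest_child = "/zkrsm/000000000000002d_record0000120804"

# Hypothetical: assume the next create uses the same prefix as the highest child.
prefix = highest_child[:-10]
candidate = next_sequential_name(prefix, parent_cversion)

print(candidate)                                     # /zkrsm/000000000000002d_record0000120710
print(sequence_of(highest_child) - parent_cversion)  # 94
```

Under that assumption, the next create would be given sequence number 120710,
which is 94 behind the highest child already on disk, so it would collide with
any existing child that has the same prefix and sequence number.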
So my question is: how is znode metadata persisted with respect to the
actual znodes? Is it possible that a node's children will get synced to
disk before its own metadata, so that if the server crashes at a bad
time, the metadata updates are lost? If so, is there any way to
configure ZooKeeper so that it syncs its metadata before returning
success for write operations?
(I'm using ZooKeeper 3.3.2 on a Debian Squeeze 64-bit box, with
openjdk-6-jre 6b18-1.8.3-2.)
I'd be happy to create a JIRA for this if that seems useful, but without
a way to reproduce it I'm not sure that it is.
Thanks,
Jeremy