Hi all,

A while back I noticed that my ZooKeeper cluster got into a state where creating a sequential znode returned a "node exists" error -- see the thread starting at http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E for more details. The summary is that, at the time, my application had a bug that could have been improperly bringing new nodes into a cluster.

However, I've seen this a couple more times since fixing that original bug. I don't yet know how to reproduce it, but I am going to keep trying. In one case, we restarted a node (in a one-node cluster), and when it came back up we could no longer create sequential nodes under a certain parent node: every create failed with a "node exists" (-110) error code. The highest-numbered child it saw on restart was /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of 120804); however, a stat on the parent node revealed that the cversion was only 120710:

[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
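
To make the mismatch concrete, here is a minimal Python sketch of the check I'm describing. It assumes (my understanding of 3.3.x, not something I've verified against every code path) that the server builds a sequential name by appending the parent's cversion as a zero-padded 10-digit number. The helper names are hypothetical and nothing here talks to a real server; it only works on the child name and cversion quoted above.

```python
import re

# Sequential znodes end in a zero-padded 10-digit decimal counter.
SEQ_SUFFIX = re.compile(r"(\d{10})$")


def sequence_of(child_name):
    """Return the trailing 10-digit sequence number, or None if absent."""
    match = SEQ_SUFFIX.search(child_name)
    return int(match.group(1)) if match else None


def next_sequential_name(prefix, cversion):
    """Assumed 3.3.x behaviour: suffix is the parent's cversion, zero-padded."""
    return "%s%010d" % (prefix, cversion)


def find_collision_risk(children, cversion):
    """Return (max_seq, at_risk).

    at_risk is True when some existing child already carries a sequence
    number at or above the parent's cversion, so a future create could
    generate a name that is already taken.
    """
    seqs = [s for s in map(sequence_of, children) if s is not None]
    max_seq = max(seqs) if seqs else None
    return max_seq, (max_seq is not None and max_seq >= cversion)


# Values taken from the restart described above.
children = ["000000000000002d_record0000120804"]
max_seq, at_risk = find_collision_risk(children, 120710)
print(max_seq, at_risk)  # prints: 120804 True
```

Under that assumption, the next create would get suffix 0000120710, which is 94 behind the highest child already present, so the same prefix can collide with a node that already exists.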

So my question is: how is znode metadata persisted with respect to the actual znodes? Is it possible for a node's children to be synced to disk before the node's own metadata, so that if the server crashes at a bad time, the metadata updates are lost? If so, is there any way to constrain ZooKeeper so that it syncs the metadata before returning success for write operations?

(I'm using ZooKeeper 3.3.2 on a Debian Squeeze 64-bit box, with openjdk-6-jre 6b18-1.8.3-2.)

I'd be happy to create a JIRA for this if that seems useful, but without a way to reproduce it I'm not sure that it is.

Thanks,

Jeremy
