znode metadata consistency

Jeremy Stribling Mon, 28 Feb 2011 17:04:38 -0800

Hi all,

A while back I noticed that my Zookeeper cluster got into a state where I would get a "node exists" error back when creating a sequential znode -- see the thread starting at http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E for more details. The summary is that at the time, my application had a bug that could have been improperly bringing new nodes into a cluster.

However, I've seen this a couple more times since fixing that original bug. I don't yet know how to reproduce it, but I am going to keep trying. In one case, we restarted a node (in a one-node cluster), and when it came back up we could no longer create sequential nodes on a certain parent node; the create failed with a node exists (-110) error code. The biggest child it saw on restart was /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of 120804); however, a stat on the parent node revealed that the cversion was only 120710:


[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
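
To make the mismatch concrete, here is a minimal sketch of how a stale cversion could produce a "node exists" error. It assumes that the 3.3.x server builds a sequential znode's name by appending the parent's cversion, zero-padded to 10 digits, which matches the suffix on the child above. The class and method names are hypothetical and this is not ZooKeeper's code; the set of surviving children is also made up for illustration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the ZooKeeper source.
class SequentialNameSketch {

    // Assumption: the suffix is the parent's cversion, zero-padded to 10 digits.
    static String sequentialName(String prefix, int parentCversion) {
        return prefix + String.format(Locale.ROOT, "%010d", parentCversion);
    }

    // A create would fail with "node exists" if the generated name is already a child.
    static boolean wouldCollide(Set<String> existingChildren, String prefix, int parentCversion) {
        return existingChildren.contains(sequentialName(prefix, parentCversion));
    }

    public static void main(String[] args) {
        String prefix = "/zkrsm/000000000000002d_record";

        // Made-up state: children up to sequence 120804 survived the restart.
        Set<String> children = new TreeSet<>();
        for (int seq = 120700; seq <= 120804; seq++) {
            children.add(sequentialName(prefix, seq));
        }

        int staleCversion = 120710; // value reported by stat on the parent

        // The name the next create would get under the stale cversion.
        System.out.println(sequentialName(prefix, staleCversion));
        // true: that name is already taken, so the create is rejected.
        System.out.println(wouldCollide(children, prefix, staleCversion));
        // false: a cversion past the biggest child would not collide.
        System.out.println(wouldCollide(children, prefix, 120805));
    }
}
```

Under that assumption, any parent whose cversion lags behind its largest existing sequence number would keep generating names that are already in use, until the cversion catches up.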

So my question is: how is znode metadata persisted with respect to the actual znodes? Is it possible that a node's children will get synced to disk before its own metadata, and if it crashes at a bad time, the metadata updates will be lost? If so, is there any way to constrain Zookeeper so that it will sync its metadata before returning success for write operations?

(I'm using Zookeeper 3.3.2 on a Debian Squeeze 64-bit box, with openjdk-6-jre 6b18-1.8.3-2.)

I'd be happy to create a JIRA for this if that seems useful, but without a way to reproduce it I'm not sure that it is.


Thanks,

Jeremy
