Thanks for the pointers, Vishal; I hadn't seen those. They look like
they could be related, but without knowing how metadata updates are
grouped into transactions, it's hard for me to say. I would expect the
cversion update to happen within the same transaction as the creation of
a new child, but if they get written to the log in two separate steps,
perhaps these issues could explain it.
Any estimate on when 3.3.3 will be released? I haven't seen any updates
on the user list about it.
Thanks,
Jeremy
On 03/01/2011 12:40 PM, Vishal Kher wrote:
Hi Jeremy,
One of the main reasons for 3.3.3 release was to include fixes for znode
inconsistency bugs.
Have you taken a look at https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
https://issues.apache.org/jira/browse/ZOOKEEPER-919?
The problem that you are seeing sounds similar to the ones reported.
-Vishal
On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling <[email protected]> wrote:
Hi all,
A while back I noticed that my ZooKeeper cluster got into a state where I
would get a "node exists" error back when creating a sequential znode -- see
the thread starting at
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E for
more details. The summary is that at the time, my application had a bug
that could have been improperly bringing new nodes into a cluster.
However, I've seen this a couple more times since fixing that original bug.
I don't yet know how to reproduce it, but I am going to keep trying. In
one case, we restarted a node (in a one-node cluster), and when it came back
up we could no longer create sequential nodes under a certain parent node;
each attempt failed with a "node exists" (-110) error code. The biggest child
it saw on restart was /zkrsm/000000000000002d_record0000120804 (i.e., a
sequence number of 120804); however, a stat on the parent node revealed that
the cversion was only 120710:
[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
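
To make the mismatch concrete, here is a rough sketch of the check I am
doing by hand. It assumes (as I understand the 3.3.x server to behave) that
the name of a new sequential znode is the requested prefix followed by the
parent's cversion, zero-padded to 10 digits. The helper names below are
made up for illustration and are not part of any ZooKeeper API.

```python
import re

def sequence_of(child_name):
    """Return the trailing 10-digit sequence number of a child name, or None."""
    match = re.search(r"(\d{10})$", child_name)
    return int(match.group(1)) if match else None

def next_sequential_name(prefix, parent_cversion):
    """Name the server would pick next, ASSUMING suffix = '%010d' % cversion."""
    return "%s%010d" % (prefix, parent_cversion)

def is_counter_behind(children, parent_cversion):
    """True if some existing child already carries a sequence number at or
    above the parent's cversion, so a future create could collide with it."""
    sequences = [s for s in map(sequence_of, children) if s is not None]
    return bool(sequences) and max(sequences) >= parent_cversion

# Values taken from the stat output and child name above.
children = ["000000000000002d_record0000120804"]
cversion = 120710

print(sequence_of(children[0]))
print(next_sequential_name("000000000000002d_record", cversion))
print(is_counter_behind(children, cversion))
```

If that assumption holds, the next create under /zkrsm would ask for a
suffix of 0000120710, and any surviving child with the same prefix and that
suffix would produce the "node exists" error.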
So my question is: how is znode metadata persisted with respect to the
actual znodes? Is it possible for a node's children to be synced to disk
before the node's own metadata, so that if the server crashes at a bad time,
the metadata updates are lost? If so, is there any way to constrain ZooKeeper
so that it will sync its metadata before returning success for write
operations?
(I'm using ZooKeeper 3.3.2 on a Debian Squeeze 64-bit box, with
openjdk-6-jre 6b18-1.8.3-2.)
I'd be happy to create a JIRA for this if that seems useful, but without a
way to reproduce it I'm not sure that it is.
Thanks,
Jeremy