I have some ideas about where to look for this, and I'm happy to do it in the next day or two if no one else wants to take it. Please do open a JIRA even though you can't reproduce it consistently yet, just so we have somewhere to track the effort.
C

-----Original Message-----
From: Jeremy Stribling [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 5:11 PM
To: Vishal Kher
Cc: [email protected]
Subject: Re: znode metadata consistency

I haven't found a way to reproduce it consistently, but if I do I will
update a JIRA. Until then, I was mostly just hoping that someone on-list
could explain whether the metadata of a parent node is written to disk
in some consistent way at the same time that a new child node is added
under the parent. I would hope that those two operations get lumped
together in the same transaction, but I haven't yet been able to dig
into the code myself and confirm that.

On 04/12/2011 02:07 PM, Vishal Kher wrote:
> Hi Jeremy,
>
> It would be good to open a JIRA for this. In addition to this
> description, please explain the steps to reproduce this problem.
> Did this problem happen after changing membership? Can you also add to
> the JIRA how you perform the membership change?
>
> -Vishal
>
> On Tue, Apr 12, 2011 at 3:33 PM, Jeremy Stribling
> <[email protected]> wrote:
>
> Unfortunately I'm seeing this issue again in 3.3.3. A number of
> node restarts resulted in a ZNODEEXISTS error when creating a new
> sequential znode. I've got a 3-node setup running on Debian, and
> I've collected the full logs here:
> http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz [24 MB,
> 670 MB unzipped]. The logs are named as nodeX.<zxid_prefixes>.log,
> and each new log represents an application process restart.
>
> Here's the scenario:
>
> 1) There's a cluster with nodes 1, 2, 3 using zxid epoch 0x3.
> 2) All three nodes restart, forming a cluster with epoch 0x4.
> 3) Node 3 restarts, leading to a cluster with epoch 0x5.
>
> At this point, it seems like node 1 is the leader of the 0x5
> epoch. In its log (node1.0x4-0x5.log) you can see the first (of
> many) instances of the following message:
>
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
> org.apache.zookeeper.server.PrepRequestProcessor - Got user-level
> KeeperException when processing sessionid:0x512f466bd44e0002
> type:create cxid:0x4da376ab zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a Error
> Path:/zkrsm/00000000000000b2_record0001761440
> Error:KeeperErrorCode = NodeExists for
> /zkrsm/00000000000000b2_record0001761440
>
> This then repeats forever, as my application isn't expecting to
> ever get this error message on a sequential node create, and just
> continually retries. The message even transfers over to
> node3.0x5-0x6.log once the 0x6 epoch comes into play.
>
> I don't see anything terribly fishy in the transition between the
> epochs; the correct snapshots seem to be getting transferred, etc.
> Unfortunately I don't have a ZK snapshot/log that exhibits the
> problem when starting with a fresh system.
>
> Does anyone have any ideas? Maybe it's related to one of the
> outstanding election JIRAs (like ZOOKEEPER-975)? Thanks,
>
> Jeremy
>
> P.S. Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the ZooKeeper IDs of the nodes
> changed due to a bug in our application code.
> * We manage node membership dynamically, and our application
> restarts the ZooKeeperServer classes whenever a new node wants to
> join (without restarting the entire application process).
> This is why you'll see messages like the following in
> node1.0x4-0x5.log before a new election begins:
>
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
> org.apache.zookeeper.server.quorum.Learner - shutdown called
>
> On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
>> My very shaky understanding from skimming those issues was that
>> in some cases there were two threads handling different types of
>> data that are related to the same transaction -- but maybe that's
>> only true when there's a leader and a follower. But I also saw
>> something in there about restoring data from a snapshot vs.
>> restoring it from a log, which seems like it could have happened
>> in a single-node case.
>>
>> In any case, now that 3.3.3 is out I'll give it a try and report
>> back if we keep seeing this.
>>
>> Thanks!
>>
>> Jeremy
>>
>> On 03/01/2011 02:59 PM, Vishal Kher wrote:
>>> Hi Jeremy,
>>>
>>> I just realized that you are using a standalone ZK server. I
>>> don't think the bugs apply to you, so I don't have an answer to
>>> your question.
>>> I think 3.3.3 should be released soon:
>>>
>>> http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
>>>
>>> -Vishal
>>>
>>> On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling
>>> <[email protected]> wrote:
>>>
>>> Thanks for the pointers Vishal, I hadn't seen those. They
>>> look like they could be related, but without knowing how
>>> metadata updates are grouped into transactions, it's hard
>>> for me to say. I would expect the cversion update to happen
>>> within the same transaction as the creation of a new child,
>>> but if they get written to the log in two separate steps,
>>> perhaps these issues could explain it.
>>>
>>> Any estimate on when 3.3.3 will be released? I haven't seen
>>> any updates on the user list about it. Thanks,
>>>
>>> Jeremy
>>>
>>> On 03/01/2011 12:40 PM, Vishal Kher wrote:
>>>
>>> Hi Jeremy,
>>>
>>> One of the main reasons for the 3.3.3 release was to include
>>> fixes for znode inconsistency bugs.
>>> Have you taken a look at
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-919?
>>> The problem that you are seeing sounds similar to the ones
>>> reported.
>>>
>>> -Vishal
>>>
>>> On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling
>>> <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> A while back I noticed that my ZooKeeper cluster got into a
>>> state where I would get a "node exists" error back when
>>> creating a sequential znode -- see the thread starting at
>>> http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E
>>> for more details. The summary is that at the time, my
>>> application had a bug that could have been improperly
>>> bringing new nodes into a cluster.
>>>
>>> However, I've seen this a couple more times since fixing that
>>> original bug. I don't yet know how to reproduce it, but I am
>>> going to keep trying. In one case, we restarted a node (in a
>>> one-node cluster), and when it came back up we could no longer
>>> create sequential nodes on a certain parent node, with a
>>> node exists (-110) error code.
>>> The biggest child it saw on restart was
>>> /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number
>>> of 120804); however, a stat on the parent node revealed that the
>>> cversion was only 120710:
>>>
>>> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
>>> cZxid = 0x5
>>> ctime = Mon Jan 17 18:28:19 PST 2011
>>> mZxid = 0x5
>>> mtime = Mon Jan 17 18:28:19 PST 2011
>>> pZxid = 0x1d819
>>> cversion = 120710
>>> dataVersion = 0
>>> aclVersion = 0
>>> ephemeralOwner = 0x0
>>> dataLength = 0
>>> numChildren = 2955
>>>
>>> So my question is: how is znode metadata persisted with respect to
>>> the actual znodes? Is it possible that a node's children will get
>>> synced to disk before its own metadata, and if it crashes at a bad
>>> time, the metadata updates will be lost? If so, is there any way to
>>> constrain ZooKeeper so that it will sync its metadata before
>>> returning success for write operations?
>>>
>>> (I'm using ZooKeeper 3.3.2 on a Debian Squeeze 64-bit box, with
>>> openjdk-6-jre 6b18-1.8.3-2.)
>>>
>>> I'd be happy to create a JIRA for this if that seems useful, but
>>> without a way to reproduce it I'm not sure that it is.
>>>
>>> Thanks,
>>>
>>> Jeremy
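
For anyone who wants to poke at this in the meantime, here is a minimal
sketch of how the mismatch described above (parent cversion smaller than
the largest sequence number already handed out) can be observed from the
Java client. It is not code from the thread: the connection string,
session timeout, and class name are placeholders, and it assumes the
children of /zkrsm follow the "..._record<10-digit sequence>" naming
visible in the logs.

    import java.util.List;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class CversionCheck {
        public static void main(String[] args) throws Exception {
            // Connection string and timeout are placeholders for illustration.
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* ignore */ }
            });
            try {
                Stat parentStat = new Stat();
                // getChildren() also fills in the parent's own Stat.
                List<String> children = zk.getChildren("/zkrsm", false, parentStat);

                // Sequential znodes end in a zero-padded 10-digit counter
                // (e.g. ..._record0000120804); find the largest one present.
                int maxSeq = -1;
                for (String child : children) {
                    if (child.length() < 10) {
                        continue; // not a sequential name we recognize
                    }
                    int seq = Integer.parseInt(child.substring(child.length() - 10));
                    maxSeq = Math.max(maxSeq, seq);
                }

                // In the incident above, cversion (120710) was smaller than
                // the largest sequence number already assigned (120804).
                System.out.println("cversion = " + parentStat.getCversion()
                        + ", max child sequence = " + maxSeq);
            } finally {
                zk.close();
            }
        }
    }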
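And a sketch of the client-side pattern the thread describes, where a
sequential create is simply retried on NodeExists. The helper name and
retry policy are guesses, not Jeremy's actual code. My reading is that
the server derives the 10-digit suffix from the parent's cversion, so if
the cversion goes backwards across a restart the server can hand out an
already-used sequence number again, and a loop like this spins forever,
matching the repeated NodeExists messages in node1.0x4-0x5.log.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SequentialCreate {
        // Create the next record under /zkrsm; the prefix mirrors the
        // "<id>_record" names visible in the logs above.
        static String createRecord(ZooKeeper zk, String prefix, byte[] data)
                throws KeeperException, InterruptedException {
            while (true) {
                try {
                    // The server appends a 10-digit sequence number to the path.
                    return zk.create("/zkrsm/" + prefix, data,
                            ZooDefs.Ids.OPEN_ACL_UNSAFE,
                            CreateMode.PERSISTENT_SEQUENTIAL);
                } catch (KeeperException.NodeExistsException e) {
                    // A sequential create should never collide, so a blind
                    // retry looks safe -- but if the same sequence number is
                    // issued twice, this retries the same collision forever.
                }
            }
        }
    }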
