I have some ideas about where to look for this, and I'm happy to do it in the next day or two if no one else wants to take it. Please do open a JIRA even though you can't reproduce it consistently yet, just so we have somewhere to track the effort.
C

-----Original Message-----
From: Jeremy Stribling [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 5:11 PM
To: Vishal Kher
Cc: [email protected]
Subject: Re: znode metadata consistency

I haven't found a way to reproduce it consistently, but if I do I will
update a JIRA. Until then, I was mostly just hoping that someone on-list
could explain whether the metadata of a parent node is written to disk
in some consistent way at the same time that a new child node is added
under the parent. I would hope that those two operations get lumped
together in the same transaction, but I haven't yet been able to dig
into the code myself and confirm that.

On 04/12/2011 02:07 PM, Vishal Kher wrote:
> Hi Jeremy,
>
> It would be good to open a JIRA for this. In addition to this
> description, please explain the steps to reproduce this problem.
> Did this problem happen after changing membership? Can you also add to
> the JIRA how you perform the membership change?
>
> -Vishal
>
> On Tue, Apr 12, 2011 at 3:33 PM, Jeremy Stribling
> <[email protected]> wrote:
>
> Unfortunately I'm seeing this issue again in 3.3.3. A number of
> node restarts resulted in a ZNODEEXISTS error when creating a new
> sequential znode. I've got a 3-node setup running on Debian, and
> I've collected the full logs here:
> http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz [24 MB,
> 670 MB unzipped]. The logs are named as nodeX.<zxid_prefixes>.log,
> and each new log represents an application process restart.
>
> Here's the scenario:
>
> 1) There's a cluster with nodes 1, 2, 3 using zxid epoch 0x3.
> 2) All three nodes restart, forming a cluster with epoch 0x4.
> 3) Node 3 restarts, leading to a cluster with epoch 0x5.
>
> At this point, it seems like node 1 is the leader of the 0x5
> epoch. In its log (node1.0x4-0x5.log) you can see the first (of
> many) instances of the following message:
>
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
> org.apache.zookeeper.server.PrepRequestProcessor - Got user-level
> KeeperException when processing sessionid:0x512f466bd44e0002
> type:create cxid:0x4da376ab zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a Error
> Path:/zkrsm/00000000000000b2_record0001761440
> Error:KeeperErrorCode = NodeExists for
> /zkrsm/00000000000000b2_record0001761440
>
> This then repeats forever, as my application isn't expecting to
> ever get this error message on a sequential node create, and just
> continually retries. The message even transfers over to
> node3.0x5-0x6.log once the 0x6 epoch comes into play.
>
> I don't see anything terribly fishy in the transition between the
> epochs; the correct snapshots seem to be getting transferred, etc.
> Unfortunately I don't have a ZK snapshot/log that exhibits the
> problem when starting with a fresh system.
>
> Does anyone have any ideas? Maybe it's related to one of the
> outstanding election JIRAs (like ZOOKEEPER-975)? Thanks,
>
> Jeremy
>
> P.S. Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the ZooKeeper IDs of the nodes
> changed due to a bug in our application code.
> * We manage node membership dynamically, and our application
> restarts the ZooKeeperServer classes whenever a new node wants to
> join (without restarting the entire application process).
> This is why you'll see messages like the following in
> node1.0x4-0x5.log before a new election begins:
>
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
> org.apache.zookeeper.server.quorum.Learner - shutdown called
>
> On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
>> My very shaky understanding from skimming those issues was that
>> in some cases there were two threads handling different types of
>> data that are related to the same transaction -- but maybe that's
>> only true when there's a leader and a follower. But I also saw
>> something in there about restoring data from a snapshot vs.
>> restoring it from a log, which seems like it could have happened
>> in a single-node case.
>>
>> In any case, now that 3.3.3 is out I'll give it a try and report
>> back if we keep seeing this.
>>
>> Thanks!
>>
>> Jeremy
>>
>> On 03/01/2011 02:59 PM, Vishal Kher wrote:
>>> Hi Jeremy,
>>>
>>> I just realized that you are using a standalone ZK server. I
>>> don't think the bugs apply to you, so I don't have an answer to
>>> your question.
>>> I think 3.3.3 should be released soon:
>>>
>>> http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
>>>
>>> -Vishal
>>>
>>> On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling
>>> <[email protected]> wrote:
>>>
>>> Thanks for the pointers Vishal, I hadn't seen those. They
>>> look like they could be related, but without knowing how
>>> metadata updates are grouped into transactions, it's hard
>>> for me to say. I would expect the cversion update to happen
>>> within the same transaction as the creation of a new child,
>>> but if they get written to the log in two separate steps,
>>> perhaps these issues could explain it.
>>>
>>> Any estimate on when 3.3.3 will be released? I haven't seen
>>> any updates on the user list about it. Thanks,
>>>
>>> Jeremy
>>>
>>> On 03/01/2011 12:40 PM, Vishal Kher wrote:
>>>
>>> Hi Jeremy,
>>>
>>> One of the main reasons for the 3.3.3 release was to include
>>> fixes for znode inconsistency bugs.
>>> Have you taken a look at
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-919?
>>> The problem that you are seeing sounds similar to the ones
>>> reported.
>>>
>>> -Vishal
>>>
>>> On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling
>>> <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> A while back I noticed that my ZooKeeper cluster got into a
>>> state where I would get a "node exists" error back when
>>> creating a sequential znode -- see the thread starting at
>>> http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E
>>> for more details. The summary is that at the time, my
>>> application had a bug that could have been improperly
>>> bringing new nodes into a cluster.
>>>
>>> However, I've seen this a couple more times since fixing that
>>> original bug. I don't yet know how to reproduce it, but I am
>>> going to keep trying. In one case, we restarted a node (in a
>>> one-node cluster), and when it came back up we could no longer
>>> create sequential nodes on a certain parent node, with a
>>> node exists (-110) error code.
>>> The biggest child it saw on restart was
>>> /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number
>>> of 120804); however, a stat on the parent node revealed that the
>>> cversion was only 120710:
>>>
>>> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
>>> cZxid = 0x5
>>> ctime = Mon Jan 17 18:28:19 PST 2011
>>> mZxid = 0x5
>>> mtime = Mon Jan 17 18:28:19 PST 2011
>>> pZxid = 0x1d819
>>> cversion = 120710
>>> dataVersion = 0
>>> aclVersion = 0
>>> ephemeralOwner = 0x0
>>> dataLength = 0
>>> numChildren = 2955
>>>
>>> So my question is: how is znode metadata persisted with respect to
>>> the actual znodes? Is it possible that a node's children will get
>>> synced to disk before its own metadata, and if it crashes at a bad
>>> time, the metadata updates will be lost? If so, is there any way to
>>> constrain ZooKeeper so that it will sync its metadata before
>>> returning success for write operations?
>>>
>>> (I'm using ZooKeeper 3.3.2 on a Debian Squeeze 64-bit box, with
>>> openjdk-6-jre 6b18-1.8.3-2.)
>>>
>>> I'd be happy to create a JIRA for this if that seems useful, but
>>> without a way to reproduce it I'm not sure that it is.
>>>
>>> Thanks,
>>>
>>> Jeremy
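
For anyone who wants to poke at this in the meantime, here is a minimal
sketch of how the mismatch described above (parent cversion smaller than
the largest sequence number already handed out) can be observed from the
Java client. It is not code from the thread: the connection string,
session timeout, and class name are placeholders, and it assumes the
children of /zkrsm follow the "..._record<10-digit sequence>" naming
visible in the logs.

    import java.util.List;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class CversionCheck {
        public static void main(String[] args) throws Exception {
            // Connection string and timeout are placeholders for illustration.
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* ignore */ }
            });
            try {
                Stat parentStat = new Stat();
                // getChildren() also fills in the parent's own Stat.
                List<String> children = zk.getChildren("/zkrsm", false, parentStat);

                // Sequential znodes end in a zero-padded 10-digit counter
                // (e.g. ..._record0000120804); find the largest one present.
                int maxSeq = -1;
                for (String child : children) {
                    if (child.length() < 10) {
                        continue; // not a sequential name we recognize
                    }
                    int seq = Integer.parseInt(child.substring(child.length() - 10));
                    maxSeq = Math.max(maxSeq, seq);
                }

                // In the incident above, cversion (120710) was smaller than
                // the largest sequence number already assigned (120804).
                System.out.println("cversion = " + parentStat.getCversion()
                        + ", max child sequence = " + maxSeq);
            } finally {
                zk.close();
            }
        }
    }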
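And a sketch of the client-side pattern the thread describes, where a
sequential create is simply retried on NodeExists. The helper name and
retry policy are guesses, not Jeremy's actual code. My reading is that
the server derives the 10-digit suffix from the parent's cversion, so if
the cversion goes backwards across a restart the server can hand out an
already-used sequence number again, and a loop like this spins forever,
matching the repeated NodeExists messages in node1.0x4-0x5.log.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SequentialCreate {
        // Create the next record under /zkrsm; the prefix mirrors the
        // "<id>_record" names visible in the logs above.
        static String createRecord(ZooKeeper zk, String prefix, byte[] data)
                throws KeeperException, InterruptedException {
            while (true) {
                try {
                    // The server appends a 10-digit sequence number to the path.
                    return zk.create("/zkrsm/" + prefix, data,
                            ZooDefs.Ids.OPEN_ACL_UNSAFE,
                            CreateMode.PERSISTENT_SEQUENTIAL);
                } catch (KeeperException.NodeExistsException e) {
                    // A sequential create should never collide, so a blind
                    // retry looks safe -- but if the same sequence number is
                    // issued twice, this retries the same collision forever.
                }
            }
        }
    }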
