Ok thank you. I've created:
https://issues.apache.org/jira/browse/ZOOKEEPER-1046
and left it unassigned for now.
Jeremy
On 04/12/2011 03:09 PM, Fournier, Camille F. [Tech] wrote:
I have some ideas of where to look on this, happy to do it in the next day or
two if no one else wants to look. Please do open a jira even though you can't
reproduce it consistently yet, just so we have somewhere to track the efforts.
C
-----Original Message-----
From: Jeremy Stribling [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 5:11 PM
To: Vishal Kher
Cc: [email protected]
Subject: Re: znode metadata consistency
I haven't found a way to reproduce it consistently, but if I do I will
update a JIRA. Until then, I was mostly just hoping that someone
on-list could explain whether the meta-data of a parent node is written
to disk in some consistent way at the same time that a new child node is
added under the parent. I would hope that those two operations get
lumped together in the same transaction, but I haven't yet been able to
dig into the code myself and confirm that.
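For what it's worth, the invariant I'm assuming is easy to state against the
standard Java client. This is just a rough sketch (the connect string, timeout,
and paths below are made up), but it's the property I'd like confirmed:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Rough sketch, not our real code: connect string and paths are made up.
public class CversionCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { }
        });

        Stat before = zk.exists("/zkrsm", false);           // parent stat before
        String child = zk.create("/zkrsm/n_", new byte[0],  // server picks the suffix
                Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        Stat after = zk.exists("/zkrsm", false);            // parent stat after

        // If the child create and the parent's metadata update are applied (and
        // logged) as one transaction, cversion should have advanced along with
        // the new child, and that should survive a crash/restart.
        System.out.printf("created %s, parent cversion %d -> %d%n",
                child, before.getCversion(), after.getCversion());
        zk.close();
    }
}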
On 04/12/2011 02:07 PM, Vishal Kher wrote:
Hi Jeremy,
It would be good to open a jira for this. In addition to this
description, please explain the step to reproduce this problem.
Did this problem happen after changing membership? Can you also add to
the jira how you perform membership change?
-Vishal
On Tue, Apr 12, 2011 at 3:33 PM, Jeremy Stribling <[email protected]> wrote:
Unfortunately I'm seeing this issue again in 3.3.3. A number of
node restarts resulted in a ZNODEEXISTS error when creating a new
sequential znode. I've got a 3-node setup running on Debian, and
I've collected the full logs here:
http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz
[24 MB,
670 MB unzipped]. The logs are named as
nodeX.<zxid_prefixes>.log, and each new log represents an
application process restart.
Here's the scenario:
1) There's a cluster with nodes 1, 2, and 3 on zxid epoch 0x3.
2) All three nodes restart, forming a cluster on epoch 0x4.
3) Node 3 restarts, leading to a cluster on epoch 0x5.
At this point, it seems like node 1 is the leader of the 0x5
epoch. In its log (node1.0x4-0x5.log) you can see the first (of
many) instances of the following message:
2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
org.apache.zookeeper.server.PrepRequestProcessor - Got user-level
KeeperException when processing sessionid:0x512f466bd44e0002
type:create cxid:0x4da376ab zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a Error
Path:/zkrsm/00000000000000b2_record0001761440
Error:KeeperErrorCode = NodeExists for
/zkrsm/00000000000000b2_record0001761440
This then repeats forever, because my application doesn't expect to
ever get this error on a sequential node create and just
continually retries. The error even carries over into
node3.0x5-0x6.log once the 0x6 epoch comes into play.
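For context, the application's create path is essentially the following (a
simplified sketch, not the real code; class and method names are made up):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Simplified sketch of our retry loop; names are made up.
class RecordAppender {
    private final ZooKeeper zk;

    RecordAppender(ZooKeeper zk) {
        this.zk = zk;
    }

    String appendRecord(String prefix, byte[] data) throws Exception {
        while (true) {
            try {
                // PERSISTENT_SEQUENTIAL lets the server pick the suffix, so a
                // NodeExists error here should be impossible in normal operation.
                return zk.create(prefix, data, Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT_SEQUENTIAL);
            } catch (KeeperException.NodeExistsException e) {
                // ...which is why we simply retry, and why the message above
                // repeats forever once the server starts handing out names
                // that already exist.
            }
        }
    }
}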
I don't see anything terribly fishy in the transition between the
epochs; the correct snapshots seem to be getting transferred,
etc. Unfortunately I don't have a ZK snapshot/log that exhibits
the problem when starting with a fresh system.
Does anyone have any ideas? Maybe it's related to one of the
outstanding election JIRAs (like ZOOKEEPER-975)? Thanks,
Jeremy
P.S. Some oddities you might notice in these logs:
* Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes
changed due to a bug in our application code.
* We manage node membership dynamically, and our application
restarts the ZooKeeperServer classes whenever a new node wants to
join (without restarting the entire application process). This is
why you'll see messages like the following in node1.0x4-0x5.log
before a new election begins:
2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.Learner - shutdown called
On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
My very shaky understanding from skimming those issues was that
in some cases there were two threads handling different types of
data that are related to the same transaction -- but maybe that's
only true when there's a leader and a follower. But I also saw
something in there about restoring data from a snapshot vs.
restoring it from a log, which seems like it could have happened
in a single node case.
In any case, now that 3.3.3 is out I'll give it a try and report
back if we keep seeing this.
Thanks!
Jeremy
On 03/01/2011 02:59 PM, Vishal Kher wrote:
Hi Jeremy,
I just realized that you are using a standalone ZK server. I
don't think the bugs apply to you, so I don't have an answer to
your question.
I think 3.3.3 should be released soon:
http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
-Vishal
On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling
<[email protected]<mailto:[email protected]>> wrote:
Thanks for the pointers Vishal, I hadn't seen those. They
look like they could be related, but without knowing how
metadata updates are grouped into transactions, it's hard
for me to say. I would expect the cversion update to happen
within the same transaction as the creation of a new child,
but if they get written to the log in two separate steps,
perhaps these issues could explain it.
Any estimate on when 3.3.3 will be released? I haven't seen
any updates on the user list about it. Thanks,
Jeremy
On 03/01/2011 12:40 PM, Vishal Kher wrote:
Hi Jeremy,
One of the main reasons for 3.3.3 release was to include
fixes for znode
inconsistency bugs.
Have you taken a look at
https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
https://issues.apache.org/jira/browse/ZOOKEEPER-919?
The problem that you are seeing sounds similar to the
ones reported.
-Vishal
On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling <[email protected]> wrote:
Hi all,
A while back I noticed that my Zookeeper cluster got
into a state where I
would get a "node exists" error back when creating a
sequential znode -- see
the thread starting at
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E for
more details. The summary is that at the time, my
application had a bug
that could have been improperly bringing new nodes
into a cluster.
However, I've seen this a couple more times since
fixing that original bug.
I don't yet know how to reproduce it, but I am
going to keep trying. In
one case, we restarted a node (in a one-node
cluster), and when it came back
up we could no longer create sequential nodes on a
certain parent node, with
a node exists (-110) error code. The biggest child
it saw on restart was
/zkrsm/000000000000002d_record0000120804 (i.e., a
sequence number of 120804); however, a stat on the
parent node revealed that the cversion was
only 120710:
[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
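My (possibly wrong) understanding is that the sequence suffix is derived
from the parent's cversion, so a cversion that lags behind the existing
children would make the server hand out names that are already taken.
Something like this, purely illustrative and not verified against the
3.3.2 source:

import java.util.Locale;

// Illustrative only: assumes the sequential suffix comes from the parent's
// cversion, which I haven't confirmed in the code.
public class SuffixCollision {
    public static void main(String[] args) {
        int staleCversion = 120710;  // what stat showed after the restart
        int largestChild = 120804;   // largest suffix actually present

        String prefix = "/zkrsm/000000000000002d_record";
        String next = prefix + String.format(Locale.ENGLISH, "%010d", staleCversion);

        // 120710 < 120804, so (unless that record was deleted) "next" names a
        // child that already exists and the create fails with NodeExists (-110),
        // and it keeps failing until cversion climbs past the largest suffix.
        System.out.println("server would try to create: " + next);
        System.out.println("collision: " + (staleCversion <= largestChild));
    }
}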
So my question is: how is znode metadata persisted
with respect to the
actual znodes? Is it possible that a node's
children will get synced to
disk before its own metadata, and if it crashes at a
bad time, the metadata
updates will be lost? If so, is there any way to
constrain Zookeeper so
that it will sync its metadata before returning
success for write
operations?
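To make the ordering I'm hoping for concrete, here's a purely illustrative
sketch; none of these names come from the ZooKeeper codebase:

import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical write path, for illustration only.
class DurableCreateSketch {
    private final FileOutputStream txnLog;

    DurableCreateSketch(FileOutputStream txnLog) {
        this.txnLog = txnLog;
    }

    void create(byte[] serializedTxn) throws IOException {
        // 1. Append one record that covers BOTH the new child and the parent's
        //    updated cversion/pZxid.
        txnLog.write(serializedTxn);
        // 2. Force it to disk before acknowledging anything.
        txnLog.getChannel().force(true);
        // 3. Only then apply the change in memory and ack the client, so a
        //    crash can never leave the child durable but the parent metadata stale.
        applyInMemory(serializedTxn);
        ackClient();
    }

    private void applyInMemory(byte[] txn) { /* update in-memory tree */ }
    private void ackClient() { /* send success to the client */ }
}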
(I'm using Zookeeper 3.3.2 on a Debian Squeeze
64-bit box, with
openjdk-6-jre 6b18-1.8.3-2.)
I'd be happy to create a JIRA for this if that seems
useful, but without a
way to reproduce it I'm not sure that it is.
Thanks,
Jeremy