Unfortunately I'm seeing this issue again in 3.3.3. A number of node
restarts resulted in a ZNODEEXISTS error when creating a new sequential
znode. I've got a 3-node setup running on Debian, and I've collected
the full logs here: http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz
[24 MB, 670 MB unzipped]. The logs are named as
nodeX.<zxid_prefixes>.log, and each new log represents an application
process restart.
Here's the scenario:
1) There's a cluster with nodes 1,2,3 using zxid 0x3.
2) All three nodes restart, forming a cluster of zxid 0x4.
3) Node 3 restarts, leading to a cluster of 0x5.
At this point, it seems like node 1 is the leader of the 0x5 epoch. In
its log (node1.0x4-0x5.log) you can see the first (of many) instances of
the following message:
2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
org.apache.zookeeper.server.PrepRequestProcessor - Got user-level
KeeperException when processing sessionid:0x512f466bd44e0002 type:create
cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
Error Path:/zkrsm/00000000000000b2_record0001761440
Error:KeeperErrorCode = NodeExists for
/zkrsm/00000000000000b2_record0001761440
This then repeats forever, because my application doesn't expect to ever get
this error on a sequential-node create and just continually retries. The
message even carries over to node3.0x5-0x6.log once the 0x6 epoch comes into
play.
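For reference, the create/retry path in our code boils down to something like
this (a simplified sketch, not our actual code; the class and method names are
made up and the open ACL is an assumption):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Simplified sketch of our append path; names are illustrative only and the
// ACL is assumed.  We never expect NodeExists from a PERSISTENT_SEQUENTIAL
// create, so every KeeperException is treated as transient and the same
// request is retried -- hence the endless repeats in the log above.
public class RecordAppender {
    String appendRecord(ZooKeeper zk, byte[] data) throws InterruptedException {
        while (true) {
            try {
                // The server appends a 10-digit sequence number to this prefix.
                return zk.create("/zkrsm/00000000000000b2_record", data,
                                 ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                 CreateMode.PERSISTENT_SEQUENTIAL);
            } catch (KeeperException e) {
                // Includes NodeExistsException (-110), which keeps us looping here.
            }
        }
    }
}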
I don't see anything terribly fishy in the transition between the
epochs; the correct snapshots seem to be getting transferred, etc.
Unfortunately I don't have a ZK snapshot/log that exhibits the problem
when starting with a fresh system.
Does anyone have any ideas? Maybe it's related to one of the
outstanding election JIRAs (like ZOOKEEPER-975)? Thanks,
Jeremy
P.S. Some oddities you might notice in these logs:
* Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due
to a bug in our application code.
* We manage node membership dynamically, and our application restarts the
ZooKeeperServer classes whenever a new node wants to join (without restarting
the entire application process). This is why you'll see messages like the
following in node1.0x4-0x5.log before a new election begins (a rough sketch of
this restart path follows the excerpt):
2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
org.apache.zookeeper.server.quorum.Learner - shutdown called
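Concretely, that restart path looks roughly like this (a very rough sketch
under my assumptions, not the real code; the class, method, and parameter
names are made up, and the QuorumPeerMain-style setup that builds the new
QuorumPeer from the updated member list is omitted):

import org.apache.zookeeper.server.quorum.QuorumPeer;

// Very rough sketch of our membership-change path, not the real code.
// oldPeer is the QuorumPeer already running in this process; newPeer is a
// freshly configured QuorumPeer built from the updated member list.
public class MembershipManager {
    void onMembershipChange(QuorumPeer oldPeer, QuorumPeer newPeer)
            throws InterruptedException {
        oldPeer.shutdown();  // tear down the current follower/leader; I believe
                             // the "Learner - shutdown called" lines come from here
        oldPeer.join();      // QuorumPeer extends Thread, so wait for it to exit
        newPeer.start();     // rejoin with the new member list; a new election follows
    }
}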
On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
My very shaky understanding from skimming those issues was that in
some cases there were two threads handling different types of data
that are related to the same transaction -- but maybe that's only true
when there's a leader and a follower. But I also saw something in
there about restoring data from a snapshot vs. restoring it from a
log, which seems like it could have happened in a single node case.
In any case, now that 3.3.3 is out I'll give it a try and report back
if we keep seeing this.
Thanks!
Jeremy
On 03/01/2011 02:59 PM, Vishal Kher wrote:
Hi Jeremy,
I just realized that you are using a standalone ZK server. I don't
think the bugs apply to you, so I don't have an answer to your question.
I think 3.3.3 should be released soon:
http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
-Vishal
On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling <[email protected]> wrote:
Thanks for the pointers Vishal, I hadn't seen those. They look
like they could be related, but without knowing how metadata
updates are grouped into transactions, it's hard for me to say.
I would expect the cversion update to happen within the same
transaction as the creation of a new child, but if they get
written to the log in two separate steps, perhaps these issues
could explain it.
Any estimate on when 3.3.3 will be released? I haven't seen any
updates on the user list about it. Thanks,
Jeremy
On 03/01/2011 12:40 PM, Vishal Kher wrote:
Hi Jeremy,
One of the main reasons for the 3.3.3 release was to include fixes for znode
inconsistency bugs.
Have you taken a look at
https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
https://issues.apache.org/jira/browse/ZOOKEEPER-919?
The problem that you are seeing sounds similar to the ones reported.
-Vishal
On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling <[email protected]> wrote:
Hi all,
A while back I noticed that my Zookeeper cluster got into a state where I
would get a "node exists" error back when creating a sequential znode -- see
the thread starting at
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%[email protected]%3E
for more details. The summary is that at the time, my application had a bug
that could have been improperly bringing new nodes into a cluster.

However, I've seen this a couple more times since fixing that original bug. I
don't yet know how to reproduce it, but I am going to keep trying. In one
case, we restarted a node (in a one-node cluster), and when it came back up we
could no longer create sequential nodes on a certain parent node, with a node
exists (-110) error code. The biggest child it saw on restart was
/zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of 120804);
however, a stat on the parent node revealed that the cversion was only 120710:
[zk:<ip:port>(CONNECTED) 3] stat /zkrsm
cZxid = 0x5
ctime = Mon Jan 17 18:28:19 PST 2011
mZxid = 0x5
mtime = Mon Jan 17 18:28:19 PST 2011
pZxid = 0x1d819
cversion = 120710
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 2955
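If I'm reading PrepRequestProcessor correctly (and I may well not be), the
sequence suffix for a sequential create comes straight from the parent's
cversion, zero-padded to ten digits. If so, the numbers above would explain
the collision exactly; here's a small illustrative snippet (not ZooKeeper
code, just my reading of what it does):

import java.util.Locale;

public class SeqNameExample {
    public static void main(String[] args) {
        // My (possibly wrong) understanding of how the server builds the name
        // for a sequential create: append the parent's current cversion.
        String requestedPath = "/zkrsm/000000000000002d_record";
        int parentCVersion = 120710;  // what "stat /zkrsm" reported after restart

        String nextChild = requestedPath
                + String.format(Locale.ENGLISH, "%010d", parentCVersion);
        System.out.println(nextChild);
        // -> /zkrsm/000000000000002d_record0000120710
        // That child already exists on disk (children up to ..._record0000120804
        // survived the restart), so the create fails with NodeExists (-110).
    }
}

Assuming that reading is right, the failure is just the parent's cversion
ending up behind its existing children after the restart.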
So my question is: how is znode metadata persisted with respect to the actual
znodes? Is it possible that a node's children will get synced to disk before
its own metadata, and if it crashes at a bad time, the metadata updates will
be lost? If so, is there any way to constrain Zookeeper so that it will sync
its metadata before returning success for write operations?

(I'm using Zookeeper 3.3.2 on a Debian Squeeze 64-bit box, with openjdk-6-jre
6b18-1.8.3-2.)

I'd be happy to create a JIRA for this if that seems useful, but without a way
to reproduce it I'm not sure that it is.
Thanks,
Jeremy