You are right in pointing out the problem. We have to log the Leader
election txnid, which currently we don't. There is an open jira for that.
If you log the txn id on a new leader election as well then this would not
be a problem.
In your case
1) A crashes
2) B is elected the leader. So the zxid of the ensemble moves to 2,0 and IS
LOGGED IN THE TRANSACTION LOG BY EVERYONE IN THE ENSEMBLE (this is the part that
is missing in the code).
Now B starts a new PROPOSAL (2,1), B logs the PROPOSAL and moves to zxid 2,1.
3) B crashes before anyone else receives the PROPOSAL.
4) C is elected as the leader. The new zxid chosen by C is
3,0 (since we logged 2,0 on C as per our last leader election)
5) Now C would start a proposal (3,1), and this way the logs do not diverge.
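
To make the steps above concrete, here is a minimal Java sketch of the two cases. It is not the actual ZooKeeper code: the class ZxidEpochSketch and the methods zxid, epoch, counter, electionZxid and replay are hypothetical names made up for illustration. The only things it assumes are what is described in this thread, namely that the zxid is a 64-bit value with the epoch in the higher 32 bits and the counter in the lower 32 bits, and that a new leader bumps the epoch of its last logged zxid and resets the counter.

```java
// Illustrative sketch only; none of these names come from the ZooKeeper source.
final class ZxidEpochSketch {
    // Pack (epoch, counter) into one 64-bit zxid: epoch in the higher 32 bits.
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epoch(long zxid) { return zxid >>> 32; }

    static long counter(long zxid) { return zxid & 0xffffffffL; }

    // A new leader bumps the epoch of its last logged zxid and resets the counter.
    static long electionZxid(long lastLoggedZxid) {
        return zxid(epoch(lastLoggedZxid) + 1, 0);
    }

    static String show(long zxid) { return epoch(zxid) + "," + counter(zxid); }

    // Replays steps 1-5 and returns { B's proposal zxid, C's proposal zxid }.
    // logElectionZxid == true models the fix: everyone logs the election zxid.
    static long[] replay(boolean logElectionZxid) {
        long bLog = zxid(1, 10); // last zxid in B's transaction log
        long cLog = zxid(1, 10); // last zxid in C's transaction log

        // 1) A crashes.  2) B is elected and picks the new zxid.
        long bElection = electionZxid(bLog);
        if (logElectionZxid) {
            bLog = bElection;
            cLog = bElection;
        }
        long bProposal = bElection + 1;
        bLog = bProposal; // only B logs the PROPOSAL

        // 3) B crashes before anyone else receives the PROPOSAL.
        // 4) C is elected and derives its zxid from its own log.
        long cElection = electionZxid(cLog);
        if (logElectionZxid) {
            cLog = cElection;
        }

        // 5) C starts a new PROPOSAL.
        long cProposal = cElection + 1;
        return new long[] { bProposal, cProposal };
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] { false, true }) {
            long[] r = replay(fix);
            System.out.println("logElectionZxid=" + fix
                    + ": B proposes " + show(r[0]) + ", C proposes " + show(r[1]));
        }
    }
}
```

Without the election zxid in the log, both B and C end up proposing 2,1. With it, C proposes 3,1, so the two logs never hold different transactions under the same zxid.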
I hope this helps.
On 3/30/09 1:31 PM, "rag...@yahoo.com" <rag...@yahoo.com> wrote:
> Thanks a lot for explaining this.
> I have one more corner case in mind where the transaction logs could diverge.
> I might be wrong this time as well, but would like to understand how it works.
> Reading the Leader.lead() code, it seems like the new leader reads the last
> logged zxid and bumps up the higher 32 bits while resetting the lower 32 bits.
> So this means that cascading leader crashes without a PROPOSAL in between
> would make the new leader choose the same zxid as the one before. This could
> lead to a corner case like below:
> In an ensemble of 5 servers (A, B, C, D and E), say the zxid is 1,10 (higher
> 32 bits, lower 32 bits) with A as the leader. Now the following events happen:
> 1. A crashes.
> 2. B is elected the leader. So the zxid of the ensemble moves to 2,0. If I
> read the code correctly, no one logs the new zxid until a new PROPOSAL is
> made. Now B starts a new PROPOSAL (2,1), B logs the PROPOSAL and moves to zxid 2,1.
> 3. B crashes before anyone else receives the PROPOSAL.
> 4. C is elected as the leader. Since the new zxid depends on the last logged
> zxid (which is still 1,10 according to C's log), the new zxid chosen by C is
> 2,0 as well.
> 5. Now C starts a new PROPOSAL (2,1), C logs the PROPOSAL and crashes before
> anyone else has received the PROPOSAL. We have diverged logs in B and C with
> the same zxid (2,1).
> Could you tell me if this is correct?
> ----- Original Message ----
> From: Benjamin Reed <br...@yahoo-inc.com>
> To: "email@example.com" <firstname.lastname@example.org>
> Sent: Saturday, 28 March, 2009 10:49:32
> Subject: Re: Divergence in ZK transaction logs in some corner cases?
> if recovery worked the way you outline, we would have a problem indeed.
> fortunately, we specifically address this case.
> the problem is in your first step. when b is elected leader, he will not
> propose 10, he will propose 100000000000001. the zxid is made up of two
> parts, the high order bits are an epoch number and the low order bits are a
> counter. whenever a new leader is elected, he will increment the epoch
> number and reset the counter.
> when A restarts you have the opposite problem, you need to make sure that A
> forgets 10 because we have skipped it and committing it will mean that 10 is
> delivered out of order. we take advantage of the epoch number in that case as
> well to make sure that A forgets about 10.
> there is some discussion about this in:
> we have a presentation as well that i'll put up that may make it more clear.
> rag...@yahoo.com wrote:
>> ZK gurus,
>> I think the ZK transaction logs can diverge from one another in some corner
>> cases. I have one such corner case listed below, could you please confirm if
>> my understanding is correct?
>> Imagine a 5 server ensemble (A,B,C,D,E). All the servers are @ zxid 9. A is
>> the leader and it starts a new PROPOSAL (@zxid 10). A writes the proposal to
>> the log, so A moves to zxid 10. Others haven't received the PROPOSAL yet and
>> A crashes. Now the following happens:
>> 1. B is elected as the new leader. B bumps up its in-mem zxid to 10. Since
>> other nodes are at the same zxid, it sends a SNAP so that the others can
>> rebuild their data tree. In-memory zxid of all other nodes moves to 10.
>> 2. A comes back now, it accepts B as the leader as soon as the leader (B)
>> and N/2 other nodes vouch for B as the leader. So A joins the ensemble. Every
>> zookeeper node is at zxid 10.
>> 3. A new request is submitted to B. B runs PROPOSAL and COMMIT phases and the
>> cluster moves up to zxid 11. But the transaction log of A is different from
>> that of everyone else now. So the transaction logs have diverged.
>> Could you confirm if this can happen? Or am I reading the code wrong?