Hi Raghu, I'm glad that you going deep into the code, so thanks for
all your questions.
In your description below, you say that a faulty peer gets back. If it
gets back, then it is not faulty, right? To be more concrete, suppose
that this peer was partitioned away, and it now ends up being elected
because it has a zxid higher than everyone else. I would say that this
behavior is correct.
The behavior we are trying to avoid is different. Suppose that the
current leader CL crashes and a new leader NL arises. Suppose also
that NL crashes before followers are able to connect to NL. In this
case, followers will move on to another round of leader election. If
there is a slow process that didn't finish the election of NL, but has
NL as its current candidate, then it will propose it again and without
the notion of rounds, server will accept the notification from the
slow process.
Thanks,
-Flavio
On Apr 14, 2009, at 5:10 PM, rag...@yahoo.com wrote:
Falvio,
Thanks for explaining this.
When the faulty peer gets back and attempts to propose itself as the
leader, it's clear that all the other peers don't consider its
proposal and notify the faulty peer that they are in a higher epoch.
However, the faulty peer will sync up its logical clock upon
receiving the first notification from a higher epoch and resend a
proposal notification to all with itself as the proposed leader
(because it's zxid is higher). If the other peers haven't completed
the election loop by the time the updated notificaiton is received
from the faulty peer, they will succumb again, update their proposal
record and send notifications to all others with faulty peer as the
proposed leader.
So the logical clock only seems to be buying some time here, rather
than completely eliminating the faulty peer. The code seems to be
hoping that the rest of the peers will complete their election loop
and start following a new leader by the time the faulty peer syncs
up its logical clock and notifies other peers. Is my understanding
correct?
-Raghu
----- Original Message ----
From: Flavio Junqueira <f...@yahoo-inc.com>
To: zookeeper-dev@hadoop.apache.org
Cc: rag...@yahoo.com
Sent: Monday, 13 April, 2009 15:08:10
Subject: Re: FastLeaderElection
Hi Raghu, Upon multiple consecutive crashes (or perhaps a network
partition), it is possible that we keep electing a faulty server if
we only use zxid. We avoid such a problem using a logical clock as
servers only consider changing their proposals if they received a
notification from the same or a later epoch. With this mechanism, if
an elected server crashes before exercising its role as a leader, it
won't be considered in later epochs. Without a logical clock, a
server lagging behind in the election could re-introduce the faulty
server into the election, and it would be elected again if the
faulty server is the one with highest zxid.
Note that we are not using "logical clocks" in the sense of Lamport
clocks. We are not incrementing upon every event, but instead only
counting rounds of leader election.
-Flavio
On Apr 13, 2009, at 8:55 PM, rag...@yahoo.com wrote:
Could someone please throw some light on this? Thanks.
-Raghu
----- Original Message ----
From: "rag...@yahoo.com" <rag...@yahoo.com>
To: zookeeper-u...@hadoop.apache.org
Sent: Friday, 10 April, 2009 8:11:34
Subject: FastLeaderElection
Hi,
Could someone please explain quickly why logical clock is used in
FastLeaderElection? It looks to me like the peers can converge on a
leader (with highest zxid or server id if zxids are the same) even
without the logical clock. May be I am missing something here, I
could not figure out why logical clock is needed.
Thanks
Raghu