Re: Possible race in LETest.java

Patrick Hunt Tue, 10 Nov 2009 22:05:19 -0800

Closing the loop - what's the status on this? Can one of you open a JIRA and provide a patch for this?


Thanks,


Patrick

Flavio Junqueira wrote:

Hi Henry, Apologies for the delay. Your observation sounds right to me. Here is how I'm reading it; let me know if it makes sense.
If everyone votes for 3 in the second round and 3 has crashed, then in countVotes we will remove all votes for 3 and there will be no vote left. In such a case, there will be no winner as a result of the call to countVotes, and lookForLeader won't change the current vote (LeaderElection.java:201). This is a situation in which we are stuck.
Does it sound reasonable to add an "else" to the "if" statement at LeaderElection.java:201 to reset the vote? This modification would implement resetting the vote when countVotes returns no winner, which should happen only when the replica itself votes for a dead leader.
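
A minimal sketch of the reset described above. This is not the actual LeaderElection.java code: the class name, the method signatures, the NO_WINNER sentinel and the resetOnNoWinner flag are all hypothetical, and the tally is simplified to "most votes among servers we have heard from, ties broken by higher id" with no quorum check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class VoteResetSketch {
    // Hypothetical sentinel meaning "countVotes found no winner".
    static final long NO_WINNER = -1L;

    // Tally the votes, discarding any vote for a server we have not heard from.
    // votes maps voter id -> id voted for. Returns NO_WINNER if nothing is left.
    static long countVotes(Map<Long, Long> votes, Set<Long> heardFrom) {
        Map<Long, Integer> tally = new HashMap<>();
        for (long candidate : votes.values()) {
            if (heardFrom.contains(candidate)) {
                tally.merge(candidate, 1, Integer::sum);
            }
        }
        long winner = NO_WINNER;
        int best = 0;
        for (Map.Entry<Long, Integer> e : tally.entrySet()) {
            // Most votes wins; ties go to the higher id (an assumption of this sketch).
            if (e.getValue() > best || (e.getValue() == best && e.getKey() > winner)) {
                winner = e.getKey();
                best = e.getValue();
            }
        }
        return winner;
    }

    // Decide the vote for the next round.
    static long nextVote(long myId, long currentVote, Map<Long, Long> votes,
                         Set<Long> heardFrom, boolean resetOnNoWinner) {
        long winner = countVotes(votes, heardFrom);
        if (winner != NO_WINNER) {
            return winner;       // the existing "if": adopt the winner
        } else if (resetOnNoWinner) {
            return myId;         // the proposed "else": reset the vote to ourselves
        }
        return currentVote;      // without the "else": keep voting for the dead server
    }
}
```

With votes {1=3, 2=3} and only 1 and 2 heard from, countVotes returns NO_WINNER, so node 1 keeps voting for 3 when resetOnNoWinner is false and falls back to voting for itself when it is true.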
-Flavio

On Oct 28, 2009, at 7:44 AM, Henry Robinson wrote:
[ Sending this direct since the Apache mailserver is rejecting my e-mails at the moment ]
As I understand it, 1 and 2 receive a vote for 3 in the first round, which causes them to vote for 3 in the second round. So in the second round, all votes cast are for 3. But 3 has died, so all votes for it are discounted. 1 and 2 continue to vote for 3 ad infinitum, never resetting their vote.
Does this sound plausible, or am I missing something?

cheers,
Henry
On Tue, Oct 27, 2009 at 3:48 PM, Flavio Junqueira <f...@yahoo-inc.com> wrote:
Hi Henry, I don't understand how 1 and 2 do not end up electing 2 in your situation. If they exclude 3 in countVotes, then countVotes will end up returning 2 and not 3, assuming there is a vote for 2. What am I missing?
The problem with QuorumPeer you're pointing at was also an issue with the FLE tests, and I couldn't see an easy way around it other than timing out and restarting leader election.
Cheers,
-Flavio


On Oct 27, 2009, at 6:35 AM, Henry Robinson wrote:

I've been working on adding a TCPResponderThread to the leader election
process so that if a deployment needs to be TCP only, it can be and still
use all election types. Testing this has exposed what might be a race
condition in the leader election code that prevents a leader from being
elected.
Here's the behaviour I see in LETest occasionally. With three nodes (reduced from 30 for ease of debugging), node 3 gets elected before either node 1 or node 2 finish their election (there is one round where each node finds that 3 has the highest id, and then 3 completes its second round by receiving votes for itself from 1 and 2, but 1 and 2 do not receive votes from 3).

Now 3 is killed by the test harness. 1 and 2 are still voting for it, but
every time they try, the vote tally excludes 3 since it hasn't been heard
from. They then spin round the voting process, unable to reset their vote. I
expect that the heartbeat mechanism in a running QuorumPeer takes care of
this when the leader is lost, but the associated QuorumPeers aren't running.
If this is the case, then there is a simple fix to reset the nodes' vote to
themselves if they are voting for a node that hasn't been heard from. I
don't know why using TCP instead of UDP for the responder thread is
exacerbating this (and we can't rule out my introducing a bug :)); but as
it's a race condition the different timings associated with waiting on a TCP
socket might just be enough to expose the issue.
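
The spin described above, and the effect of resetting a vote for a node that hasn't been heard from, can be illustrated with a toy multi-round simulation. Everything here is an assumption made for illustration: the class and method names are hypothetical, "heard from" is reduced to membership in a fixed set of live ids, and the election rule is reduced to "adopt the highest live id that received a vote". It is not the real LETest or QuorumPeer logic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StuckElectionSketch {
    // Runs up to maxRounds voting rounds among the live nodes, all of which start
    // out voting for initialVote. Returns the elected id, or -1 if none emerges.
    static long simulate(Set<Long> live, long initialVote,
                         boolean resetDeadVotes, int maxRounds) {
        Map<Long, Long> votes = new HashMap<>();
        for (long id : live) {
            votes.put(id, initialVote);
        }
        for (int round = 0; round < maxRounds; round++) {
            if (resetDeadVotes) {
                // The suggested fix: a node voting for an unheard-from node
                // resets its vote to itself.
                for (long id : live) {
                    if (!live.contains(votes.get(id))) {
                        votes.put(id, id);
                    }
                }
            }
            // Tally, excluding votes for nodes that have not been heard from.
            long best = -1L;
            for (long v : votes.values()) {
                if (live.contains(v) && v > best) {
                    best = v;
                }
            }
            if (best == -1L) {
                continue; // every vote was discounted; nobody changes their vote
            }
            boolean unanimous = true;
            for (long id : live) {
                if (votes.get(id) != best) {
                    unanimous = false;
                }
                votes.put(id, best);
            }
            if (unanimous) {
                return best;
            }
        }
        return -1L;
    }
}
```

With live nodes 1 and 2 both voting for the dead node 3, simulate returns -1 for any number of rounds when resetDeadVotes is false, and elects 2 on the second round when it is true.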

Can someone verify this might be possible / figure out what I missed?

cheers,
Henry
