[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Robinson updated ZOOKEEPER-569:
-------------------------------------

    Attachment: zookeeper-569.patch

I'm making really heavy weather of this :) Flavio, thanks, you're completely 
right. You also exposed a weird issue: the listener was causing the test to 
succeed no matter what messages were sent, although the test would still fail 
if the fix for this issue was not in place. This is fixed by using the right 
port for the DatagramSocket. 

New patch hopefully has the DatagramSocket listening on the right port, with 
everything else in order.

> Failure of elected leader can lead to never-ending leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-569
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-569
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-569.patch, ZOOKEEPER-569.patch, 
> zookeeper-569.patch, zookeeper-569.patch, zookeeper-569.patch, 
> zookeeper-569.patch
>
>
> It is possible for basic LeaderElection to enter a situation where it never 
> terminates. 
> As an example, consider a three node cluster A, B and C.
> 1. In the first round, A votes for A, B votes for B and C votes for C
> 2. Since C > B > A, all nodes resolve to vote for C in the second round as 
> there is no first round winner
> 3. A, B vote for C, but C fails.
> 4. C is not elected because neither A nor B hears from it, and so votes for 
> it are discarded
> 5. A and B never reset their votes, despite not hearing from C, so they 
> continue to vote for it ad infinitum. 
> Step 5 is the bug. If A and B reset their votes to themselves in the case 
> where the heard-from vote set is empty, leader election will continue.
> I do not know if this affects running ZK clusters, as it is possible that the 
> out-of-band failure detection protocols may cause leader election to be 
> restarted anyhow, but I've certainly seen this in tests. 
> I have a trivial patch which fixes it, but it needs a test (and tests for 
> race conditions are hard to write!)
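
The scenario in the description above can be sketched as a toy simulation. This 
is only an illustration under stated assumptions, not ZooKeeper's LeaderElection 
code: the class ElectionSketch, its methods, and the ids 1, 2, 3 standing for 
A, B, C are all hypothetical, and the quorum rule (more than half of the 
cluster) is assumed rather than taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the issue description; hypothetical names, not ZooKeeper code.
final class ElectionSketch {

    /**
     * Runs up to maxRounds rounds among the live nodes.
     * Returns the elected id, or -1 if no leader emerged.
     */
    static long elect(Set<Long> live, Map<Long, Long> votes, int clusterSize,
                      boolean resetOnEmpty, int maxRounds) {
        int quorum = clusterSize / 2 + 1; // assumed majority rule
        for (int round = 0; round < maxRounds; round++) {
            // Count only votes for candidates that were heard from (step 4).
            Map<Long, Integer> tally = new HashMap<>();
            for (long voter : live) {
                long candidate = votes.get(voter);
                if (live.contains(candidate)) {
                    tally.merge(candidate, 1, Integer::sum);
                }
            }
            if (tally.isEmpty()) {
                // Step 5: without a reset, the same dead vote is cast forever.
                if (resetOnEmpty) {
                    for (long voter : live) {
                        votes.put(voter, voter); // each node votes for itself again
                    }
                }
                continue;
            }
            long best = -1;
            for (Map.Entry<Long, Integer> e : tally.entrySet()) {
                if (e.getValue() >= quorum) {
                    return e.getKey();
                }
                best = Math.max(best, e.getKey());
            }
            // No winner this round: everyone moves to the highest id heard from.
            for (long voter : live) {
                votes.put(voter, best);
            }
        }
        return -1;
    }

    /** State after step 3: A (1) and B (2) both vote for the failed C (3). */
    static Map<Long, Long> stuckVotes() {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        return votes;
    }

    public static void main(String[] args) {
        Set<Long> live = new HashSet<>(Arrays.asList(1L, 2L)); // C has failed
        System.out.println(elect(live, stuckVotes(), 3, false, 100)); // prints -1
        System.out.println(elect(live, stuckVotes(), 3, true, 100));  // prints 2
    }
}
```

The round limit exists only so the sketch terminates; in the scenario 
described, the stuck case would simply keep looping.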

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
