[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836356#action_12836356 ]

Hudson commented on ZOOKEEPER-569:

Integrated in ZooKeeper-trunk #703 (See [http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/703/])
Failure of elected leader can lead to never-ending leader election (henry via flavio)

Failure of elected leader can lead to never-ending leader election

Key: ZOOKEEPER-569
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-569
Project: Zookeeper
Issue Type: Bug
Reporter: Henry Robinson
Assignee: Henry Robinson
Fix For: 3.3.0
Attachments: zookeeper-569.patch, ZOOKEEPER-569.patch, zookeeper-569.patch, zookeeper-569.patch, zookeeper-569.patch, zookeeper-569.patch

It is possible for basic LeaderElection to enter a situation where it never terminates. As an example, consider a three node cluster A, B and C.

1. In the first round, A votes for A, B votes for B and C votes for C
2. Since C > B > A, all nodes resolve to vote for C in the second round as there is no first round winner
3. A, B vote for C, but C fails.
4. C is not elected because neither A nor B hear from it, and so votes for it are discarded
5. A and B never reset their votes, despite not hearing from C, so continue to vote for it ad infinitum.

Step 5 is the bug. If A and B reset their votes to themselves in the case where the heard-from vote set is empty, leader election will continue.

I do not know if this affects running ZK clusters, as it is possible that the out-of-band failure detection protocols may cause leader election to be restarted anyhow, but I've certainly seen this in tests.

I have a trivial patch which fixes it, but it needs a test (and tests for race conditions are hard to write!)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
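The five steps in the issue description can be sketched as a toy simulation. This is only an illustrative model under simplifying assumptions, not ZooKeeper's LeaderElection code: nodes are plain integers (A=1, B=2, C=3), every live node sees the same set of votes, and the names `run_election`, `fails_after_round` and `reset_when_no_valid_votes` are hypothetical.

```python
from collections import Counter


def run_election(node_ids, fails_after_round, reset_when_no_valid_votes, max_rounds=10):
    """Toy model of the election rounds; returns the leader id or None.

    fails_after_round maps a node id to the last round it takes part in.
    reset_when_no_valid_votes switches between the stuck behaviour (False)
    and the reset-to-self behaviour discussed in the issue (True).
    """
    quorum = len(node_ids) // 2 + 1
    alive = set(node_ids)
    votes = {n: n for n in node_ids}  # round 1: everyone votes for itself

    for round_no in range(1, max_rounds + 1):
        # Simplification: every live node hears from every other live node.
        heard_from = set(alive)
        # Votes for a node we did not hear from are discarded (step 4).
        valid = [votes[n] for n in heard_from if votes[n] in heard_from]
        tally = Counter(valid)

        if tally:
            leader, count = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
            if count >= quorum:
                return leader
            # No winner: everyone switches to the highest id still in play (step 2).
            best = max(tally)
            for n in alive:
                votes[n] = best
        elif reset_when_no_valid_votes:
            # Every vote was discarded, so fall back to voting for ourselves.
            for n in alive:
                votes[n] = n
        # Otherwise the stale votes are kept unchanged (step 5, the bug).

        for node, last_round in fails_after_round.items():
            if last_round == round_no:
                alive.discard(node)

    return None  # no leader within max_rounds


print(run_election([1, 2, 3], {}, False))      # 3
print(run_election([1, 2, 3], {3: 1}, False))  # None
print(run_election([1, 2, 3], {3: 1}, True))   # 2
```

With no failure, C (3) wins in the second round. When C drops out after the first round and votes are never reset, the model never elects anyone; with the reset, A and B go back to voting for themselves and B (2) is elected two rounds later.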
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835970#action_12835970 ]

Patrick Hunt commented on ZOOKEEPER-569:

Henry, there are two patches, please highlight which one the reviewer should review. thx
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836009#action_12836009 ]

Henry Robinson commented on ZOOKEEPER-569:

The most recent patch I submitted is the right patch - it includes Flavio's suggestions.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836082#action_12836082 ]

Hadoop QA commented on ZOOKEEPER-569:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12435629/zookeeper-569.patch
against trunk revision 912052.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/68/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/68/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/68/console

This message is automatically generated.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831203#action_12831203 ]

Mahadev konar commented on ZOOKEEPER-569:

Henry, are you working on a new patch addressing Flavio's comments?
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12831213#action_12831213 ]

Henry Robinson commented on ZOOKEEPER-569:

Yes, hoping to get it out this week.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830521#action_12830521 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-569:

Thanks, Henry, it looks good. I agree with your comment on the confusion between LE being instantiated every time it is used and FLE behaving differently. We should really just have one model.

One comment on the patch is that I don't think you need to instantiate QuorumCnxManager in mockServer() in the new test. The conditional block that checks the listener can also be removed.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829217#action_12829217 ]

Benjamin Reed commented on ZOOKEEPER-569:

i'm also wondering about the heardFrom == 0 check. in your case A and B will still be up, so heardFrom will not be zero. don't you really want to check whether or not you heard from the guy that you think is the leader?
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829220#action_12829220 ]

Henry Robinson commented on ZOOKEEPER-569:

Yes, you're both right! I misread my own notes on the bug :/

I'm writing tests for a *real* fix now. Thanks both for pointing this out.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12829326#action_12829326 ]

Hadoop QA commented on ZOOKEEPER-569:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12434729/zookeeper-569.patch
against trunk revision 903483.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 5 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/65/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/65/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/65/console

This message is automatically generated.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828708#action_12828708 ]

Hadoop QA commented on ZOOKEEPER-569:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12434553/zookeeper-569.patch
against trunk revision 903483.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/61/console

This message is automatically generated.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828764#action_12828764 ]

Patrick Hunt commented on ZOOKEEPER-569:

Mockito looks good. If someone wants to include it as a testing option, please enter a JIRA/patch. :-)
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828823#action_12828823 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-569:

Henry, I was taking a look at the patch, and I'm slightly confused about how it works, so I was wondering if you could give me a hand in understanding it. It seems to me that in the situation you describe, heardFrom won't be empty, so checking for heardFrom == 0 wouldn't work. Instead, I think you have to call countVotes and check if there is any vote left after it returns, no?
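The difference between the two checks raised in the comment above can be illustrated with a small sketch from A's point of view after C has failed. The helpers below are hypothetical stand-ins written for this illustration; in particular `count_votes` only mimics the idea of discarding votes for nodes that were not heard from and is not the signature or behaviour of ZooKeeper's countVotes.

```python
def count_votes(votes, heard_from):
    """Keep only the votes cast for a node that we actually heard from."""
    return {voter: choice for voter, choice in votes.items() if choice in heard_from}


def should_reset_when_nobody_heard(heard_from):
    """Check of the kind questioned in the comment: reset only if nobody answered."""
    return len(heard_from) == 0


def should_reset_when_no_votes_left(votes, heard_from):
    """Suggested check: reset if no vote survives the counting step."""
    return len(count_votes(votes, heard_from)) == 0


# A and B are still up and both still vote for the failed node C.
heard_from = {"A", "B"}
votes = {"A": "C", "B": "C"}

print(should_reset_when_nobody_heard(heard_from))          # False
print(should_reset_when_no_votes_left(votes, heard_from))  # True
```

Because A and B still answer each other, the heard-from set is never empty, so only the second check notices that every remaining vote points at a node nobody can reach.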
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828860#action_12828860 ]

Hadoop QA commented on ZOOKEEPER-569:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12434555/zookeeper-569.patch
against trunk revision 903483.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/62/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/62/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/62/console

This message is automatically generated.
[jira] Commented: (ZOOKEEPER-569) Failure of elected leader can lead to never-ending leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799669#action_12799669 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-569:

One way to implement the test is to implement a mock server to force the particular message interleaving that triggers the bug. No claim it is the best way, but it seemed to be a good idea for FLELostMessageTest.
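The mock-server idea in the comment above might look roughly like the following: a peer that replays a fixed script of replies, so the interleaving that triggers the bug happens deterministically instead of depending on timing. This is a minimal sketch only; `ScriptedPeer` and `poll_votes` are hypothetical names and do not reflect FLELostMessageTest or any ZooKeeper API.

```python
class ScriptedPeer:
    """Mock peer that answers each vote request from a fixed script."""

    def __init__(self, peer_id, script):
        self.peer_id = peer_id
        self._script = list(script)
        self._round = 0

    def request_vote(self):
        """Return the scripted vote for this round, or None for no reply.

        Once the script runs out the peer stays silent, which models a crash.
        """
        reply = self._script[self._round] if self._round < len(self._script) else None
        self._round += 1
        return reply


def poll_votes(peers):
    """Ask every peer once and return {peer_id: vote} for those that replied."""
    replies = {}
    for peer in peers:
        vote = peer.request_vote()
        if vote is not None:
            replies[peer.peer_id] = vote
    return replies


# B (2) votes for itself, then for C (3); C votes for itself once and then fails.
peers = [ScriptedPeer(2, [2, 3, 3]), ScriptedPeer(3, [3])]

print(poll_votes(peers))  # {2: 2, 3: 3}
print(poll_votes(peers))  # {2: 3}
print(poll_votes(peers))  # {2: 3}
```

A test built this way can feed the scripted replies to the election logic under test and assert that it terminates, without needing real sockets or sleeps to provoke the race.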