[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-29 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-275:


Status: Open  (was: Patch Available)

the patch does not apply any mroe. I tried creating a new patch but its a 
little tricky so I didnt want to mess up the patch. Can you regenerate the 
patch Flavio?

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.0.1, 3.0.0
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-29 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-275:
-

Attachment: ZOOKEEPER-275.patch

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.0.0, 3.0.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch, ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-29 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-275:


Attachment: ZOOKEEPER-275.patch

fixed some log.info to be log.debug and fixed the duplicated log messages as 
well.

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.0.0, 3.0.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch, ZOOKEEPER-275.patch, 
 ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-29 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-275:


Attachment: ZOOKEEPER-275.patch

the last patches fail with compile-test since one of the test uses an old api.

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.0.0, 3.0.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch, ZOOKEEPER-275.patch, 
 ZOOKEEPER-275.patch, ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-29 Thread Mahadev konar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated ZOOKEEPER-275:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

I just committed this. Thanks flavio.

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.0.0, 3.0.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch, ZOOKEEPER-275.patch, 
 ZOOKEEPER-275.patch, ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-15 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-275:
-

Attachment: ZOOKEEPER-275.patch

Patch for the problems described. Includes a unit test for the first case. The 
second case is difficult and I still don't have a good idea of how to write a 
unit test for it. In particular, I haven't been able to crash and restart a 
peer in a unit test because when I kill the listener of QuorumCnxManager and 
try to create another instance, it complains that the port is in use. I tried 
using setReuseAddress(true) before binding, but it still doesn't work. 

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Attachments: ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-275) Bug in FastLeaderElection

2009-01-15 Thread Flavio Paiva Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Paiva Junqueira updated ZOOKEEPER-275:
-

Fix Version/s: 3.1.0

 Bug in FastLeaderElection
 -

 Key: ZOOKEEPER-275
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-275
 Project: Zookeeper
  Issue Type: Bug
  Components: leaderElection
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.1.0

 Attachments: ZOOKEEPER-275.patch


 I found an execution in which leader election does not make progress. Here is 
 the problematic scenario:
 - We have an ensemble of 3 servers, and we start only 2;
 - We let them elect a leader, and then crash the one with lowest id, say S_1 
 (call the other S_2);
 - We restart the crashed server.
 Upon restarting S_1, S_2 has its logical clock more advanced, and S_1 has its 
 logical clock set to 1. Once S_1 receives a notification from S_2, it notices 
 that it is in the wrong round and it advances its logical clock to the same 
 value as S_1. Now, the problem comes exactly in this point because in the 
 current code S_1 resets its vote to its initial vote (its own id and zxid). 
 Since S_2 has already notified S_1, it won't do it again, and we are stuck. 
 The patch I'm submitting fixes this problem by setting the vote of S_1 to the 
 one received if it satisfies the total order predicate (received zxid is 
 higher or received zxid is the same and received id is higher).
 Related to this problem, I noticed that by trying to avoid unnecessary 
 notification duplicates, there could be scenarios in which a server fails 
 before electing a leader and restarts before leader election succeeds. This 
 could happen, for example, when there isn't enough servers available and one 
 available crashes and restarts. I fixed this problem in the attached patch by 
 allowing a server to send a new batch of notifications if there is at least 
 one outgoing queue of pending notifications empty. This is ok because we 
 space out consecutive batches of notifications. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.