[ https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flavio Paiva Junqueira updated ZOOKEEPER-362:
---------------------------------------------

    Attachment: ZOOKEEPER-362.patch

This patch fixes the problem in the description. More concretely, it does the following:

1- It synchronizes QuorumCnxManager::connectOne so that there are no competing connections to the same server;
2- It doesn't remove an existing connection in QuorumCnxManager::receiveConnection when winning the challenge;
3- It eliminates the second definition of "ss" in QuorumCnxManager::Listener. This was a pretty silly bug (my fault of course);
4- It adds a deadline to semaphores in FLENewEpochTest so that it doesn't wait indefinitely;
5- If thread 0 finishes before thread 1, then thread 1 initiates a new round after waiting for 1s. This is what happens in a real deployment, as a follower gives up on its elected leader if the elected leader takes too long to acknowledge its leadership. As we don't run the follower/leader part of the code in this test, moving to the next round doesn't happen automatically.

> Issues with FLENewEpochTest
> ---------------------------
>
>                 Key: ZOOKEEPER-362
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Flavio Paiva Junqueira
>             Fix For: 3.2.0
>
>         Attachments: ZOOKEEPER-362.patch
>
>
> I have been able to identify two reasons that cause FLENewEpochTest to fail:
> 1- There is a race condition that is triggered when two peers try to
> establish a connection to each other for leader election. Basically, if they
> start at roughly the same time, the server with the highest id will try to open
> two connections. The two competing connections will cause one notification
> message to be lost. This message happens to be critical for this two-process
> scenario;
> 2- The code to shut down a peer is not working well with the unit tests. For
> this particular unit test, we need to be able to shut down a peer completely
> to check the situation the test tries to reproduce. However, it seems that in
> some runs timing causes the other peers to believe it is still alive, and they
> end up electing it. This peer, however, eventually shuts down and leader
> election fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
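
For readers without the patch at hand, the idea behind item 1 of the patch description (serializing connectOne so that a second, competing connection to the same server is never opened) can be illustrated with a minimal sketch. This is not the code from ZOOKEEPER-362.patch: the class ConnectOnceSketch, the connections map, and the opened counter are hypothetical stand-ins, and a plain Object takes the place of a real socket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the actual QuorumCnxManager code.
public class ConnectOnceSketch {
    // Stand-in for the table of open connections, keyed by server id.
    private final Map<Long, Object> connections = new HashMap<>();

    // Counts how many connections were actually opened (for demonstration).
    public final AtomicInteger opened = new AtomicInteger();

    // synchronized makes the check-then-open sequence atomic, so two threads
    // cannot both observe "no connection yet" and both open one.
    public synchronized void connectOne(long sid) {
        if (connections.containsKey(sid)) {
            return; // a connection to this server already exists
        }
        connections.put(sid, new Object()); // placeholder for opening a socket
        opened.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectOnceSketch mgr = new ConnectOnceSketch();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            // All threads race to connect to the same server id.
            threads[i] = new Thread(() -> mgr.connectOne(1L));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("connections opened: " + mgr.opened.get());
    }
}
```

Without the synchronized keyword, two threads could interleave between the containsKey check and the put, which is the kind of competing-connection race the issue description reports.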