[
https://issues.apache.org/jira/browse/ZOOKEEPER-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flavio Paiva Junqueira updated ZOOKEEPER-362:
---------------------------------------------
Attachment: ZOOKEEPER-362.patch
This patch fixes the problem in the description. More concretely, it does the
following:
1- It synchronizes QuorumCnxManager::connectOne so that there are no competing
connections to the same server;
2- It doesn't remove an existing connection in
QuorumCnxManager::receiveConnection when winning the challenge;
3- It eliminates the second definition of "ss" in QuorumCnxManager::Listener.
This was a pretty silly bug (my fault of course);
4- It adds a deadline to the semaphores in FLENewEpochTest so that the test
doesn't wait indefinitely;
5- If thread 0 finishes before thread 1, then thread 1 initiates a new round
after waiting for 1s. This is what happens in a real deployment as a follower
gives up on its elected leader if the elected leader takes too long to
acknowledge its leadership. As we don't run the follower/leader part of the
code in this test, moving to the next round doesn't happen automatically.
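
To illustrate item 4, here is a minimal sketch of waiting on a semaphore with
a deadline. It is not the code from the patch: the class name, the helper
awaitWithDeadline, and the 100 ms timeout are all made up for the example. The
only real API used is java.util.concurrent.Semaphore#tryAcquire(long, TimeUnit).

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not part of ZooKeeper or of the patch.
class SemaphoreDeadlineSketch {

    // Waits for a permit, but gives up once timeoutMs has elapsed.
    // Returns true if a permit was acquired, false if the deadline passed.
    static boolean awaitWithDeadline(Semaphore sem, long timeoutMs)
            throws InterruptedException {
        return sem.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore finished = new Semaphore(0);

        // Nobody has released a permit yet, so this returns false after
        // roughly 100 ms instead of blocking forever as acquire() would.
        System.out.println("acquired: " + awaitWithDeadline(finished, 100));

        // Once a permit is available, the same call succeeds immediately.
        finished.release();
        System.out.println("acquired: " + awaitWithDeadline(finished, 100));
    }
}
```

A test written this way can check the boolean result and fail with a clear
message when the deadline passes, rather than hanging the whole test run.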
> Issues with FLENewEpochTest
> ---------------------------
>
> Key: ZOOKEEPER-362
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-362
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Flavio Paiva Junqueira
> Fix For: 3.2.0
>
> Attachments: ZOOKEEPER-362.patch
>
>
> I have been able to identify two reasons that cause FLENewEpochTest to fail:
> 1- There is a race condition that is triggered when two peers try to
> establish a connection to each other for leader election. Basically, if they
> start roughly at the same time, the server with the highest id will try to open
> two connections. The two competing connections cause one notification
> message to be lost. This message happens to be critical for this two-process
> scenario;
> 2- The code to shut down a peer is not working well with the unit tests. For
> this particular unit test, we need to be able to shut down a peer completely
> to check the situation the test tries to reproduce. However, it seems that in
> some runs timing causes the other peers to believe it is still alive, and they
> end up electing it. This peer, however, eventually shuts down and leader election
> fails.
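
As a rough illustration of the race in item 1 of the description, and of why
synchronizing the connect path removes it, here is a small sketch. It is not
QuorumCnxManager: the class, the map of connections, and the counter are
hypothetical stand-ins, and "opening a connection" is reduced to incrementing
a counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a connection manager; not ZooKeeper code.
class ConnectOneSketch {
    private final Map<Long, String> connections = new HashMap<>();
    final AtomicInteger opened = new AtomicInteger();

    // synchronized makes the "is there a connection?" check and the
    // "open one" step a single atomic action, so two callers cannot both
    // conclude that a connection to the same server is missing.
    synchronized void connectOne(long sid) {
        if (connections.containsKey(sid)) {
            return; // a connection to this server already exists
        }
        opened.incrementAndGet(); // stands in for opening a socket
        connections.put(sid, "conn-to-" + sid);
    }

    // Starts several threads that all try to connect to server 2 at once
    // and reports how many connections were actually opened.
    static int openedAfterConcurrentCalls(int callers) throws InterruptedException {
        ConnectOneSketch mgr = new ConnectOneSketch();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> mgr.connectOne(2L));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return mgr.opened.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("connections opened: " + openedAfterConcurrentCalls(8));
    }
}
```

With the synchronized keyword in place, only one connection is opened no
matter how the threads interleave. Without it, two callers could both pass the
containsKey check and open competing connections, which is the situation the
description says leads to a lost notification.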