Flavio Paiva Junqueira updated ZOOKEEPER-647:

    Attachment: ZOOKEEPER-647.patch

By inspecting the code of QuorumCnxManager, I've been able to find a corner 
case that could cause a RecvWorker thread to hang during shutdown. Here is a 
summary of how the problem can occur:

1- sendWorkerMap is updated during the execution of softHalt (cnx manager is 
being shut down);
2- The sender worker that was not shut down during the execution of softHalt 
will later leave its main loop because the value of the attribute shutdown is 
true;
3- Leaving the loop due to shutdown evaluating to true does not cause finish() 
to be called, which must happen to kill its recv worker sibling.
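
To illustrate the three steps above, here is a minimal, self-contained sketch 
of the pattern. It is not the attached patch and not the actual 
QuorumCnxManager code: the class WorkerShutdownSketch, its nested worker 
classes, the queue, and the timeouts are all hypothetical stand-ins, and the 
recv worker sleeps in a loop instead of blocking on a socket. The only point 
it shows is that when the sender calls finish() on every exit path, including 
the one where shutdown evaluates to true, its recv worker sibling is stopped 
as well.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names do not match the real QuorumCnxManager code.
public class WorkerShutdownSketch {
    // Stand-in for the manager-wide flag set when shutting down.
    static volatile boolean shutdown = false;

    // Stand-in for the receiving thread; it runs until finish() is called.
    static class RecvWorker extends Thread {
        volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                try {
                    Thread.sleep(10); // simulates waiting for input
                } catch (InterruptedException e) {
                    // interrupted by finish(); loop re-checks the flag
                }
            }
        }

        void finish() {
            running = false;
            interrupt();
        }
    }

    // Stand-in for the sending thread that owns a RecvWorker sibling.
    static class SendWorker extends Thread {
        final RecvWorker recvWorker;
        final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        volatile boolean running = true;

        SendWorker(RecvWorker recvWorker) {
            this.recvWorker = recvWorker;
        }

        @Override
        public void run() {
            try {
                // The loop can end because shutdown became true (step 2).
                while (running && !shutdown) {
                    queue.poll(50, TimeUnit.MILLISECONDS);
                }
            } catch (InterruptedException e) {
                // fall through to the cleanup below
            } finally {
                // Without this call, the exit path of step 3 would
                // leave the sibling RecvWorker running.
                finish();
            }
        }

        void finish() {
            running = false;
            recvWorker.finish();
        }
    }

    // Returns true if both threads stopped after only the flag was set.
    static boolean demo() {
        shutdown = false;
        RecvWorker recv = new RecvWorker();
        SendWorker send = new SendWorker(recv);
        recv.start();
        send.start();
        shutdown = true; // the halt skipped this worker; only the flag is set
        try {
            send.join(2000);
            recv.join(2000);
        } catch (InterruptedException e) {
            return false;
        }
        return !send.isAlive() && !recv.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both workers stopped: " + demo());
    }
}
```

If the finally block is removed from the sketch, the sender still exits once 
shutdown is true, but the recv worker keeps running, which corresponds to the 
hang described above.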

I'm proposing a fix that is quite simple. The correct interleaving to trigger 
the bug is quite difficult to reproduce, though, so I'm not providing a test.

> hudson failure in testLeaderShutdown
> ------------------------------------
>                 Key: ZOOKEEPER-647
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-647
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Flavio Paiva Junqueira
>            Priority: Critical
>             Fix For: 3.3.0
>         Attachments: ZOOKEEPER-647.patch
> http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/666/testReport/org.apache.zookeeper.test/QuorumTest/testLeaderShutdown/
> junit.framework.AssertionFailedError: QP failed to shutdown in 30 seconds
>       at org.apache.zookeeper.test.QuorumBase.shutdown(QuorumBase.java:293)
>       at org.apache.zookeeper.test.QuorumBase.shutdownServers(QuorumBase.java:281)
>       at org.apache.zookeeper.test.QuorumBase.tearDown(QuorumBase.java:266)
>       at org.apache.zookeeper.test.QuorumTest.tearDown(QuorumTest.java:55)
> Flavio, can you triage this one?
