QuorumCnxManager blocks forever 
--------------------------------

                 Key: ZOOKEEPER-914
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
             Project: Zookeeper
          Issue Type: Bug
            Reporter: Vishal K
            Assignee: Vishal K
            Priority: Blocker


This was a disaster. While testing our application we ran into a scenario where 
a rebooted follower could not join the cluster. Further debugging showed that 
the follower could not join because the QuorumCnxManager on the leader was 
blocked for an indefinite amount of time in receiveConnection():

"Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable 
[0x00007fa9275ed000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
    at sun.nio.ch.IOUtil.read(IOUtil.java:206)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
    - locked <0x00007fa93315f988> (a java.lang.Object)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)

I had pointed out this bug along with several other problems in 
QuorumCnxManager earlier in 
https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
https://issues.apache.org/jira/browse/ZOOKEEPER-822.

I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
and a patch will be out soon. 

The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
It does a read() in receiveConnection() and a write() in initiateConnection().
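
To illustrate the failure mode and one possible way to bound it, here is a 
minimal sketch. It is not the actual patch. It assumes the handshake starts 
with an 8-byte server id, and the names BoundedHandshakeRead, readServerId and 
timesOutOnSilentPeer are hypothetical. It uses a plain java.net.Socket with 
SO_TIMEOUT, because as far as I know SO_TIMEOUT applies to reads on the 
socket's InputStream and not to SocketChannel.read() in blocking mode.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class BoundedHandshakeRead {

    // Hypothetical helper: read the peer's server id, but give up after
    // timeoutMs instead of blocking the listener thread forever.
    // Assumption: the handshake begins with an 8-byte (long) server id.
    static long readServerId(Socket sock, int timeoutMs) throws IOException {
        sock.setSoTimeout(timeoutMs); // bounds every read on the InputStream
        DataInputStream in = new DataInputStream(sock.getInputStream());
        return in.readLong();         // throws SocketTimeoutException on timeout
    }

    // Simulates a peer that connects and then never sends anything.
    // Returns true if the read gave up with a timeout.
    static boolean timesOutOnSilentPeer(int timeoutMs) {
        try (ServerSocket listener =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silentPeer =
                 new Socket(listener.getInetAddress(), listener.getLocalPort());
             Socket accepted = listener.accept()) {
            try {
                readServerId(accepted, timeoutMs);
                return false; // unexpected: the peer sent a full server id
            } catch (SocketTimeoutException expected) {
                return true;  // the read was bounded; the listener can move on
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOutOnSilentPeer(200));
    }
}
```

With a timeout in place, a peer that connects and then goes silent (for 
example because it rebooted mid-handshake) costs the listener at most 
timeoutMs, after which it can close the socket and accept the next connection.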

Sorry, but this is really bad programming. It also points to a lack of failure 
tests for QuorumCnxManager.

