Flavio Junqueira commented on ZOOKEEPER-914:
Hi Vishal, The Socket documentation does sound ambiguous, but my understanding
is that SO_TIMEOUT is for blocking mode, not non-blocking mode. Non-blocking
calls return immediately, so they shouldn't need a timeout value, no?
Independent of using it or not, I would be curious to learn if my understanding
is correct.
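For reference, here is a minimal sketch of the case where SO_TIMEOUT is known to apply: a blocking read through a plain java.net.Socket input stream. The class and method names (SoTimeoutSketch, readTimesOut) are made up for illustration and are not part of ZooKeeper. It does not settle whether the option has any effect on SocketChannel.read(), which is the open question here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not ZooKeeper code.
public class SoTimeoutSketch {

    /** Returns true if a blocking read from a silent peer times out. */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // SO_TIMEOUT bounds blocking reads on the socket's input stream.
            accepted.setSoTimeout(timeoutMs);
            InputStream in = accepted.getInputStream();
            try {
                in.read(); // the client never writes, so this can only time out
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

Without the setSoTimeout() call, the read() above would block until the peer sends data or closes the connection.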
About the release to include the fix, I think Mahadev later came and changed it
to 3.3.3. It is fine with me, and we just need to check what the schedule for
3.3.3 is. My preference is to work directly on ZOOKEEPER-900 (or 901, which I
think might be a more significant change), if you think we can produce a patch
in time for 3.3.3.
> QuorumCnxManager blocks forever
> Key: ZOOKEEPER-914
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.3, 3.4.0
> This was a disaster. While testing our application we ran into a scenario
> where a rebooted follower could not join the cluster. Further debugging
> showed that the follower could not join because the QuorumCnxManager on the
> leader was blocked for an indefinite amount of time in receiveConnection():
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> - locked <0x00007fa93315f988> (a java.lang.Object)
> I had pointed out this bug along with several other problems in
> QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix
> and a patch will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode.
> It does a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of
> failure tests for QuorumCnxManager.
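One way to keep a channel read from blocking forever is to switch the SocketChannel to non-blocking mode and wait for readability with a Selector under a deadline. The sketch below illustrates that general technique only. It is not the patch for this issue, and the names (BoundedChannelRead, readFully, demo) are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, not ZooKeeper code.
public class BoundedChannelRead {

    /**
     * Fills buf from ch. Returns false if the timeout elapses or the peer
     * closes the connection before buf is full.
     */
    static boolean readFully(SocketChannel ch, ByteBuffer buf, long timeoutMs) throws IOException {
        ch.configureBlocking(false);
        long deadline = System.currentTimeMillis() + timeoutMs;
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_READ);
            while (buf.hasRemaining()) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // deadline passed with data still missing
                }
                selector.select(remaining); // waits at most 'remaining' ms
                selector.selectedKeys().clear();
                if (ch.read(buf) < 0) {
                    return false; // peer closed the connection
                }
            }
            return true;
        }
    }

    /** Connects a loopback client that optionally sends 8 bytes, then reads them with a bound. */
    static boolean demo(boolean peerWrites, long timeoutMs) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                if (peerWrites) {
                    ByteBuffer out = ByteBuffer.allocate(8).putLong(0, 42L);
                    while (out.hasRemaining()) {
                        client.write(out);
                    }
                }
                return readFully(accepted, ByteBuffer.allocate(8), timeoutMs);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("silent peer, read completed: " + demo(false, 200));
        System.out.println("writing peer, read completed: " + demo(true, 2000));
    }
}
```

With a silent peer the read gives up once the deadline passes instead of hanging, so a caller can close the connection and carry on.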