[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929681#action_12929681 ]
Vishal K commented on ZOOKEEPER-914:
------------------------------------

Hi Flavio,

The documentation is not clear. SO_TIMEOUT has no effect on blocking channels. Non-blocking channels wait for the specified timeout if nothing is available in the buffer; otherwise, they return whatever bytes are currently available in the buffer. I wrote the following test to verify this. Let me know if you know of a way to make SO_TIMEOUT work.

    QuorumPeer peerLeader = new QuorumPeer(peers, tmpdir, tmpdir, port, 3, 0, 2, 2, 2);
    QuorumCnxManager cnxManager = new QuorumCnxManager(peerLeader);
    QuorumCnxManager.Listener listener = cnxManager.listener;

    SocketChannel channel = SocketChannel.open();
    channel.socket().connect(peers.get(new Long(0)).electionAddr, 5000);
    channel.configureBlocking(false);
    channel.socket().setSoTimeout(1000);

    // The array brackets and length were lost in this copy of the message;
    // the length 8 is an assumption.
    byte[] msgBytes = new byte[8];
    ByteBuffer msgBuffer = ByteBuffer.wrap(msgBytes);

    /**
     * Don't send any data and call read() and see how long it waits.
     */
    long begin = System.currentTimeMillis();
    channel.read(msgBuffer);
    long end = System.currentTimeMillis();

Feel free to close duplicate bugs.

> QuorumCnxManager blocks forever
> --------------------------------
>
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
>
> This was a disaster. While testing our application we ran into a scenario
> where a rebooted follower could not join the cluster. Further debugging
> showed that the follower could not join because the QuorumCnxManager on the
> leader was blocked for an indefinite amount of time in receiveConnection():
>
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>         - locked <0x00007fa93315f988> (a java.lang.Object)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
>
> I had pointed out this bug along with several other problems in
> QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and
> https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix
> and a patch will be out soon.
>
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode.
> It does a read() in receiveConnection() and a write() in initiateConnection().
>
> Sorry, but this is really bad programming. It also points to a lack of
> failure tests for QuorumCnxManager.
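
For comparison, here is a minimal sketch of one way to bound a read, assuming the connection can use a plain blocking java.net.Socket and its InputStream instead of SocketChannel.read(). SO_TIMEOUT is documented to apply to read() calls on the socket's InputStream. The class name TimedReadSketch, the helper readTimesOut, and the silent loopback listener are hypothetical names made up for this illustration; they are not part of ZooKeeper, and this is not necessarily how the issue was patched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimedReadSketch {

    /**
     * Connects to a local peer that never sends data and tries to read one
     * byte with SO_TIMEOUT set. Returns true if the read gave up with a
     * SocketTimeoutException instead of blocking indefinitely.
     */
    public static boolean readTimesOut(int timeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Hypothetical silent peer: the listen backlog completes the TCP
        // handshake, but nothing is ever written to the connection.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(loopback, silentPeer.getLocalPort()), 5000);
            sock.setSoTimeout(timeoutMs);
            InputStream in = sock.getInputStream();
            try {
                in.read(); // blocks for at most timeoutMs
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Same measurement as the test above: send nothing, then time the read.
        long begin = System.currentTimeMillis();
        boolean timedOut = readTimesOut(1000);
        long end = System.currentTimeMillis();
        System.out.println("timed out: " + timedOut + ", waited about " + (end - begin) + " ms");
    }
}
```

The difference from the test in the comment is that the read goes through the socket's stream, not through the channel, so the timeout set with setSoTimeout() is the one that governs how long read() waits.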