[ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929345#action_12929345 ]
Flavio Junqueira commented on ZOOKEEPER-914: -------------------------------------------- Hi Vishal, I also appreciate your contributions and your comments. I also understand your frustration when you find issues with the code, but think that it is possibly equally frustrating for the developer who thought that at least basic issues were covered, so please try to think that we don't introduce bugs on purpose (at least I don't) and our review process is not perfect. Regarding clover reports, we have agreed already that code coverage is not bulletproof, and in fact there has been several other metrics proposed in the scientific literature, but it does indicate that some call path including a give piece of code was exercised. It certainly doesn't measure more complex cases, like race conditions, crashes and so on. In fact, if you have a better way of measuring test coverage, I'd happy to hear about it. I'm not sure if you agree, but it seems to me that we should close this jira because the technical discussion here seems to be similar to the one of ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what will happen to this jira: # My point about SO_TIMEOUT comes from here: http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29 # I obviously prefer to go with real fixes instead of hacking, but we need to have release 3.3.2 out, and it sounded like introducing a configurable timeout would fix your problem until the next release; # About testing beyond the handshake, I'm not sure what you're proposing. If the blocking calls are part of the handshake and this is what is failing for you, then this is what we should target now, no? > QuorumCnxManager blocks forever > -------------------------------- > > Key: ZOOKEEPER-914 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914 > Project: Zookeeper > Issue Type: Bug > Components: leaderElection > Reporter: Vishal K > Assignee: Vishal K > Priority: Blocker > Fix For: 3.3.3, 3.4.0 > > > This was a disaster. While testing our application we ran into a scenario > where a rebooted follower could not join the cluster. Further debugging > showed that the follower could not join because the QuorumCnxManager on the > leader was blocked for indefinite amount of time in receiveConnect() > "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable > [0x00007fa9275ed000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233) > at sun.nio.ch.IOUtil.read(IOUtil.java:206) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) > - locked <0x00007fa93315f988> (a java.lang.Object) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501) > I had pointed out this bug along with several other problems in > QuorumCnxManager earlier in > https://issues.apache.org/jira/browse/ZOOKEEPER-900 and > https://issues.apache.org/jira/browse/ZOOKEEPER-822. > I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix > and a patch will be out soon. > The problem is that QuorumCnxManager is using SocketChannel in blocking mode. > It does a read() in receiveConnection() and a write() in initiateConnection(). > Sorry, but this is really bad programming. Also, points out to lack of > failure tests for QuorumCnxManager. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.