Flavio Junqueira commented on ZOOKEEPER-914:

Hi Vishal, I also appreciate your contributions and your comments. I also 
understand your frustration when you find issues with the code, but think that 
it is possibly equally frustrating for the developer who thought that at least 
basic issues were covered, so please try to think that we don't introduce bugs 
on purpose (at least I don't) and our review process is not perfect. 

Regarding clover reports, we have agreed already that code coverage is not 
bulletproof, and in fact there has been several other metrics proposed in the 
scientific literature, but it does indicate that some call path including a 
give piece of code was exercised. It certainly doesn't measure more complex 
cases, like race conditions, crashes and so on. In fact, if you have a better 
way of measuring test coverage, I'd happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira 
because the technical discussion here seems to be similar to the one of 
ZOOKEEPER-900. I'll try to address the concerns you raised regardless of what 
will happen to this jira:

# My point about SO_TIMEOUT comes from here: 
# I obviously prefer to go with real fixes instead of hacking, but we need to 
have release 3.3.2 out, and it sounded like introducing a configurable timeout 
would fix your problem until the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If 
the blocking calls are part of the handshake and this is what is failing for 
you, then this is what we should target now, no?   

> QuorumCnxManager blocks forever 
> --------------------------------
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
> This was a disaster. While testing our application we ran into a scenario 
> where a rebooted follower could not join the cluster. Further debugging 
> showed that the follower could not join because the QuorumCnxManager on the 
> leader was blocked for indefinite amount of time in receiveConnect()
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable 
> [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     - locked <0x00007fa93315f988> (a java.lang.Object)
>     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in 
> QuorumCnxManager earlier in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and 
> https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix 
> and a patch will be out soon. 
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode. 
> It does a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. Also, points out to lack of 
> failure tests for QuorumCnxManager.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

Reply via email to