Patrick Hunt commented on ZOOKEEPER-914:
Hi Vishal, we do appreciate your feedback and interest. You've been doing a
great job highlighting issues and working to resolve them. Again, thanks.
We also feel your frustrations. We wish we had unlimited time and resources to
develop and test ZK; unfortunately, that's not the case. This is one of the many
reasons why we brought the project to Apache: to build community and gain
insights from developers and users such as yourself. Is everything "done", is it
all "perfect" code? No. However the source is open, the process is open, and we
hope that more contributors will sign on to working together and making
significant contributions. This doesn't have to be just new features, it very
much could be testing (code and QA), documentation and all the other bits that
go into useful software.
I encourage you to bring your QA related concerns to the larger group. That's
something that should be discussed on the dev list rather than here in a jira
for a specific issue. As you can see, the primary committers work hard to
address all the issues found. However, there are just not enough of us (and we
ourselves work on this in our spare time to varying degrees). Perhaps others
will feel similarly and you can work to address some of the deficiencies. I'd
*love* to see more unit tests and more system testing. If you want to make that
happen I'd do my best to support you.
Regards. (I'll let Flavio comment on the further specifics of this particular
> QuorumCnxManager blocks forever
> Key: ZOOKEEPER-914
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.3, 3.4.0
> This was a disaster. While testing our application we ran into a scenario
> where a rebooted follower could not join the cluster. Further debugging
> showed that the follower could not join because the QuorumCnxManager on the
> leader was blocked for an indefinite amount of time in receiveConnection():
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> - locked <0x00007fa93315f988> (a java.lang.Object)
> I had pointed out this bug along with several other problems in
> QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix
> and a patch will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode.
> It does a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of
> failure tests for QuorumCnxManager.
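To illustrate the failure mode described above, here is a minimal sketch of one way to bound a blocking read. It is not the ZooKeeper patch and does not use QuorumCnxManager code: the class BoundedReadSketch and the methods readWithTimeout and silentPeerTimesOut are hypothetical names, and the sketch assumes a plain java.net.Socket with SO_TIMEOUT set rather than a SocketChannel. A peer connects and then sends nothing, which is the situation in which an unbounded read() would wait forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not taken from the ZooKeeper code base.
public class BoundedReadSketch {

    /**
     * Tries to read one byte from the peer, giving up after timeoutMs.
     * Returns true if a byte or end-of-stream arrived, false on timeout.
     */
    static boolean readWithTimeout(Socket sock, int timeoutMs) throws IOException {
        // SO_TIMEOUT bounds blocking reads on this socket's InputStream.
        sock.setSoTimeout(timeoutMs);
        InputStream in = sock.getInputStream();
        try {
            in.read();
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        }
    }

    /**
     * Connects a client that never writes to a local listener and reports
     * whether the accepting side's read timed out instead of hanging.
     */
    static boolean silentPeerTimesOut(int timeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket listener = new ServerSocket(0, 1, loopback);
             Socket silentPeer = new Socket()) {
            silentPeer.connect(new InetSocketAddress(loopback, listener.getLocalPort()));
            try (Socket accepted = listener.accept()) {
                return !readWithTimeout(accepted, timeoutMs);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + silentPeerTimesOut(200));
    }
}
```

Without the setSoTimeout call, the read in this sketch would block for as long as the silent peer keeps the connection open. A failure test along these lines (a peer that connects and then stalls) is the kind of case the description says is missing.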