Flavio Junqueira commented on ZOOKEEPER-880:

One problem here is that we had some discussions over IRC and the information 
is not reflected here. 

If you have a look at the logs, you'll observe this:


2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request 
2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request: 0
2010-09-28 10:31:22,227 DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager: Address of remote peer: 0
2010-09-28 10:31:22,229 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof

If I remember the discussion with J-D correctly, the node trying to connect is 
running Nagios. My conjecture at the time was that the IOException was killing 
the receiver thread but not the sender thread, because RecvWorker.finish() 
does not close its SendWorker counterpart.
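
To make the conjecture concrete, here is a minimal sketch of how a sender 
thread can be left behind when only its receiver is finished. It is a 
simplified, hypothetical model, not the actual QuorumCnxManager code: the 
class WorkerPairSketch, the closeSenderOnFinish flag, and the senderLeaks 
helper are names invented for illustration, and the queue-polling loop is 
only an assumption about what a sender loop might look like.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of a paired sender/receiver.
// This is NOT the real QuorumCnxManager implementation.
public class WorkerPairSketch {

    /** Sender thread: drains a queue until it is told to stop. */
    static class SendWorker extends Thread {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        private volatile boolean running = true;

        @Override
        public void run() {
            try {
                while (running) {
                    // Poll with a timeout so the loop re-checks 'running'.
                    queue.poll(100, TimeUnit.MILLISECONDS);
                }
            } catch (InterruptedException e) {
                // Interrupted by finish(): fall through and exit.
            }
        }

        void finish() {
            running = false;
            interrupt();
        }
    }

    /** Receiver side; only its finish() path is modeled here. */
    static class RecvWorker {
        private final SendWorker sender;
        private final boolean closeSenderOnFinish; // assumption for the sketch

        RecvWorker(SendWorker sender, boolean closeSenderOnFinish) {
            this.sender = sender;
            this.closeSenderOnFinish = closeSenderOnFinish;
        }

        /** Called when the receive loop dies, e.g. on an IOException. */
        void finish() {
            if (closeSenderOnFinish) {
                sender.finish();
            }
            // Otherwise the sender keeps running with nobody to stop it.
        }
    }

    /** Returns true if the sender is still alive after the receiver finishes. */
    static boolean senderLeaks(boolean closeSenderOnFinish) {
        SendWorker sw = new SendWorker();
        sw.start();
        RecvWorker rw = new RecvWorker(sw, closeSenderOnFinish);
        rw.finish(); // stands in for the "Channel eof" failure path
        try {
            sw.join(1000);
            boolean alive = sw.isAlive();
            sw.finish(); // clean up so the sketch itself does not leak
            sw.join();
            return alive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sender leaks, receiver only:   " + senderLeaks(false));
        System.out.println("sender leaks, both finished:   " + senderLeaks(true));
    }
}
```

If the conjecture holds, every broken connection from the monitoring node 
would leave one such orphaned sender behind, which would match a thread count 
that grows steadily rather than in bursts.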

Your point is good, but the race you mention would have to be triggered 
continuously to make the number of SendWorker threads grow steadily. That 
sounds unlikely to me.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, 
> TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
> We're seeing an issue where one server in the ensemble has a steadily growing 
> number of QuorumCnxManager$SendWorker threads up to a point where the OS runs 
> out of native threads, and at the same time we see a lot of exceptions in the 
> logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs 
> in a moment.
