Jean-Daniel Cryans commented on ZOOKEEPER-880:

bq. to be overly clear - this is happening on just 1 server, the other servers 
on the cluster are not seeing this, is that right?

Yes, sv4borg9.

bq. any insight on GC and JVM activity. Are there significant pauses on the GC, 
or perhaps swapping of that jvm? How active is the JVM? How active (cpu) are 
the other processes on this host? You mentioned they are using 50% disk, what 
about cpu?

No swapping. GC activity looks normal as far as I can tell from the GC log, and 
top shows that process using 1 CPU (the rest of the CPUs are idle most of 
the time).

bq. If I understood correctly the JVM hosting the ZK server is hosting other 
code as well, is that right? You mentioned something about hbase managing the 
ZK server, could you elaborate on that as well?

That machine is also the Namenode, JobTracker and HBase master (all in their 
own JVMs). The only thing special is that the quorum peers are started by HBase.

bq. Is there a way you could move the ZK datadir on that host to an unused 
spindle and see if that helps at all?

I'll look into that.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
> hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
> We're seeing an issue where one server in the ensemble has a steadily growing 
> number of QuorumCnxManager$SendWorker threads, up to the point where the OS 
> runs out of native threads, and at the same time we see a lot of exceptions in 
> the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach thread dumps and logs 
> in a moment.
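
One rough way to track the thread growth described above is to count SendWorker 
stacks in successive jstack dumps. The sketch below is only an illustration: the 
class name SendWorkerCount and the frame marker string are assumptions, not 
anything taken from the attached dump. It assumes each live SendWorker thread 
shows exactly one QuorumCnxManager$SendWorker.run( frame in the dump.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper.
public class SendWorkerCount {
    // Assumed marker: the run() frame of the inner class named in the issue title.
    static final String MARKER = "QuorumCnxManager$SendWorker.run(";

    // Counts the dump lines that contain the marker frame.
    static long count(List<String> dumpLines) {
        return dumpLines.stream()
                .filter(line -> line.contains(MARKER))
                .count();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a saved jstack dump.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(count(lines));
    }
}
```

Running it against dumps taken a few minutes apart on the affected server would 
show whether the count keeps climbing.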
