[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934700#action_12934700
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

+1 It fixes the leak. Thanks!

 QuorumCnxManager$SendWorker grows without bounds
 

 Key: ZOOKEEPER-880
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.2.2
Reporter: Jean-Daniel Cryans
Assignee: Vishal K
Priority: Critical
 Fix For: 3.3.3, 3.4.0

 Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, 
 hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, 
 TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz, ZOOKEEPER-880.patch


 We're seeing an issue where one server in the ensemble has a steadily growing 
 number of QuorumCnxManager$SendWorker threads, up to the point where the OS runs 
 out of native threads; at the same time we see a lot of exceptions in the 
 logs. This is on 3.2.2 and our config looks like:
 {noformat}
 tickTime=3000
 dataDir=/somewhere_thats_not_tmp
 clientPort=2181
 initLimit=10
 syncLimit=5
 server.0=sv4borg9:2888:3888
 server.1=sv4borg10:2888:3888
 server.2=sv4borg11:2888:3888
 server.3=sv4borg12:2888:3888
 server.4=sv4borg13:2888:3888
 {noformat}
 The issue is on the first server. I'm going to attach thread dumps and logs 
 in a moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933718#action_12933718
 ] 

Flavio Junqueira commented on ZOOKEEPER-880:


One problem here is that we had some discussions over IRC, and that information 
is not reflected here. 

If you have a look at the logs, you'll observe this:

{noformat}

2010-09-28 10:31:22,227 DEBUG 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request 
/10.10.20.5:41861
2010-09-28 10:31:22,227 DEBUG 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection request: 0
2010-09-28 10:31:22,227 DEBUG 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Address of remote peer: 0
2010-09-28 10:31:22,229 WARN 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:595)
{noformat}

If I remember the discussion with J-D correctly, that node trying to connect is 
running Nagios. My conjecture at the time was that the IOException was killing 
the receiver thread, but not the sender thread (RecvWorker.finish() does not 
close its SendWorker counterpart).

Your point is good, but it sounds like the race you mention would have to 
be triggered continuously to cause the number of SendWorker threads to grow 
steadily. That seems unlikely to me.
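
For readers following the mechanism Flavio describes, here is a minimal sketch of the failure mode: a receiver that dies on EOF without tearing down its paired sender. This is an illustration of the pattern in simplified Java, not the actual 3.2.2 QuorumCnxManager source.

{noformat}
import java.io.DataInputStream;
import java.io.IOException;

// Simplified sketch (not the verbatim 3.2.2 source): the receiver exits on
// EOF, but nothing stops the paired sender thread.
class RecvWorkerSketch extends Thread {
    private final Thread pairedSender;   // the SendWorker for the same socket
    private final DataInputStream din;

    RecvWorkerSketch(Thread pairedSender, DataInputStream din) {
        this.pairedSender = pairedSender;
        this.din = din;
    }

    @Override
    public void run() {
        try {
            while (true) {
                din.readInt();           // peer closes the socket -> IOException
            }
        } catch (IOException e) {
            // The receiver dies here ("Connection broken: ... Channel eof"),
            // but pairedSender is never finished or interrupted: it stays
            // blocked on its queue, leaking one thread per broken connection.
        }
    }
}
{noformat}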




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933829#action_12933829
 ] 

Vishal K commented on ZOOKEEPER-880:


Hi Flavio,

You are right. We can see RecvWorker exiting, but there are no messages from SendWorker.

2010-09-27 16:02:59,111 WARN 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:595)
2010-09-27 16:02:59,162 WARN 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:595)
2010-09-27 16:03:14,269 WARN 
org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
java.io.IOException: Channel eof
at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:595)

I thought that RecvWorker in 3.3.1 called sw.finish() before exiting. Adding 
this call to RecvWorker should fix the problem.


-Vishal
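
A minimal sketch of the fix Vishal proposes: shut down the SendWorker counterpart in RecvWorker's exit path, so an IOException on the receive side no longer orphans the sender. The names sw, running, readPacket() and LOG stand in for the real fields and calls; this is illustrative, not the committed patch.

{noformat}
// Sketch of the proposed RecvWorker.run() exit path (illustrative only;
// sw, running, readPacket() and LOG are stand-ins for the real members).
public void run() {
    try {
        while (running) {
            readPacket();          // throws IOException on "Channel eof"
        }
    } catch (IOException e) {
        LOG.warn("Connection broken:", e);
    } finally {
        if (sw != null) {
            sw.finish();           // the missing call: also stop the sender
        }
    }
}
{noformat}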




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933842#action_12933842
 ] 

Vishal K commented on ZOOKEEPER-880:


The leader has the same problem as well. LearnerHandler expects a QuorumPacket to 
be received as the first packet after a connection. However, if Nagios were 
monitoring the server port as well, then one would expect to see a lot of 
messages like:
LOG.error("First packet " + qp.toString()
+ " is not FOLLOWERINFO or OBSERVERINFO!");

Is Nagios not monitoring the server port?





[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933871#action_12933871
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

Nagios is monitoring all 5 ensemble members the exact same way (checking 
connectivity on all 3 ports), although only 1 machine shows the issue. We tried 
stopping the monitoring on the problematic machine, but still got a growing 
number of threads.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933891#action_12933891
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


Flavio (and others), we should update the docs to include details on which ports 
can/should be monitored, and which ports should NOT be monitored (or, if 
monitoring is supported, under what conditions).

Can we update the docs as part of any patch/fix? Thanks.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933897#action_12933897
 ] 

Benoit Sigoure commented on ZOOKEEPER-880:
--

Do we agree that monitoring wasn't causing the issue? As JD said, even after 
we stopped it, the problem recurred.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933928#action_12933928
 ] 

Flavio Junqueira commented on ZOOKEEPER-880:


I think we agree that monitoring alone was not causing the issue. But your 
logs indicate that there were some orphan threads due to the monitoring, as we 
can see from excerpts like the one I posted above. Without the monitoring, the 
same problem is still being triggered, though apparently in a different way, and 
it is not clear why. You can see it from all the "Channel eof" messages in the 
log. 

To solve this issue, we need to understand the following:

# What's causing those IOExceptions?
# Why are we even starting a new connection if there is no leader election 
going on? 

Do you folks have any idea if there is anything in your environment that could 
be causing those TCP connections to break? 




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-18 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933602#action_12933602
 ] 

Vishal K commented on ZOOKEEPER-880:


Hi Benoit,

May I suggest seeing if you can reproduce this problem with 3.3.3
(with the patch for ZOOKEEPER-822)? I was going through
QuorumCnxManager.java for 3.2.2. It clearly leaks a SendWorker thread
for every other connection.

After receiving a connection from a peer, it creates new worker threads and
inserts the sender's reference into senderWorkerMap:

SendWorker sw = new SendWorker(s, sid);
RecvWorker rw = new RecvWorker(s, sid);
sw.setRecv(rw);

SendWorker vsw = senderWorkerMap.get(sid);
senderWorkerMap.put(sid, sw);

Then it kills the old thread for the peer (created from an earlier
connection):

if(vsw != null)
vsw.finish();

However, the SendWorker.finish() method removes an entry from
senderWorkerMap. This results in removing the reference to the
recently created SendWorker thread:
senderWorkerMap.remove(sid);


Thus, it will end up removing both entries. As a result, one thread
will be leaked for every other connection.
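
A minimal sketch of one way to avoid that double removal: make finish() remove the map entry only when it still points at this worker, so finishing the old SendWorker cannot drop the entry of its just-inserted replacement. This assumes senderWorkerMap is a ConcurrentHashMap and is an illustration of the idea, not the committed patch.

{noformat}
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch, not the actual ZOOKEEPER-880/822 fix.
class SendWorkerSketch {
    static final ConcurrentHashMap<Long, SendWorkerSketch> senderWorkerMap =
            new ConcurrentHashMap<Long, SendWorkerSketch>();

    final long sid;

    SendWorkerSketch(long sid) { this.sid = sid; }

    void finish() {
        // Atomic two-argument remove: deletes the entry only when the
        // current mapping is this instance, leaving a newer worker intact.
        senderWorkerMap.remove(sid, this);
    }

    public static void main(String[] args) {
        SendWorkerSketch oldWorker = new SendWorkerSketch(3);
        senderWorkerMap.put(3L, oldWorker);
        SendWorkerSketch newWorker = new SendWorkerSketch(3);
        senderWorkerMap.put(3L, newWorker);   // new connection replaces entry
        oldWorker.finish();                   // must not remove newWorker's entry
        System.out.println(senderWorkerMap.get(3L) == newWorker);  // true
    }
}
{noformat}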

If you count the number of error messages in
hbase-hadoop-zookeeper-sv4borg9.log, you will see that the messages from
RecvWorker are approximately twice those from SendWorker. I think this proves
the point.

$:/tmp/hadoop # grep RecvWorker  hbase-hadoop-zookeeper-sv4borg9.log | wc -l
60
$:/tmp/hadoop # grep SendWorker  hbase-hadoop-zookeeper-sv4borg9.log | wc -l
32

-Vishal




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-12 Thread Benoit Sigoure (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931457#action_12931457
 ] 

Benoit Sigoure commented on ZOOKEEPER-880:
--

Bumping up the severity.  This took down one of our clusters again.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-11-12 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931464#action_12931464
 ] 

Flavio Junqueira commented on ZOOKEEPER-880:


Benoit, just to clarify, is this also due to monitoring or scanning?




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-10-04 Thread Vishal K (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917685#action_12917685
 ] 

Vishal K commented on ZOOKEEPER-880:


While debugging https://issues.apache.org/jira/browse/ZOOKEEPER-822 I found 
that senderWorkerMap would not have an entry for a server, but there would be a 
RecvWorker and a SendWorker thread still running for that server. In my case, 
this was seen when the leader died (i.e., during leader election). However, I 
think this can happen whenever a peer disconnects from another peer. The cause 
was incorrect handling of the add/remove of entries in senderWorkerMap, which is 
exposed by race conditions in QuorumCnxManager. There is a patch available for 
ZOOKEEPER-822.

I am not sure if ZOOKEEPER-822 is causing trouble here as well. I just 
wanted to point out the possibility.





[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-30 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916544#action_12916544
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


I tried reproducing this with 5 servers on my laptop, using a check_tcp that I 
compiled. I can't get it to happen. However, I do notice a large number of 
connections (from Nagios to the election port) in the SYN_RECV state (some in 
CLOSE_WAIT as well).

JD - can you turn off Nagios, restart the affected server (clearing the log4j logs 
as well), then look to see if the problem is still occurring? Updating this JIRA 
with the logs from the bad server and at least one other would be helpful. 
Thanks!
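
For reference, a rough Java equivalent of what such a probe does to the election port: open a TCP connection, send nothing, close. Host, port and timing are placeholders, and this is a sketch of the probe pattern rather than a guaranteed reproducer.

{noformat}
import java.net.Socket;

// check_tcp-style probe against a quorum peer's election port. Each accepted
// connection makes the server spawn worker threads, and the immediate close
// then surfaces as a "Channel eof" IOException in RecvWorker.
public class ElectionPortProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";  // placeholder
        for (int i = 0; i < 100; i++) {
            Socket s = new Socket(host, 3888);   // election port from the config
            s.close();                           // no payload, just connectivity
            Thread.sleep(1000);                  // ~ a monitoring poll interval
        }
    }
}
{noformat}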




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-30 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916546#action_12916546
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


This issue sounds very similar to ZOOKEEPER-883 -- they are also using Nagios 
to monitor the election port in that case.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915634#action_12915634
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


bq. one server in the ensemble has a steadily growing number of 
QuorumCnxManager$SendWorker threads 

to be overly clear - this is happening on just 1 server, the other servers on 
the cluster are not seeing this, is that right?

Any insight on GC and JVM activity? Are there significant GC pauses, or 
perhaps swapping of that JVM? How active is the JVM? How active (CPU) are the 
other processes on this host? You mentioned it is using 50% on disk IO; what 
about CPU? 

Is there a way you could move the ZK datadir on that host to an unused spindle 
and see if that helps at all?

If I understood correctly the JVM hosting the ZK server is hosting other code 
as well, is that right? You mentioned something about hbase managing the ZK 
server, could you elaborate on that as well?






[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915832#action_12915832
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

bq. to be overly clear - this is happening on just 1 server, the other servers 
on the cluster are not seeing this, is that right?

Yes, sv4borg9.

bq. Any insight on GC and JVM activity? Are there significant GC pauses, 
or perhaps swapping of that JVM? How active is the JVM? How active (CPU) are 
the other processes on this host? You mentioned it is using 50% on disk IO; 
what about CPU?

No swapping; GC activity is normal as far as I can tell from the GC log; 1 active 
CPU for that process according to top (the rest of the CPUs are idle most of 
the time).  

bq. If I understood correctly the JVM hosting the ZK server is hosting other 
code as well, is that right? You mentioned something about hbase managing the 
ZK server, could you elaborate on that as well?

That machine is also the Namenode, JobTracker and HBase master (all in their 
own JVMs). The only thing special is that the quorum peers are started by HBase.

bq. Is there a way you could move the ZK datadir on that host to an unused 
spindle and see if that helps at all?

I'll look into that.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915838#action_12915838
 ] 

Patrick Hunt commented on ZOOKEEPER-880:


JD tried moving the data directory to another disk (new datadir); that didn't 
help, same problem. Also note: the snapshot file is ~2 MB in size.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915849#action_12915849
 ] 

Benjamin Reed commented on ZOOKEEPER-880:
-

is there an easy way to reproduce this?




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-28 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915850#action_12915850
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

bq. is there an easy way to reproduce this?

Unfortunately, none that I can see... we have 5 clusters that use the same hardware 
and ZK configurations, and we only see this issue on this cluster, on this 
specific node, although all the other nodes of that cluster show the same 
InterruptedExceptions (but aren't leaking SendWorkers).




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-27 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915553#action_12915553
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

Also, we checked dmesg, syslog, df, ifconfig, ethtool, and ifstat for anything 
unusual, and nothing obvious came out.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-27 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915559#action_12915559
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

Oh, but it looks like the process on sv4borg9 is generating a lot of IO (50% of 
one CPU on average, according to top). It seems to be mostly writes.




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-27 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915625#action_12915625
 ] 

Flavio Junqueira commented on ZOOKEEPER-880:


J-D, has it happened just once, or is it reproducible? Does it also happen with 
3.3?




[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

2010-09-27 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915627#action_12915627
 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
--

It's currently happening (now at 344 threads and growing; it was at 76 when I 
created this JIRA), it's still there when I bounce the peer, and it does happen 
on 3.3.1 too.
