Hi Travis, Do you think it would be possible for you to open a jira and upload your logs?

Thanks,
-Flavio

On Jul 1, 2010, at 8:13 AM, Travis Crawford wrote:

Hey zookeepers -

We just experienced a total zookeeper outage, and here's a quick
post-mortem of the issue, and some questions about preventing it going
forward. Quick overview of the setup:

- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java

In chronological order, the issue manifested itself as alert about RW
tests failing. Logs were full of too many files errors, and the output
of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was
100%. Application logs showed lots of connection timeouts. This
suggests an event happened that caused applications to dogpile on
Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
to run out and basically game over.

I looked through lots of logs (clients+servers) and did not see a
clear indication of what happened. Graphs show a sudden decrease in
network traffic when the outage began, zookeeper goes cpu bound, and
runs our of file descriptors.

Clients are primarily a couple thousand C clients using default
connection parameters, and a couple thousand python clients using
default connection parameters.

Digging through Jira we see two issues that probably contributed to this outage:

   https://issues.apache.org/jira/browse/ZOOKEEPER-662
   https://issues.apache.org/jira/browse/ZOOKEEPER-517

Both are tagged for the 3.4.0 release. Anyone know if that's still the
case, and when 3.4.0 is roughly scheduled to ship?

Thanks!
Travis

flavio
junqueira
 
research scientist
 
f...@yahoo-inc.com
direct +34 93-183-8828
 
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301


Reply via email to