Re: Zookeeper outage recap questions

2010-07-01 Thread Flavio Junqueira
Hi Travis, do you think it would be possible for you to open a jira and
upload your logs?

Thanks,
-Flavio

On Jul 1, 2010, at 8:13 AM, Travis Crawford wrote:

 Hey zookeepers -

 We just experienced a total zookeeper outage, and here's a quick
 post-mortem of the issue, and some questions about preventing it going
 forward. Quick overview of the setup:

 - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java

 In chronological order, the issue manifested itself as an alert about RW
 tests failing. Logs were full of "too many open files" errors, and the
 output of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was
 100%. Application logs showed lots of connection timeouts. This
 suggests an event happened that caused applications to dogpile on
 Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
 to run out and basically game over.

 I looked through lots of logs (clients+servers) and did not see a
 clear indication of what happened. Graphs show a sudden decrease in
 network traffic when the outage began, zookeeper goes cpu bound, and
 runs out of file descriptors.

 Clients are primarily a couple thousand C clients using default
 connection parameters, and a couple thousand python clients using
 default connection parameters.

 Digging through Jira we see two issues that probably contributed to this
 outage:

     https://issues.apache.org/jira/browse/ZOOKEEPER-662
     https://issues.apache.org/jira/browse/ZOOKEEPER-517

 Both are tagged for the 3.4.0 release. Anyone know if that's still the
 case, and when 3.4.0 is roughly scheduled to ship?

 Thanks!
 Travis

flavio junqueira
research scientist
f...@yahoo-inc.com
direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300
fax (408) 349 3301

Re: Zookeeper outage recap questions

2010-07-01 Thread Patrick Hunt
Hi Travis, as Flavio suggested, it would be great to get the logs. A few 
questions:


1) how did you eventually recover, restart the zk servers?

2) was the cluster losing quorum during this time? leader re-election?

3) Any chance this could have been initially triggered by a long GC 
pause on one of the servers? (is gc logging turned on, any sort of heap 
monitoring?) Has the GC been tuned on the servers, for example CMS and 
incremental?


4) what are the clients using for timeout on the sessions?

3.4 probably not for a few months yet, but we are planning for a 3.3.2 
in a few weeks to fix a couple critical issues (which don't seem related 
to what you saw). If we can identify the problem here we should be able 
to include it in any fix release we do.


fixing something like 517 might help, but it's not clear how we got to 
this state in the first place. fixing 517 might not have any effect if 
the root cause is not addressed. 662 has only ever been reported once 
afaik, and we weren't able to identify the root cause for that one.


One thing we might also consider is modifying the zk client lib to 
backoff connection attempts if they keep failing (timing out say). Today 
the clients are pretty aggressive on reconnection attempts. Having some 
sort of backoff (exponential?) would provide more breathing room to the 
server in this situation.
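To illustrate the idea (this is not part of any ZooKeeper client library; the function name and parameters are made up for the sketch), capped exponential backoff with full jitter between reconnect attempts could look like:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=8, seed=None):
    """Yield reconnect delays that grow exponentially up to a cap.

    Full jitter (a uniform draw below the current ceiling) keeps
    thousands of clients from retrying in lockstep after a shared
    failure, which is exactly the dogpile seen in this outage.
    """
    rng = random.Random(seed)
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)
```

A client would sleep for each yielded delay between connection attempts and reset the sequence once a connection succeeds, so well-behaved clients stay fast while a struggling server gets breathing room.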


Patrick

On 06/30/2010 11:13 PM, Travis Crawford wrote:

Hey zookeepers -

We just experienced a total zookeeper outage, and here's a quick
post-mortem of the issue, and some questions about preventing it going
forward. Quick overview of the setup:

- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java
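For a setup note like the raised ulimit above, the effective descriptor limit can be sanity-checked from inside a process rather than trusting the shell it was launched from (Python stdlib; the `resource` module is Unix-only):

```python
import resource

# Effective per-process open-file limits, i.e. what "ulimit -n"
# reports for this process: (soft, hard) pair.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))
```

If the soft limit printed here is still 1024, the ulimit change didn't actually reach the daemon (a common gotcha with init scripts).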

In chronological order, the issue manifested itself as an alert about RW
tests failing. Logs were full of "too many open files" errors, and the output
of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was
100%. Application logs showed lots of connection timeouts. This
suggests an event happened that caused applications to dogpile on
Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
to run out and basically game over.
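Those netstat symptoms are easy to tally during an incident; as a stand-alone sketch (Linux-specific, reading /proc/net/tcp, the same source netstat uses), sockets can be grouped by TCP state like this:

```python
from collections import Counter

# TCP state codes from the Linux kernel (include/net/tcp_states.h),
# as they appear hex-encoded in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def socket_state_counts(proc_net_tcp_lines):
    """Count sockets by TCP state, given the lines of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_lines[1:]:  # skip the header line
        fields = line.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            for state, n in socket_state_counts(f.readlines()).most_common():
                print(state, n)
    except FileNotFoundError:
        pass  # /proc/net/tcp only exists on Linux
```

(Check /proc/net/tcp6 as well for IPv6 clients.) A sudden pile-up of CLOSE_WAIT here means the server-side application is not closing sockets the remote end already shut down, which is consistent with running out of file handles.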

I looked through lots of logs (clients+servers) and did not see a
clear indication of what happened. Graphs show a sudden decrease in
network traffic when the outage began, zookeeper goes cpu bound, and
runs out of file descriptors.

Clients are primarily a couple thousand C clients using default
connection parameters, and a couple thousand python clients using
default connection parameters.

Digging through Jira we see two issues that probably contributed to this outage:

 https://issues.apache.org/jira/browse/ZOOKEEPER-662
 https://issues.apache.org/jira/browse/ZOOKEEPER-517

Both are tagged for the 3.4.0 release. Anyone know if that's still the
case, and when 3.4.0 is roughly scheduled to ship?

Thanks!
Travis


Re: Zookeeper outage recap questions

2010-07-01 Thread Travis Crawford
I've moved this thread to:

https://issues.apache.org/jira/browse/ZOOKEEPER-801

--travis


On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt ph...@apache.org wrote:
 Hi Travis, as Flavio suggested, it would be great to get the logs. A few
 questions:

 1) how did you eventually recover, restart the zk servers?

 2) was the cluster losing quorum during this time? leader re-election?

 3) Any chance this could have been initially triggered by a long GC pause on
 one of the servers? (is gc logging turned on, any sort of heap monitoring?)
 Has the GC been tuned on the servers, for example CMS and incremental?

 4) what are the clients using for timeout on the sessions?

 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a
 few weeks to fix a couple critical issues (which don't seem related to what
 you saw). If we can identify the problem here we should be able to include
 it in any fix release we do.

 fixing something like 517 might help, but it's not clear how we got to this
 state in the first place. fixing 517 might not have any effect if the root
 cause is not addressed. 662 has only ever been reported once afaik, and we
 weren't able to identify the root cause for that one.

 One thing we might also consider is modifying the zk client lib to backoff
 connection attempts if they keep failing (timing out say). Today the clients
 are pretty aggressive on reconnection attempts. Having some sort of backoff
 (exponential?) would provide more breathing room to the server in this
 situation.

 Patrick

 On 06/30/2010 11:13 PM, Travis Crawford wrote:

 Hey zookeepers -

 We just experienced a total zookeeper outage, and here's a quick
 post-mortem of the issue, and some questions about preventing it going
 forward. Quick overview of the setup:

 - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java

 In chronological order, the issue manifested itself as an alert about RW
 tests failing. Logs were full of "too many open files" errors, and the output
 of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was
 100%. Application logs showed lots of connection timeouts. This
 suggests an event happened that caused applications to dogpile on
 Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
 to run out and basically game over.

 I looked through lots of logs (clients+servers) and did not see a
 clear indication of what happened. Graphs show a sudden decrease in
 network traffic when the outage began, zookeeper goes cpu bound, and
 runs out of file descriptors.

 Clients are primarily a couple thousand C clients using default
 connection parameters, and a couple thousand python clients using
 default connection parameters.

 Digging through Jira we see two issues that probably contributed to this
 outage:

     https://issues.apache.org/jira/browse/ZOOKEEPER-662
     https://issues.apache.org/jira/browse/ZOOKEEPER-517

 Both are tagged for the 3.4.0 release. Anyone know if that's still the
 case, and when 3.4.0 is roughly scheduled to ship?

 Thanks!
 Travis