[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: zookeeper-out-of-files.netstat

The attached netstat output shows a large number of connections in the 
SYN_RECV and CLOSE_WAIT states. The fast connection churn appears to have 
consumed all available file handles.
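
For reference, a minimal sketch of the kind of per-state aggregation used to 
spot the SYN_RECV/CLOSE_WAIT pile-up. It assumes the attachment holds standard 
netstat output with the TCP state in the last column:

# Count TCP connection states in the attached netstat capture.
# Assumes `netstat -ant`-style lines with the state as the last field.
from collections import Counter

states = Counter()
with open("zookeeper-out-of-files.netstat") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1

for state, count in states.most_common():
    print(state, count)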

> Zookeeper outage post-mortem
> 
>
> Key: ZOOKEEPER-801
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
> Project: Zookeeper
>  Issue Type: Improvement
>Affects Versions: 3.3.0
> Environment: - RHEL5 2.6.18 kernel
> - Zookeeper 3.3.0
> - ulimit raised to 65k files
> - 3 cluster members
> - 4-5k connections in steady-state
> - Primarily C and python clients, plus some java
>Reporter: Travis Crawford
> Attachments: client-errors.txt, zookeeper-out-of-files.netstat
>
>
> [Moving a thread from the zookeeper-user]
> RECOVERY
> We eventually recovered from this situation by shutting down clients. 
> Initially I tried restarting the Zookeepers; however, they were getting 
> hammered and I believe sessions were timing out. I shut down ~2k clients (a 
> lightweight python app that simply sets one data watch and takes an action 
> when it changes), at which point zktop could make a connection and a leader 
> election was verified. After resetting latency stats the numbers were very 
> good. I do not believe it would have ever recovered without shedding load.
> QUORUM/ELECTIONS DURING EVENT
> Unfortunately I do not have logs from the event :( We had debug logging on, 
> with logrotate configured to keep ten 100MB files, and the interesting parts 
> rotated away. I have already switched to info logging so we don't lose this 
> data again.
> During the incident I was not able to view cluster status with zktop, and 
> never observed a successful operation beyond connections, which quickly timed 
> out.
> GC PAUSE/LOGGING
> This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
> the defaults are in the start scripts. GC logging is not enabled either. I 
> filed an internal bug against myself to enable logging, and tune GC.
> CLIENT SESSION TIMEOUTS
> Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
> 10 seconds, based on this log entry from the initial connection (see the 
> sketch after this quoted thread for setting it explicitly).
> 2010-07-01 05:14:00,260:44267(0x2af330240110):ZOO_INFO@zookeeper_init@727: 
> Initiating client connection, 
> host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
> sessionTimeout=10000 watcher=(nil) sessionId=0 sessionPasswd=<null> 
> context=(nil) flags=0
> CLIENT BACKOFFS
> Looking in application logs, we see lots of the following:
> 2010-07-01 
> 05:13:14,674:41491(0x41ebf940):ZOO_ERROR@handle_socket_error_msg@1528: Socket 
> [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
> connection timed out (exceeded timeout by 0ms)
> Doing some simple aggregation we see 130 errors in a ~13 minute sample 
> period. Multiplied across thousands of clients, this reconnect behavior 
> effectively amounted to a DDoS attack against Zookeeper. Making exponential 
> backoff the default behavior seems appropriate looking at this data.
> Fulltext of the client errors is attached. I grepped "errno" from a Python 
> client log; I believe it uses the same underlying C library, so I did not 
> include example output from a C program (though I can if needed). It looks 
> basically the same.
> GOING FORWARD
> The long GC pause causing clients to dogpile sounds like the most plausible 
> explanation at this time. GC logging/tuning is clearly where I dropped the 
> ball by just using the defaults; I don't think any Zookeeper changes should 
> be made to address that lack of tuning.
> Exponential backoff does seem like a good idea, and generally useful for 
> most people. There will always be service interruptions, and backoff would 
> be a great preventive measure for getting out of a dogpile situation.
> Patrick's message:
> """
> Hi Travis, as Flavio suggested would be great to get the logs. A few 
> questions:
> 1) how did you eventually recover, restart the zk servers?
> 2) was the cluster losing quorum during this time? leader re-election?
> 3) Any chance this could have been initially triggered by a long GC pause on 
> one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
> Has the GC been tuned on the servers, for example CMS and incremental?
> 4) what are the clients using for timeout on the sessions?
> 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
> few weeks to fix a couple critical issues (which don't seem related to what 
> you saw). If we can identify the problem here we should be able to include it 
> in any fix release we do.
> fixing something like 517 might help, but it's not clear how we got to this 
> state in the first place. fixing 517 might not have any effect if t
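
On the CLIENT SESSION TIMEOUTS point quoted above, a minimal sketch of setting 
the session timeout explicitly rather than relying on the default. It assumes 
the zkpython binding bundled with the 3.3 release, where init() takes the host 
string, a watcher callable, and a receive timeout in milliseconds; treat the 
exact signature as an assumption and check it against your build:

import zookeeper

HOSTS = "10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181"

def watcher(handle, event_type, state, path):
    # Invoked for session events and data watches; a real client would
    # re-register its watch and take action here.
    pass

# Request a 30 second session timeout (in milliseconds) instead of taking the
# 10000 ms default reflected in the log line quoted above.
handle = zookeeper.init(HOSTS, watcher, 30000)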

[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: client-errors.txt

client-errors.txt contains the zookeeper error messages during an outage. The 
interesting part is how frequently they occur. This suggests that having 
clients back off during issues would be an improvement, since this many 
connections arriving so quickly makes things worse.
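
For what it's worth, a minimal sketch of the aggregation behind the frequency 
claim. It assumes client-errors.txt keeps the C client's timestamp prefix 
shown in the quoted log lines, so each record starts with 
"YYYY-MM-DD HH:MM:SS,mmm":

# Bucket the errno lines in client-errors.txt by minute to show how tightly
# the reconnect attempts cluster.
from collections import Counter

per_minute = Counter()
with open("client-errors.txt") as f:
    for line in f:
        if "errno" in line:
            per_minute[line[:16]] += 1  # "2010-07-01 05:13" -> minute bucket

for minute, count in sorted(per_minute.items()):
    print(minute, count)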

Of course, there's a tradeoff between reconnecting quickly and backing off. 
For this reason, perhaps exponential backoff should be the default behavior, 
while clients that require fast reconnections could optionally use the current 
mode.
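
As a concrete illustration of the proposal, a minimal sketch of capped 
exponential backoff with jitter around a reconnect attempt. The 
connect_once() callable is hypothetical and stands in for however a given 
client establishes its session today; it is not part of any existing 
Zookeeper API:

import random
import time

def reconnect_with_backoff(connect_once, base=0.5, cap=60.0):
    # Retry the hypothetical connect_once() with capped exponential backoff
    # plus jitter. connect_once() should return a session handle on success
    # and raise an exception on failure.
    delay = base
    while True:
        try:
            return connect_once()
        except Exception:
            # Sleep a random amount up to the current delay so thousands of
            # clients do not retry in lockstep, then double the delay.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, cap)

A capped delay keeps worst-case reconnect latency bounded while still 
spreading reconnect load after an outage.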
