[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884560#action_12884560 ] Patrick Hunt commented on ZOOKEEPER-801:

Travis, I see the address 10.209.44.182 *2140 times* in the netstat file you provided. Can you provide some insight on this client - what binding is this, C or Java? Python, Ruby... what? Also, if this client fails (say it segfaults or exits for whatever reason), is it automatically restarted by a supervisor?

Basically what I'm trying to figure out is why this client should be attempting so many connections to the server - the client libs we provide typically hold off a second or so between connection attempts so that we don't overload the server this way. However, if the client were being restarted, or if there was a bug in the client code where connections were established in a tight loop, or a bug in our client library, that could explain why we see so many TIME_WAIT sockets in the netstat file. (I've seen a couple of client-side bugs like this now, where the client code thought it had failed to get a connection and therefore created a new ZK client, which caused a DoS on the server - thousands of clients were created by each client process. That is one of the reasons we added the maxClientCnxns check.)

Zookeeper outage post-mortem
Key: ZOOKEEPER-801
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
Project: Zookeeper
Issue Type: Improvement
Affects Versions: 3.3.0
Environment:
- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java
Reporter: Travis Crawford
Attachments: client-errors.txt, zookeeper-out-of-files.netstat

[Moving a thread from the zookeeper-user list]

RECOVERY

We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers; however, they were getting hammered and I believe sessions were timing out. I shut down ~2k clients (a lightweight python app that simply sets one data watch and takes an action when it changes), at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe the cluster would ever have recovered without shedding load.

QUORUM/ELECTIONS DURING EVENT

Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out.

GC PAUSE/LOGGING

This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging and tune GC.

CLIENT SESSION TIMEOUTS

Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting:
2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=10000 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0

CLIENT BACKOFFS

Looking in application logs, we see lots of the following:

2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms)

Doing some simple aggregation we see 130 errors in a ~13 minute sample period. This behavior across thousands of clients effectively amounted to a DDoS of Zookeeper. Using exponential backoff as the default behavior seems appropriate looking at this data. Full text of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same.

GOING FORWARD

A long GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoff does seem like a good idea, and generally useful for most people. There will always be service interruptions, and backoff would be a great preventive measure for getting out of a dogpile situation.

Patrick's message:

Hi Travis, as Flavio suggested it would be great to get the logs. A few questions:
1) How did you eventually recover - restart the zk servers?
2) Was the cluster losing quorum during this time? Leader re-election?
3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (Is GC logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental?
4) What are the clients using for timeout on the sessions?

3.4 is probably not for a few months yet, but we are planning a 3.3.2 in a few weeks to fix a couple of critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. Fixing something like 517 might help, but it's not clear how we got to this state in the first place; fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one.
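[Editor's note: to illustrate the exponential-backoff idea raised above, here is a minimal sketch of a client-side reconnect loop with capped exponential backoff and jitter. It is not part of any ZooKeeper binding; connect_once() is a hypothetical placeholder for whatever session-establishment call a client library exposes, and the base/cap values are only examples.]

{noformat}
import random
import time

def reconnect_with_backoff(connect_once, base=1.0, cap=60.0, max_attempts=None):
    """Retry connect_once() with capped exponential backoff plus full jitter.

    connect_once is a hypothetical callable that raises on failure and
    returns a session object on success; base and cap are in seconds.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        try:
            return connect_once()
        except Exception:
            # Sleep a random interval in [0, min(cap, base * 2**attempt)]
            # so thousands of clients synchronized by the same outage or GC
            # pause do not retry in lockstep and dogpile the servers.
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
            attempt += 1
    raise RuntimeError("gave up reconnecting after %d attempts" % attempt)
{noformat}

The randomized (jittered) delay is the important part: pure exponential backoff without jitter still lets all clients that failed at the same moment retry at the same moments.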
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884579#action_12884579 ] Patrick Hunt commented on ZOOKEEPER-801:

Travis, one more thing:

{noformat}
egrep 10.209.44.182 zookeeper-out-of-files.netstat | egrep EST | wc -l
66
{noformat}

10.209.44.182 has 66 connections established to this server. Why so many? This looks like a potential problem with the client-side code - should it be opening more than one session concurrently?
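[Editor's note: the per-client counts Patrick pulls out with egrep generalize easily. The sketch below, assuming a plain netstat-style dump like the attached zookeeper-out-of-files.netstat (whitespace-separated lines of proto, recv-q, send-q, local address, foreign address, state), tallies TCP states per client address so outliers like 10.209.44.182 stand out at a glance. The field layout and the 2181 port filter are assumptions about the attachment's format.]

{noformat}
import collections
import sys

def count_states(netstat_path, server_port="2181"):
    """Tally TCP connection states per remote host for the ZK client port."""
    counts = collections.defaultdict(collections.Counter)
    with open(netstat_path) as f:
        for line in f:
            fields = line.split()
            # Skip headers and anything that is not a tcp/tcp6 line.
            if len(fields) < 6 or not fields[0].startswith("tcp"):
                continue
            local, remote, state = fields[3], fields[4], fields[5]
            if local.rsplit(":", 1)[-1] != server_port:
                continue
            host = remote.rsplit(":", 1)[0]
            counts[host][state] += 1
    return counts

if __name__ == "__main__":
    tallies = count_states(sys.argv[1])
    for host, states in sorted(tallies.items(),
                               key=lambda kv: -sum(kv[1].values())):
        print("%s %s" % (host, dict(states)))
{noformat}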
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884769#action_12884769 ] Travis Crawford commented on ZOOKEEPER-801:

@Patrick - So this morning while investigating I discovered that yes, the client you pointed out was indeed misbehaving. It was a new application that had a bug. I talked with a couple of teams and basically everyone is using a different Scala client. We've decided http://github.com/jcorwin/zookeeper-client is our best Scala client and are forking it from the main twitter account so everyone knows to use it. I looked around the docs and didn't see a place to list the Scala client; please let me know if we should list it somewhere.

Closing this issue as it's basically a client bug on our side. Thanks for the comments and help tracking this down.
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884779#action_12884779 ] Patrick Hunt commented on ZOOKEEPER-801:

Ah, that's great to hear. ;-) I would really appreciate it, though, if you (Travis) could enter a new JIRA. Basically I consider it a serious problem that the server failed in this situation. Obviously fixing ZOOKEEPER-517 would help here, but really we need to ensure that misbehaving clients cannot bring down the service - that's a serious bug. Please enter a new JIRA, and if you could describe what the problem was on the client side, that will help us craft a test that verifies the fix.

Btw, I did add a new wiki page to capture available bindings: http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings However, I encourage you (anyone) to submit the code back as a contrib. Regards.
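[Editor's note: the server-side guard Patrick refers to in his first comment is the maxClientCnxns setting, which caps concurrent connections from a single source IP. A sketch of how it might appear in zoo.cfg follows; all values here are illustrative, and the right limit depends on how many legitimate sessions each host is expected to hold.]

{noformat}
# zoo.cfg (illustrative values)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# Cap the number of concurrent connections a single client IP may hold,
# so one buggy process opening sessions in a tight loop cannot exhaust
# the server's file descriptors for everyone else.
maxClientCnxns=10
{noformat}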
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884775#action_12884775 ] Mahadev konar commented on ZOOKEEPER-801:

Travis, can you elaborate on the misbehaving client? What was its exact behavior? We just want to make sure whether it was indeed the misbehaving client, or whether we need to fix something on our side as well.
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884363#action_12884363 ] Mahadev konar commented on ZOOKEEPER-801:

The CLOSE_WAIT sockets would probably be caused by clients connecting to the servers and then disconnecting on their own; in that case the server takes some time closing the TCP connections. Any reason the clients would shuttle around the servers so frequently?
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884368#action_12884368 ] Travis Crawford commented on ZOOKEEPER-801:

I believe clients were reconnecting so frequently because their connections timed out.