Yes, 965 seconds is huge.
The times I've seen such huge latencies are (in order of frequency seen):
1) when the Java process GCs, swaps, or both
2) when disk utilization on the ZK server is high
3) when running on under-provisioned virtual machines (e.g., VMware)
(any combination of the above can occur together)
Re 2), in some cases we've seen users running multiple IO-heavy
processes on the same host as the ZK server; HDFS DataNodes and HBase
RegionServers are two common examples. In those cases I've monitored
"sudo iostat -x 3", specifically the "%util" column, and when this is
happening I've seen heavy utilization concurrent with the max latency
spiking up. ZK servers must sync a file to disk on every write
operation, so maxed-out disk utilization (especially on the ext3
filesystem) will cause a spike in max latency as the server waits for
the sync to complete.
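As a rough illustration of that kind of monitoring, here's a small
Python sketch that pulls %util out of "iostat -x" output. The column
layout (device name first, %util last) follows sysstat's extended
stats, and the 90% threshold is an arbitrary assumption:

```python
# Sketch: flag devices with high %util in `iostat -x` output.
# Assumes sysstat-style extended output where %util is the last column.

UTIL_THRESHOLD = 90.0  # arbitrary cutoff for "heavy" utilization

def high_util_devices(iostat_output, threshold=UTIL_THRESHOLD):
    """Return (device, %util) pairs at or above the threshold."""
    hot = []
    in_table = False
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            in_table = False
            continue
        if fields[0] in ("Device:", "Device"):
            # Only treat this as the device table if %util is present.
            in_table = fields[-1] == "%util"
            continue
        if in_table:
            try:
                util = float(fields[-1])
            except ValueError:
                continue
            if util >= threshold:
                hot.append((fields[0], util))
    return hot
```

In practice you'd run something like "sudo iostat -x 3" and feed each
report through a check like this, watching for %util pinned near 100
at the same time the max latency spikes.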
I notice you have ~1 million writes and ~1 million reads according to
the current server stats, i.e., roughly 1 read for every write. That's
a relatively low read-to-write ratio for ZK, which performs best on
read-heavy workloads (not extremely low, but something to consider).
You might reset the latency using the "srst" 4 letter word
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
then monitor the max latency, the GC history, the swappiness of the
Java process, and especially the disk utilization. You can correlate a
jump in latency against these (GC, swap, %util, etc.).
useful tool: http://github.com/phunt/zktop
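If it helps to script this, the four-letter words are just plain text
over the client port. Here's a minimal Python sketch (host/port are
assumptions; adjust for your deployment) that sends a command and
parses the max latency out of the "stat" reply:

```python
# Sketch: send a ZooKeeper four-letter word over a raw TCP socket and
# parse max latency from the "stat" reply. Host/port are assumptions.
import socket

def four_letter_word(word, host="localhost", port=2181, timeout=5.0):
    """Send a 4lw command (e.g. 'stat', 'srst') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def max_latency(stat_reply):
    """Pull max latency (ms) from a 'Latency min/avg/max: 0/6/965759' line."""
    for line in stat_reply.splitlines():
        if line.startswith("Latency min/avg/max:"):
            return int(line.split(":", 1)[1].strip().split("/")[2])
    return None
```

e.g. max_latency(four_letter_word("stat")) on a loop, after a
four_letter_word("srst") to reset the counters, then watch for the
next spike.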
Given that you've ruled out GC (the most common cause), disk
utilization would be the next thing to check.
Patrick
On 06/15/2010 11:20 AM, Jordan Zimmerman wrote:
Yes - the session drop happened again. I did the stat. The max latency is huge
(I assume that's in ms).
Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
Clients:
/10.243.14.179:57300[1](queued=0,recved=0,sent=0)
/207.111.236.2:51493[1](queued=0,recved=1,sent=0)
/10.243.13.191:44414[1](queued=0,recved=0,sent=0)
/10.243.13.191:44415[1](queued=0,recved=0,sent=0)
/10.244.135.100:58965[1](queued=0,recved=0,sent=0)
/10.243.15.192:45057[1](queued=0,recved=0,sent=0)
Latency min/avg/max: 0/6/965759
Received: 2203048
Sent: 2202979
Outstanding: 0
Zxid: 0xef2b2
Mode: standalone
Node count: 347
On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:
Jordan,
Good step to get this info.
I have to ask, did you have your disconnect problem last night as well?
(just checking)
What does the stat command on ZK give you for each server?
On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman<
jzimmer...@proofpoint.com> wrote:
More on this...
I ran last night with verbose GC on our client. I analyzed the GC log in
gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
problem. NOTE we're running on Amazon EC2.
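For anyone wanting to do the same pause-time analysis without gchisto,
the numbers can also be pulled out of a -verbose:gc log with a few
lines of Python. This assumes the classic HotSpot line format of that
era, e.g. "[GC 325407K->83000K(776768K), 0.0317749 secs]":

```python
# Sketch: extract GC pause times (ms) from a -verbose:gc log.
# Assumes classic HotSpot "[GC ..., 0.0317749 secs]" lines.
import re

PAUSE_RE = re.compile(r"\[(?:Full )?GC .*?, ([0-9.]+) secs\]")

def gc_pauses_ms(log_text):
    """Return all GC pause durations in the log, in milliseconds."""
    return [float(m.group(1)) * 1000.0 for m in PAUSE_RE.finditer(log_text)]
```

A max() over the result gives the worst pause, which is the number to
compare against the max latency reported by the server.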