Yes, 965 seconds is huge.
The times I've seen such huge latencies are (in order of frequency seen):
1) when the Java process GCs, swaps, or both
2) when disk utilization on the ZK server is high
3) when running on under-provisioned virtual machines (e.g., VMware)
(any combination of the above can occur together)
Re 2), in some cases we've seen users running multiple IO-heavy
processes on the same host as the ZK server; HDFS DataNodes and HBase
RegionServers are two common examples. In those cases I've monitored
"sudo iostat -x 3", specifically the "%util" column, and when this is
happening I've seen heavy utilization concurrent with the max latency
spiking up. ZK servers must sync a file to disk on every write
operation, so maxed-out disk utilization (especially on the ext3
filesystem) will cause a spike in max latency as the server waits for
the sync to complete.
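As a rough illustration of that kind of monitoring, here's a small
Python sketch that pulls %util out of "iostat -x" output. The column
layout (device name first, %util last) follows sysstat's extended
stats, and the 90% threshold is an arbitrary assumption:

```python
# Sketch: flag devices with high %util in `iostat -x` output.
# Assumes sysstat-style extended output where %util is the last column.

UTIL_THRESHOLD = 90.0  # arbitrary cutoff for "heavy" utilization

def high_util_devices(iostat_output, threshold=UTIL_THRESHOLD):
    """Return (device, %util) pairs at or above the threshold."""
    hot = []
    in_table = False
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            in_table = False
            continue
        if fields[0] in ("Device:", "Device"):
            # Only treat this as the device table if %util is present.
            in_table = fields[-1] == "%util"
            continue
        if in_table:
            try:
                util = float(fields[-1])
            except ValueError:
                continue
            if util >= threshold:
                hot.append((fields[0], util))
    return hot
```

In practice you'd run something like "sudo iostat -x 3" and feed each
report through a check like this, watching for %util pinned near 100
at the same time the max latency spikes.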
I notice you have ~1 million writes and ~1 million reads according to
the current server stats, i.e., roughly 1 read for every write. That's
a relatively low read-to-write ratio for ZK, which performs best on
read-heavy workloads (not extremely low, but something to consider).
You might reset the latency using the "srst" 4 letter word
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
then monitor the max latency, the GC history, the swappiness of the
Java process, and especially the disk utilization. You can correlate a
jump in latency against these (GC, swap, %util, etc.).
useful tool: http://github.com/phunt/zktop
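If it helps to script this, the four-letter words are just plain text
over the client port. Here's a minimal Python sketch (host/port are
assumptions; adjust for your deployment) that sends a command and
parses the max latency out of the "stat" reply:

```python
# Sketch: send a ZooKeeper four-letter word over a raw TCP socket and
# parse max latency from the "stat" reply. Host/port are assumptions.
import socket

def four_letter_word(word, host="localhost", port=2181, timeout=5.0):
    """Send a 4lw command (e.g. 'stat', 'srst') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def max_latency(stat_reply):
    """Pull max latency (ms) from a 'Latency min/avg/max: 0/6/965759' line."""
    for line in stat_reply.splitlines():
        if line.startswith("Latency min/avg/max:"):
            return int(line.split(":", 1)[1].strip().split("/")[2])
    return None
```

e.g. max_latency(four_letter_word("stat")) on a loop, after a
four_letter_word("srst") to reset the counters, then watch for the
next spike.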
Given that you've ruled out GC (the most common cause), disk
utilization would be the next thing to check.
Patrick
On 06/15/2010 11:20 AM, Jordan Zimmerman wrote:
Yes - the session drop happened again. I did the stat. The max latency is huge
(I assume that's in ms).
Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
Clients:
/10.243.14.179:57300[1](queued=0,recved=0,sent=0)
/207.111.236.2:51493[1](queued=0,recved=1,sent=0)
/10.243.13.191:44414[1](queued=0,recved=0,sent=0)
/10.243.13.191:44415[1](queued=0,recved=0,sent=0)
/10.244.135.100:58965[1](queued=0,recved=0,sent=0)
/10.243.15.192:45057[1](queued=0,recved=0,sent=0)
Latency min/avg/max: 0/6/965759
Received: 2203048
Sent: 2202979
Outstanding: 0
Zxid: 0xef2b2
Mode: standalone
Node count: 347
On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:
Jordan,
Good step to get this info.
I have to ask, did you have your disconnect problem last night as well?
(just checking)
What does the stat command on ZK give you for each server?
On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman<
jzimmer...@proofpoint.com> wrote:
More on this...
I ran last night with verbose GC on our client. I analyzed the GC log in
gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
problem. NOTE we're running on Amazon EC2.
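For anyone wanting to do the same pause-time analysis without gchisto,
the numbers can also be pulled out of a -verbose:gc log with a few
lines of Python. This assumes the classic HotSpot line format of that
era, e.g. "[GC 325407K->83000K(776768K), 0.0317749 secs]":

```python
# Sketch: extract GC pause times (ms) from a -verbose:gc log.
# Assumes classic HotSpot "[GC ..., 0.0317749 secs]" lines.
import re

PAUSE_RE = re.compile(r"\[(?:Full )?GC .*?, ([0-9.]+) secs\]")

def gc_pauses_ms(log_text):
    """Return all GC pause durations in the log, in milliseconds."""
    return [float(m.group(1)) * 1000.0 for m in PAUSE_RE.finditer(log_text)]
```

A max() over the result gives the worst pause, which is the number to
compare against the max latency reported by the server.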