Ted, that's great feedback. After reading your comments I identified a couple of additional things to verify:

1) Ensure that you don't have debug-level logging turned on; see:
https://issues.apache.org/jira/browse/ZOOKEEPER-518
(fixed in 3.2.1, but in general you probably don't want to run anything lower than INFO in production, except when attempting to track down a problem).


2) It would be a good idea to review the server and client zk logs to see if they offer any insight into what might be causing the high latencies. For example, the other day we had an issue where misbehaving client code was degrading server performance; reviewing the logs allowed the developer to identify and address the client problem.
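
For point 1, a quick way to check a configuration file is sketched below. It assumes the servers and clients are configured through a log4j 1.x style properties file; the function name find_verbose_loggers is made up for illustration, and the check only looks at logger level assignments (not appender thresholds or levels set through property substitution).

```python
import re

# Matches log4j 1.x level assignments such as
#   log4j.rootLogger=DEBUG, CONSOLE
#   log4j.logger.org.apache.zookeeper=TRACE
LEVEL_LINE = re.compile(
    r"^\s*(log4j\.(?:rootLogger|logger\.[\w.$]+))\s*[=:]\s*(\w+)",
    re.IGNORECASE,
)

# Levels that are more verbose than INFO.
VERBOSE = {"DEBUG", "TRACE", "ALL"}


def find_verbose_loggers(properties_text):
    """Return (logger_key, level) pairs whose level is more verbose than INFO."""
    hits = []
    for line in properties_text.splitlines():
        if line.lstrip().startswith(("#", "!")):
            continue  # skip comment lines
        match = LEVEL_LINE.match(line)
        if match and match.group(2).upper() in VERBOSE:
            hits.append((match.group(1), match.group(2).upper()))
    return hits
```

Passing the contents of the properties file to find_verbose_loggers and getting back an empty list suggests that no logger is explicitly set below INFO.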

Patrick

Ted Dunning wrote:
I always used a large node for ZK to avoid sharing the machine, but the
reason for doing that turned out to be incorrect.  In fact, my problem was
to do with GC on the client side.

I can't believe that they are seeing 50 second delays in EC2 due to I/O
contention.  GC can do that, but only on a large heap.  Massive swapping of
code pages can also cause this.

My debug path here would be:

a) verify the facts.  The key fact is that the ZK cluster is occasionally
giving massive latency.  This must be verified to be the real problem and
not an accidental incident.  It is possible that the problem is not where we
think it is.

b) check for the usual configuration suspects.  ZK should be alone on a
machine.  DNS should be checked.  Connectivity should be checked between all
hosts.

c) look for swapping, look at GC logs.  Something has to give a clue as to
how the latency is 1000x longer than usual.

d) fix whatever turned up in step (b) or (c).
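
Steps (a) and (b) could be partly automated along the following lines. This is only a sketch: it assumes the servers answer the "ruok" and "stat" four-letter commands on the client port (2181 is assumed here), and that the "stat" output contains a "Latency min/avg/max" line; the helper names are made up for illustration. Newer releases may restrict which four-letter commands are enabled.

```python
import re
import socket


def four_letter_word(host, port, word, timeout=5.0):
    """Send a four-letter command (e.g. b"ruok") and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_output):
    """Extract (min, avg, max) from the 'Latency min/avg/max' line, or None."""
    m = re.search(r"Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", stat_output)
    return tuple(float(x) for x in m.groups()) if m else None


def probe(hosts, port=2181):
    """Check reachability and server-reported latency for each host."""
    results = {}
    for host in hosts:
        try:
            alive = four_letter_word(host, port, b"ruok").strip() == "imok"
            latency = parse_latency(four_letter_word(host, port, b"stat"))
            results[host] = {"alive": alive, "latency": latency, "error": None}
        except OSError as exc:  # DNS failure, refused connection, timeout
            results[host] = {"alive": False, "latency": None, "error": str(exc)}
    return results
```

Running probe() from every client and server host against every server gives a rough picture of DNS and connectivity problems, and the max latency figure helps confirm whether the servers themselves are seeing the delays.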

I am at a loss here other than this general advice.  I strongly suspect that
something is being observed incorrectly or the machines are being massively
abused.

On Wed, Sep 2, 2009 at 12:37 PM, Patrick Hunt <ph...@apache.org> wrote:

Given that a single disk is being used (not a dedicated disk for the
transaction log), and that this host is highly virtualized (ec2), I
suspect the most likely cause is IO. Specifically, when the zk cluster
writes data to disk (due to a client write) it must sync the
transaction log to disk. This sync behavior can impact the latency seen by
the clients. What type of ec2 node are you using? Ted, do you have any
insight on this? Any guidelines for the type of ec2 node to use for running
a zk cluster?
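
One rough way to see how expensive a sync is on a given disk is to time it directly. The sketch below is not how the server writes its transaction log; it just appends small records to a temporary file in a chosen directory and measures each fsync. The function name and the default record size and count are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def time_fsyncs(directory, count=50, payload_size=1024):
    """Append small records to a temp file in `directory`, fsync after each
    write, and return the per-fsync latencies in milliseconds."""
    payload = b"\0" * payload_size
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies_ms
```

Calling time_fsyncs() on the directory that holds the transaction log (dataDir, or dataLogDir if one is configured) and looking at max(latencies_ms) while the machine is under its normal load gives a first indication of whether sync times could account for the latency the clients see.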



