I always used a large node for ZK to avoid sharing the machine, but my
reason for doing so turned out to be wrong. My actual problem was GC on
the client side.
I find it hard to believe that they are seeing 50 second delays in EC2 due
to I/O contention. GC can cause pauses that long, but only on a large heap.
Massive swapping of code pages can also cause delays of this size.
My debug path here would be:
a) verify the facts. The key fact is that the ZK cluster is occasionally
giving massive latency. This must be verified to be the real problem and
not an accidental incident. It is possible that the problem is not where we
think it is.
b) check for the usual configuration suspects. ZK should be alone on a
machine. DNS should be checked. Connectivity should be checked between all
c) look for swapping, look at GC logs. Something there should give a clue
as to why the latency is 1000x longer than usual.
d) fix whatever turned up in step (b) or (c).
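
For step (a), one rough way to check whether the servers themselves stall is
to time ZooKeeper's "ruok" four-letter command repeatedly and flag samples far
above the median. The sketch below is only an illustration: the function
names, the sampling interval and the 1000x threshold are my assumptions, and
the host name is whatever your servers are called. Note that "ruok" does not
exercise the write path, so it will not show transaction log sync delays, and
on newer ZK releases the command may have to be enabled in the server
configuration.

```python
import socket
import statistics
import time


def probe_ruok(host, port=2181, timeout=60.0):
    """Send 'ruok' to a ZooKeeper server and return the round trip in seconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ruok")
        reply = sock.recv(16)
    elapsed = time.monotonic() - start
    if reply != b"imok":
        raise RuntimeError(f"unexpected reply from {host}:{port}: {reply!r}")
    return elapsed


def find_outliers(samples, factor=1000.0):
    """Return (index, latency) pairs whose latency exceeds factor * median."""
    if not samples:
        return []
    median = statistics.median(samples)
    return [(i, s) for i, s in enumerate(samples) if s > factor * median]


def collect(host, count=100, interval=1.0, port=2181):
    """Probe the server `count` times, sleeping `interval` seconds in between."""
    samples = []
    for _ in range(count):
        samples.append(probe_ruok(host, port))
        time.sleep(interval)
    return samples
```

Running collect() against each server for a while and passing the result to
find_outliers() would show whether the stalls appear at the server or only
from the client's point of view.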
Beyond this general advice, I am at a loss here. I strongly suspect that
something is being observed incorrectly or the machines are being massively
On Wed, Sep 2, 2009 at 12:37 PM, Patrick Hunt <ph...@apache.org> wrote:
> I suspect that given a single disk is being used (not a dedicated disk for
> the transaction log), and also given that this host is highly virtualized
> (ec2), it seems to me that the most likely cause is IO. Specifically when
> the zk cluster writes data to disk (due to client write) it must sync the
> transaction log to disk. This sync behavior can impact the latency seen by
> the clients. What type of ec2 node are you using? Ted, do you have any
> insight on this? Any guidelines for the type of ec2 node to use for running
> a zk cluster?
Ted Dunning, CTO