Re: zookeeper on ec2

Patrick Hunt Thu, 03 Sep 2009 09:14:03 -0700

Ted, that's great feedback. I identified a couple of additional things to verify after reading your comments:


1) ensure that you don't have debug level logging turned on, see this:
https://issues.apache.org/jira/browse/ZOOKEEPER-518

(fixed in 3.2.1, but in general you probably don't want to run anything lower than info in production except when attempting to track down some problem).
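
One way to check this quickly is to scan the server's log4j configuration for loggers set below INFO. The following is only a rough sketch: it assumes the classic log4j 1.x properties syntax (`log4j.rootLogger=LEVEL, APPENDER`), and `find_verbose_loggers` is a hypothetical helper, not something shipped with ZooKeeper.

```python
# Sketch: flag log4j 1.x logger settings that are more verbose than INFO.
# Assumption: the file uses log4j.rootLogger / log4j.logger.<name> keys.
VERBOSE_LEVELS = {"ALL", "TRACE", "DEBUG"}


def find_verbose_loggers(properties_text):
    """Return (key, level) pairs for loggers configured below INFO."""
    hits = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key != "log4j.rootLogger" and not key.startswith("log4j.logger."):
            continue
        # The level is the first comma-separated token of the value.
        level = value.split(",")[0].strip().upper()
        if level in VERBOSE_LEVELS:
            hits.append((key, level))
    return hits
```

Reading the server's log4j.properties and passing its text to this function should return an empty list on a production host.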

2) it would be a good idea to review the server/client zk logs to see if there's any insight there as to what might be causing the high latencies. For example, the other day we had an issue where client code was misbehaving and degrading the performance of the server; reviewing the logs allowed the developer to identify the client problem and address it.
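
As a starting point for that review, something like the sketch below can show when warnings and errors cluster in time, which can then be lined up against the latency spikes. The line format is an assumption (timestamp, " - ", level, then a bracketed thread/class field); check it against the ConversionPattern in your own log4j.properties. `summarize_log` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed line shape (verify against your own configuration):
#   2009-09-03 09:14:03,123 - WARN  [thread:Class@42] - message
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} - "
    r"(?P<level>[A-Z]+)\s+\["
)


def summarize_log(lines, levels=("WARN", "ERROR", "FATAL")):
    """Count WARN-or-worse lines per (minute, level)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines that do not match (e.g. stack trace frames) are ignored.
        if match and match.group("level") in levels:
            counts[(match.group("minute"), match.group("level"))] += 1
    return counts
```

Passing an open log file to `summarize_log` and printing the sorted counts gives a per-minute view of where the trouble concentrates.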


Patrick

Ted Dunning wrote:

I always used a large node for ZK to avoid sharing the machine, but the
reason for doing that turned out to be incorrect.  In fact, my problem was
to do with GC on the client side.

I can't believe that they are seeing 50 second delays in EC2 due to I/O
contention.  GC can do that, but only on a large heap.  Massive swapping of
code pages can also cause this.
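
To rule swapping in or out, one rough approach on Linux is to sample the swap counters in /proc/vmstat twice and compare. The sketch below assumes a Linux host exposing the `pswpin` and `pswpout` counters; the function names are hypothetical. GC pauses need a separate look through the JVM's own GC logging (for example `-verbose:gc`).

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swap_delta(before, after):
    """Pages swapped in and out between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}


def sample_swap_activity(interval_s=5.0, path="/proc/vmstat"):
    """Linux only: take two snapshots interval_s apart and return the delta."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_delta(before, after)
```

Non-zero deltas during a latency spike would point toward swapping; zeros throughout would suggest looking elsewhere.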

My debug path here would be:

a) verify the facts.  The key fact is that the ZK cluster is occasionally
giving massive latency.  This must be verified to be the real problem and
not an accidental incident.  It is possible that the problem is not where we
think it is.

b) check for the usual configuration suspects.  ZK should be alone on a
machine.  DNS should be checked.  Connectivity should be checked between all
hosts.

c) look for swapping, look at GC logs.  Something has to give a clue as to
how the latency is 1000x longer than usual.

d) fix whatever turned up in step (b) or (c).
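
Steps (a) and (b) above can be partly scripted. The sketch below resolves each host through DNS, opens a TCP connection to the client port, and uses ZooKeeper's four-letter commands (`ruok`, `stat`) to read the server's own latency figures. It assumes `stat` prints a line of the form "Latency min/avg/max: a/b/c"; the helper names are hypothetical, and the host names and port used to call them are yours to supply.

```python
import re
import socket


def resolve(host):
    """Step (b): return the sorted set of addresses DNS gives for host."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. 'ruok', 'stat') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


LATENCY_RE = re.compile(r"Latency min/avg/max:\s*(-?\d+)/(-?\d+)/(-?\d+)")


def parse_latency(stat_output):
    """Step (a): pull (min, avg, max) latency out of 'stat' output, or None."""
    match = LATENCY_RE.search(stat_output)
    return tuple(int(group) for group in match.groups()) if match else None
```

Running `resolve(host)` and `parse_latency(four_letter_word(host, 2181, "stat"))` from every client and server host, against every ensemble member, covers DNS, connectivity, and the server-side view of latency in one pass. Port 2181 is the conventional client port, so adjust it if your configuration differs.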

I am at a loss here other than this general advice.  I strongly suspect that
something is being observed incorrectly or the machines are being massively
abused.

On Wed, Sep 2, 2009 at 12:37 PM, Patrick Hunt <ph...@apache.org> wrote:

Given that a single disk is being used (not a dedicated disk for the
transaction log), and that this host is highly virtualized (ec2), I suspect
the most likely cause is IO. Specifically, when
the zk cluster writes data to disk (due to client write) it must sync the
transaction log to disk. This sync behavior can impact the latency seen by
the clients. What type of ec2 node are you using? Ted, do you have any
insight on this? Any guidelines for the type of ec2 node to use for running
a zk cluster?
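
To get a feel for how expensive that sync is on a given volume, a small probe like the one below can help. It is only an approximation of what the server does (append a small record, then fsync), and `measure_fsync_latency` and `summarize` are hypothetical helpers. Point it at the directory that holds the transaction log, which is `dataDir` unless `dataLogDir` is set in zoo.cfg.

```python
import os
import tempfile
import time


def measure_fsync_latency(directory, samples=50, record_size=512):
    """Append small records, fsync each one, return latencies in milliseconds."""
    payload = b"x" * record_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write through to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)  # leave no probe file behind
    return latencies


def summarize(latencies):
    """Return min, median, and max of a non-empty list of latencies."""
    ordered = sorted(latencies)
    return {
        "min_ms": ordered[0],
        "median_ms": ordered[len(ordered) // 2],
        "max_ms": ordered[-1],
    }
```

A large gap between the median and the max, especially while other processes are writing to the same disk, would support the IO contention theory.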
