I've been using ZooKeeper 3.3.5, and the time needed to reconnect after a leader death seems to be close to 4 seconds. I think a large part of that is because the server nodes need to confirm the death of their leader through missed heartbeats.
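As a rough sketch of how the heartbeat-related windows scale with tickTime: the snippet below assumes initLimit=10 and syncLimit=5, as in the sample zoo.cfg shipped with ZooKeeper. The values actually used in this test are not stated in the thread, and the function name is made up for illustration. The session timeout bounds use the documented defaults of 2x and 20x tickTime.

```python
def zk_timing_windows_ms(tick_time_ms, init_limit=10, sync_limit=5):
    """Derive server-side timing windows (in ms) from tickTime.

    init_limit and sync_limit are ASSUMED values taken from the sample
    zoo.cfg; they are not given anywhere in this thread.
    """
    return {
        # Time followers are allowed to connect and sync to a leader.
        "init_window_ms": tick_time_ms * init_limit,
        # How far out of date a follower may fall before it is dropped.
        "sync_window_ms": tick_time_ms * sync_limit,
        # Default bounds the server applies to negotiated session timeouts.
        "min_session_timeout_ms": tick_time_ms * 2,
        "max_session_timeout_ms": tick_time_ms * 20,
    }


if __name__ == "__main__":
    # Compare the default tickTime with the reduced one from the test.
    for tick in (2000, 100):
        print(tick, zk_timing_windows_ms(tick))
```

Under those assumed limits, every window shrinks by a factor of 20 when tickTime drops from 2000 ms to 100 ms, which would be consistent with the delay no longer being noticeable at the lower setting.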

Best Regards,
Martin Kou

On Thu, May 10, 2012 at 1:35 PM, Mark Gius <[email protected]> wrote:

> I'm doing some testing around a Client being connected to a zookeeper
> endpoint that goes away and I'm seeing what appears to be a "settling"
> period that is causing some errors.
>
> The test is as follows:
>
> 1) Three zookeeper servers are started up on the same host, configured to
> cluster with each other.
> 2) A Client is created and attaches to Server 1 (using
> deterministic_conn_order flag to force this)
> 3) Shut down Server 1 (which is NOT the Leader)
> 4) Servers 2 and 3 still have quorum. Interruption of service should be
> minimal.
> 5) The Client _should_ reconnect immediately to Server 2 or 3.
>
> The behavior I am seeing in practice is that after shutting down Server 1
> quorum is lost and the Client takes on the order of 15-20 seconds to
> re-establish a connection to the cluster. I do not see this behavior on a
> cluster that has existed for some time (say, 30-60 seconds). I also do not
> see this problem on a cluster whose tickTime has been decreased to 100ms
> from the default of 2000ms.
>
> Is there a settling period that occurs immediately after a Leader is
> elected such that quorate changes during that time cause a full leader
> election when one might not otherwise be necessary? If so, where can I
> find information about how this settling period behaves?
>
> I have uploaded the logs for each of the three zookeeper servers here:
> https://gist.github.com/2655709
>
> Mark
