Hmm... so then it looks like the problem was that I needed to give server 3 a little more time to join the quorum before shutting down server 1, so that quorum is maintained throughout the test. I'll give that a shot. Thanks!
Mark

On Fri, May 11, 2012 at 4:25 AM, Flavio Junqueira <[email protected]> wrote:

> Hi Mark,
>
> From your logs, server 2 was leading and was followed only by
> server 1:
>
> 2012-05-09 01:07:25,523 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Leader@390] - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Only 0 followers, need 1
>     at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:390)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:367)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:658)
>
> Consequently when you shut down 1 the ensemble lost quorum. The sequence
> of notifications made 3 think that it was leading, but it didn't become
> established as a leader because it didn't have a quorum supporting.
> Eventually 3 gives up and starts following 2:
>
> 2012-05-09 01:08:08,961 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-05-09 01:08:09,163 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@643] - FOLLOWING
>
> and 2 leading:
>
> 2012-05-09 01:08:08,959 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), LOOKING (my state)
> 2012-05-09 01:08:08,959 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-05-09 01:08:08,961 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
> 2012-05-09 01:08:09,162 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2183:QuorumPeer@655] - LEADING
>
> I'm not sure why it took so much time for the notifications to propagate,
> though.
>
> -Flavio
>
> On May 10, 2012, at 10:35 PM, Mark Gius wrote:
>
> > I'm doing some testing around a Client being connected to a zookeeper
> > endpoint that goes away and I'm seeing what appears to be a "settling"
> > period that is causing some errors.
> >
> > The test is as follows:
> >
> > 1) Three zookeeper servers are started up on the same host, configured to
> > cluster with each other.
> > 2) A Client is created and attaches to Server 1 (using
> > deterministic_conn_order flag to force this)
> > 3) Shut down Server 1 (which is NOT the Leader)
> > 4) Servers 2 and 3 still have quorum. Interruption of service should be
> > minimal.
> > 5) The Client _should_ reconnect immediately to Server 2 or 3.
> >
> > The behavior I am seeing in practice is that after shutting down Server 1
> > quorum is lost and the Client takes on the order of 15-20 seconds to
> > re-establish a connection to the cluster. I do not see this behavior on a
> > cluster that has existed for some time (say, 30-60 seconds). I also do not
> > see this problem on a cluster whose tickTime has been decreased to 100ms
> > from the default of 2000ms.
> >
> > Is there a settling period that occurs immediately after a Leader is
> > elected such that quorate changes during that time cause a full leader
> > election when one might not otherwise be necessary? If so, where can I
> > find information about how this settling period behaves?
> >
> > I have uploaded the logs for each of the three zookeeper servers here:
> > https://gist.github.com/2655709
> >
> > Mark
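
For the fix discussed at the top of this thread (waiting for server 3 to join before stopping server 1), one possible approach is to have the test poll every server with the "srvr" four-letter command and proceed only once one server reports "Mode: leader" and the others report "Mode: follower". The following is a rough sketch of that idea, not something taken from the thread. The host, the client port for server 1 (2182), the 30-second deadline, and the helper names are all assumptions made for illustration; only ports 2183 and 2184 appear in the logs above.

```python
import socket
import time

# Assumed client endpoints for servers 1-3. Only 2183 and 2184 appear in the
# logs; 2182 for server 1 and the loopback host are guesses.
SERVERS = {
    1: ("127.0.0.1", 2182),
    2: ("127.0.0.1", 2183),
    3: ("127.0.0.1", 2184),
}


def parse_mode(response):
    """Return the value of the 'Mode:' line in a srvr/stat reply, or None."""
    for line in response.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port, timeout=1.0):
    """Send 'srvr' to one server and return its reported mode, or None."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        # Connection refused, timed out, or reset: treat as "not ready".
        return None
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))


def wait_for_ensemble(servers, deadline_s=30.0, poll_s=0.2):
    """Poll until exactly one server leads and every other server follows."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        modes = [query_mode(host, port) for host, port in servers.values()]
        leaders = modes.count("leader")
        followers = modes.count("follower")
        if leaders == 1 and followers == len(modes) - 1:
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    if wait_for_ensemble(SERVERS):
        print("all three servers have joined; safe to stop server 1")
    else:
        print("ensemble did not settle before the deadline")
```

A server that has not finished joining answers "srvr" with a message that has no "Mode:" line, so it is counted as not ready. On newer ZooKeeper releases the four-letter commands may need to be enabled in the server configuration before this works.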
