Hi Mark,

From your logs, server 2 was leading and was followed only by server 1:

2012-05-09 01:07:25,523 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Leader@390] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Only 0 followers, need 1
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:390)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:367)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:658)
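
To make the "Only 0 followers, need 1" arithmetic concrete, here is a rough sketch in Python. It is not ZooKeeper's code; the function names are made up for illustration, and it assumes plain majority quorums where the leader counts toward its own quorum:

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def followers_needed(ensemble_size: int) -> int:
    """Followers a leader must keep, assuming the leader counts itself."""
    return majority(ensemble_size) - 1


def leader_keeps_quorum(ensemble_size: int, live_followers: int) -> bool:
    """True if the leader still has enough followers to stay leader."""
    return live_followers >= followers_needed(ensemble_size)


if __name__ == "__main__":
    # Three servers: a majority is 2, so the leader needs 1 follower.
    print(followers_needed(3))        # 1
    # Server 1 was the only follower; once it is gone, 0 < 1.
    print(leader_keeps_quorum(3, 0))  # False
    print(leader_keeps_quorum(3, 1))  # True
```

With three servers the leader needs one follower, so losing the only follower it had is enough to make it step down, even though two servers are still running.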

Consequently, when you shut down 1, the ensemble lost quorum. The sequence of 
notifications made 3 think that it was leading, but it was never established 
as leader because it didn't have a quorum supporting it. Eventually 3 gave up 
and started following 2:

2012-05-09 01:08:08,961 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
2012-05-09 01:08:09,163 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@643] - FOLLOWING

and 2 started leading:

2012-05-09 01:08:08,959 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), LOOKING (my state)
2012-05-09 01:08:08,959 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 3 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
2012-05-09 01:08:08,961 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 4294967297 (n.zxid), 2 (n.round), LOOKING (n.state), 3 (n.sid), LOOKING (my state)
2012-05-09 01:08:09,162 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:QuorumPeer@655] - LEADING
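
The notifications above show 3 first proposing itself (zxid 0) and then switching its vote to 2 (zxid 4294967297). A rough way to see why, as a Python sketch rather than ZooKeeper's actual code: it assumes both votes are in the same election round and that votes are ordered by zxid first, with the server id as the tie-breaker. The function names are hypothetical.

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter): high and low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def beats(new_vote: tuple[int, int], current_vote: tuple[int, int]) -> bool:
    """Votes are (zxid, server_id); the higher zxid wins, then the higher id.

    Assumes both votes belong to the same election round.
    """
    return new_vote > current_vote  # tuple comparison: zxid, then server id


if __name__ == "__main__":
    vote_for_2 = (4294967297, 2)  # from the "2 (n.leader)" notifications
    vote_for_3 = (0, 3)           # from the "3 (n.leader)" notification

    print(split_zxid(4294967297))       # (1, 1)
    print(beats(vote_for_2, vote_for_3))  # True
```

Under these assumptions, server 2's history (epoch 1, counter 1) outranks server 3's zxid of 0, so 3's higher server id does not help it and it ends up following 2.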

I'm not sure why it took so much time for the notifications to propagate, 
though.

-Flavio

On May 10, 2012, at 10:35 PM, Mark Gius wrote:

> I'm doing some testing around a Client being connected to a zookeeper
> endpoint that goes away and I'm seeing what appears to be a "settling"
> period that is causing some errors.
> 
> The test is as follows:
> 
> 1) Three zookeeper servers are started up on the same host, configured to
> cluster with each other.
> 2) A Client is created and attaches to Server 1 (using
> deterministic_conn_order flag to force this)
> 3) Shut down Server 1 (which is NOT the Leader)
> 4) Servers 2 and 3 still have quorum.  Interruption of service should be
> minimal.
> 5) The Client _should_ reconnect immediately to Server 2 or 3.
> 
> The behavior I am seeing in practice is that after shutting down Server 1
> quorum is lost and the Client takes on the order of 15-20 seconds to
> re-establish a connection to the cluster.  I do not see this behavior on a
> cluster that has existed for some time (say, 30-60 seconds).  I also do not
> see this problem on a cluster whose tickTime has been decreased to 100ms
> from the default of 2000ms.
> 
> Is there a settling period that occurs immediately after a Leader is
> elected such that quorate changes during that time cause a full leader
> election when one might not otherwise be necessary?  If so, where can I
> find information about how this settling period behaves?
> 
> I have uploaded the logs for each of the three zookeeper servers here:
> https://gist.github.com/2655709
> 
> Mark

