This may be related to the post "All nodes shutting down simultaneously", but I'm not sure. I was able to reproduce this quite a few times. I terminated the leader of a 5 node zookeeper cluster (terminated from the AWS console). All the remaining hosts saw the leader disappear:
Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out Then it looks like a shutdown of the ZooKeeperServer and its processors: Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting down I’m not sure if we should see startup logs or not. Then we see the leader election appear to be successful (I’m not sure if the “ZooKeeperServer not running” log is ominous or not): Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /zookeeper/data/version-2/snapshot.8005dc366 Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8 Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) ... Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2 Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 347 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-5-internal.prod.airtime.com to address: zookeeper-5-internal.prod.airtime.com/10.1.24.208 However, attempts to connect to the new leader just fail forever: Jan 20 18:40:06 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=0, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:07 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=1, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:08 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=2, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Jan 20 18:40:09 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888 Then it throws an exception and shuts down: Jan 20 18:40:10 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850) Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8 Jan 20 18:40:10 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) And then the fast leader election fails essentially forever: Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 Jan 20 18:40:11 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 Jan 20 18:40:13 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200 Jan 20 18:40:16 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400 Jan 20 18:40:22 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800 Jan 20 18:40:35 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600 Jan 20 18:41:01 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200 Jan 20 18:41:52 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com), it doesn’t seem like it ever recognizes that it was the leader of the original election. I’m not sure if it’s because it can’t start or something else. Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 5, proposed zxid=0x8005e12e8 Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state) Jan 20 18:40:07 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 A couple of additional notes: 1. Restarting zookeeper on all hosts seemed to resolve the issue. 2. This was easily reproducible for me in 3 different zookeeper clusters in AWS on Jan 20/21. Other things like killing follower hosts worked fine. 3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if anything network related could make this situation fail, while everything else worked fine.
