Perhaps you want to upload your logs somewhere for us to take a look. There have been issues around name resolution, like this one:

https://issues.apache.org/jira/browse/ZOOKEEPER-1506

Is there any chance it could be related to it?
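If it helps to rule that out, one quick check while a follower is stuck in that state is to compare what the leader's hostname resolves to right now against the address the follower keeps retrying. A minimal standalone sketch (the class name and the default hostname, taken from the logs below, are only placeholders):

    // ResolveCheck: hypothetical standalone helper, not part of ZooKeeper itself.
    // Prints every address the given hostname currently resolves to, so it can be
    // compared with the address shown in the follower's
    // "connecting to <host>/<ip>:2888" warnings.
    import java.net.InetAddress;

    public class ResolveCheck {
        public static void main(String[] args) throws Exception {
            // Default hostname is the one from the log excerpt below; pass your own as an argument.
            String host = args.length > 0 ? args[0] : "zookeeper-5-internal.prod.airtime.com";
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                System.out.println(host + " -> " + addr.getHostAddress());
            }
        }
    }

If the address printed differs from the one in the retry warnings, the follower is most likely reusing a resolution done before the leader instance was replaced, which is the situation that ticket describes.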
-Flavio

> On 26 Jan 2016, at 22:19, Jim Wong <[email protected]> wrote:
>
> Folks,
>
> Does anyone have any thoughts on this one? We can semi-reliably cause our
> cluster to get into a seemingly permanent bad state--at least several minutes,
> which is longer than we can tolerate in production--just by terminating the
> leader. We’ve seen recommendations to change things like the amount of time
> allotted for the database to be reconstructed from the last snapshot, but
> none of them have helped.
>
> If nothing else, are there any particular log messages we should be looking
> for or troubleshooting strategies we can pursue?
>
>
>
>> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <[email protected]> wrote:
>>
>> This may be related to the post "All nodes shutting down simultaneously",
>> but I'm not sure. I was able to reproduce this quite a few times. I
>> terminated the leader of a 5 node zookeeper cluster (terminated from the
>> AWS console). All the remaining hosts saw the leader disappear:
>>
>> Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
>>
>> Then it looks like a shutdown of the ZooKeeperServer and its processors:
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting down
>>
>> I’m not sure if we should see startup logs or not. Then we see the leader
>> election appear to be successful (I’m not sure if the “ZooKeeperServer not
>> running” log is ominous or not):
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /zookeeper/data/version-2/snapshot.8005dc366
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> ...
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 347
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-5-internal.prod.airtime.com to address: zookeeper-5-internal.prod.airtime.com/10.1.24.208
>>
>> However, attempts to connect to the new leader just fail forever:
>>
>> Jan 20 18:40:06 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=0, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:07 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=1, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:08 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=2, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:09 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>>
>> Then it throws an exception and shuts down:
>>
>> Jan 20 18:40:10 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
>> Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>> Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:10 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>>
>> And then the fast leader election fails essentially forever:
>>
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
>> Jan 20 18:40:11 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
>> Jan 20 18:40:13 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
>> Jan 20 18:40:16 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
>> Jan 20 18:40:22 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
>> Jan 20 18:40:35 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
>> Jan 20 18:41:01 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
>> Jan 20 18:41:52 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
>>
>> Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
>> it doesn’t seem like it ever recognizes that it was the leader of the
>> original election. I’m not sure if it’s because it can’t start or something
>> else.
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 5, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:07 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
>>
>> A couple of additional notes:
>>
>> 1. Restarting zookeeper on all hosts seemed to resolve the issue.
>> 2. This was easily reproducible for me in 3 different zookeeper clusters
>> in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
>> 3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
>> anything network related could make this situation fail, while everything
>> else worked fine.
>
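One more thing that might be worth checking the next time a follower is stuck in that "Unexpected exception, tries=N, connecting to ...:2888" loop: whether the elected peer is accepting connections on its quorum port at all, to separate "nothing is listening there" (connection refused, as in Ken's logs) from a network or security-group drop (connect timeout). A minimal sketch; the class name is made up, and the default host and port are just placeholders taken from the excerpt above:

    // QuorumPortProbe: hypothetical standalone helper, not part of ZooKeeper itself.
    // Attempts a plain TCP connection to the quorum port a follower keeps retrying.
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class QuorumPortProbe {
        public static void main(String[] args) {
            // Placeholder defaults from the log excerpt; pass host and port as arguments instead.
            String host = args.length > 0 ? args[0] : "zookeeper-5-internal.prod.airtime.com";
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 2888;
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 3000); // 3 s connect timeout
                System.out.println("connected to " + host + ":" + port);
            } catch (Exception e) {
                System.out.println("failed to reach " + host + ":" + port + ": " + e);
            }
        }
    }

An immediate refusal there, combined with the elected peer never logging that it moved to the LEADING state, would point at that peer failing to start its leader side rather than at the followers.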
