Perhaps you want to upload your logs somewhere for us to take a look. There have been issues around name resolution, like this one:

https://issues.apache.org/jira/browse/ZOOKEEPER-1506

Is there any chance it could be related to it?
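If it helps to rule that out, one quick check while a follower is stuck in that state is to compare what the leader's hostname resolves to right now against the address the follower keeps retrying. A minimal standalone sketch (the class name and the default hostname, taken from the logs below, are only placeholders):

    // ResolveCheck: hypothetical standalone helper, not part of ZooKeeper itself.
    // Prints every address the given hostname currently resolves to, so it can be
    // compared with the address shown in the follower's
    // "connecting to <host>/<ip>:2888" warnings.
    import java.net.InetAddress;

    public class ResolveCheck {
        public static void main(String[] args) throws Exception {
            // Default hostname is the one from the log excerpt below; pass your own as an argument.
            String host = args.length > 0 ? args[0] : "zookeeper-5-internal.prod.airtime.com";
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                System.out.println(host + " -> " + addr.getHostAddress());
            }
        }
    }

If the address printed differs from the one in the retry warnings, the follower is most likely reusing a resolution done before the leader instance was replaced, which is the situation that ticket describes.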
-Flavio

> On 26 Jan 2016, at 22:19, Jim Wong <[email protected]> wrote:
>
> Folks,
>
> Does anyone have any thoughts on this one? We can semi-reliably cause our
> cluster to get into a seemingly permanent bad state--at least several minutes,
> which is longer than we can tolerate in production--just by terminating the
> leader. We’ve seen recommendations to change things like the amount of time
> allotted for the database to be reconstructed from the last snapshot, but
> none of them have helped.
>
> If nothing else, are there any particular log messages we should be looking
> for or troubleshooting strategies we can pursue?
>
>
>
>> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <[email protected]> wrote:
>>
>> This may be related to the post "All nodes shutting down simultaneously",
>> but I'm not sure. I was able to reproduce this quite a few times. I
>> terminated the leader of a 5 node zookeeper cluster (terminated from the
>> AWS console). All the remaining hosts saw the leader disappear:
>>
>> Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
>>
>> Then it looks like a shutdown of the ZooKeeperServer and its processors:
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting down
>>
>> I’m not sure if we should see startup logs or not. Then we see the leader
>> election appear to be successful (I’m not sure if the “ZooKeeperServer not
>> running” log is ominous or not):
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /zookeeper/data/version-2/snapshot.8005dc366
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> ...
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 347
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-5-internal.prod.airtime.com to address: zookeeper-5-internal.prod.airtime.com/10.1.24.208
>>
>> However, attempts to connect to the new leader just fail forever:
>>
>> Jan 20 18:40:06 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=0, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:07 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=1, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:08 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=2, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:09 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>>
>> Then it throws an exception and shuts down:
>>
>> Jan 20 18:40:10 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
>> Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>> Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:10 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>>
>> And then the fast leader election fails essentially forever:
>>
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
>> Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
>> Jan 20 18:40:11 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
>> Jan 20 18:40:13 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
>> Jan 20 18:40:16 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
>> Jan 20 18:40:22 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
>> Jan 20 18:40:35 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
>> Jan 20 18:41:01 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
>> Jan 20 18:41:52 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
>>
>> Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
>> it doesn’t seem like it ever recognizes that it was the leader of the
>> original election. I’m not sure if it’s because it can’t start or something
>> else.
>>
>> Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 5, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
>> Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:07 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
>>
>> A couple of additional notes:
>>
>> 1. Restarting zookeeper on all hosts seemed to resolve the issue.
>> 2. This was easily reproducible for me in 3 different zookeeper clusters
>> in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
>> 3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
>> anything network related could make this situation fail, while everything
>> else worked fine.
>
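One more thing that might be worth checking the next time a follower is stuck in that "Unexpected exception, tries=N, connecting to ...:2888" loop: whether the elected peer is accepting connections on its quorum port at all, to separate "nothing is listening there" (connection refused, as in Ken's logs) from a network or security-group drop (connect timeout). A minimal sketch; the class name is made up, and the default host and port are just placeholders taken from the excerpt above:

    // QuorumPortProbe: hypothetical standalone helper, not part of ZooKeeper itself.
    // Attempts a plain TCP connection to the quorum port a follower keeps retrying.
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class QuorumPortProbe {
        public static void main(String[] args) {
            // Placeholder defaults from the log excerpt; pass host and port as arguments instead.
            String host = args.length > 0 ? args[0] : "zookeeper-5-internal.prod.airtime.com";
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 2888;
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 3000); // 3 s connect timeout
                System.out.println("connected to " + host + ":" + port);
            } catch (Exception e) {
                System.out.println("failed to reach " + host + ":" + port + ": " + e);
            }
        }
    }

An immediate refusal there, combined with the elected peer never logging that it moved to the LEADING state, would point at that peer failing to start its leader side rather than at the followers.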
