Thanks Raul. I haven't been able to repro the issue recently, but I'll get that info when/if I am able to do so.
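For reference, this is roughly what I plan to capture if it does happen again, per your suggestions. It's only a sketch: the interface name, the pgrep pattern, and the 3888 election port are assumptions from the defaults (2888 is the leader port that shows up in the logs below), so it will need adjusting to match our config.

  # ZooKeeper server pid (QuorumPeerMain is the server's main class)
  ZK_PID=$(pgrep -f QuorumPeerMain)

  # Thread dump of the stuck peer
  jstack "$ZK_PID" > /tmp/zk-jstack-$(hostname).txt

  # Raw capture of election (3888) and ZAB/leader (2888) traffic on each peer.
  # zktraffic (https://github.com/twitter/zktraffic) can decode this kind of
  # traffic with the FLE/ZAB dump tools it ships; see its README for exact flags.
  sudo tcpdump -i eth0 -w /tmp/zk-quorum-$(hostname).pcap 'port 2888 or port 3888'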
On Wed, Jan 27, 2016 at 5:39 PM, Raúl Gutiérrez Segalés <[email protected]> wrote:

> Can you see traffic on the election and ZAB ports?
>
> You can sniff these messages via https://github.com/twitter/zktraffic.
>
> Maybe jstack the processes and post that as well?
>
> -rgs
>
> On Jan 27, 2016 10:45 AM, "Ken Mamitsuka" <[email protected]> wrote:
>
> > Thanks Flavio, I uploaded the log files to:
> >
> > https://drive.google.com/folderview?id=0B_aqYHmVv9E-LUZJYnlJRDYyNjg&usp=sharing
> >
> > log_follower - log from one of the followers
> > log_new_leader - log from the elected leader (that everyone fails to connect to)
> >
> > The issue was triggered when we killed the original leader at:
> >
> > Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> >
> > and recovered after I restarted all the ZooKeeper servers (monit restart zookeeper-server, which just wraps zkServer.sh):
> >
> > Jan 20 18:52:50 zookeeper-2 ansible-service: Invoked with name=zookeeper-server pattern=None enabled=None state=restarted sleep=None arguments= runlevel=default
> >
> > I don't believe this is related to ZOOKEEPER-1506, because nothing in DNS changed; we just killed the leader (and the resolved IP address for zookeeper-5.eng.airtime.com is valid).
> >
> > Thanks again for looking at this.
> >
> > On Wed, Jan 27, 2016 at 7:52 AM, Flavio Junqueira <[email protected]> wrote:
> >
> > > Perhaps you want to upload your logs somewhere for us to take a look. There have been issues around name resolution, like this one:
> > >
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1506
> > >
> > > Is there any chance it could be related to it?
> > >
> > > -Flavio
> > >
> > > On 26 Jan 2016, at 22:19, Jim Wong <[email protected]> wrote:
> > >
> > > > Folks,
> > > >
> > > > Does anyone have any thoughts on this one? We can semi-reliably cause our cluster to get into a seemingly permanent bad state--at least several minutes, which is longer than we can tolerate in production--just by terminating the leader. We’ve seen recommendations to change things like the amount of time allotted for the database to be reconstructed from the last snapshot, but none of them have helped.
> > > >
> > > > If nothing else, are there any particular log messages we should be looking for or troubleshooting strategies we can pursue?
> > > >
> > > > On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <[email protected]> wrote:
> > > >
> > > > > This may be related to the post "All nodes shutting down simultaneously", but I'm not sure. I was able to reproduce this quite a few times. I terminated the leader of a 5-node ZooKeeper cluster (terminated from the AWS console). All the remaining hosts saw the leader disappear:
> > > > >
> > > > > Jan 20 18:40:05 localhost zookeeper: WARN [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> > > > > Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
> > > > >
> > > > > Then it looks like a shutdown of the ZooKeeperServer and its processors:
> > > > >
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting down
> > > > >
> > > > > I’m not sure whether we should see startup logs or not. Then we see the leader election appear to succeed (I’m not sure whether the “ZooKeeperServer not running” log is ominous or not):
> > > > >
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /zookeeper/data/version-2/snapshot.8005dc366
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > > ...
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 347
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zookeeper-5-internal.prod.airtime.com to address: zookeeper-5-internal.prod.airtime.com/10.1.24.208
> > > > >
> > > > > However, attempts to connect to the new leader just fail forever:
> > > > >
> > > > > Jan 20 18:40:06 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=0, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > > > Jan 20 18:40:07 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=1, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > > > Jan 20 18:40:08 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=2, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > > > Jan 20 18:40:09 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=3, connecting to zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > > >
> > > > > Then it throws an exception and shuts down:
> > > > >
> > > > > Jan 20 18:40:10 localhost zookeeper: WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> > > > > Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > > > > Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> > > > > Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > > > > Jan 20 18:40:10 localhost at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 2, proposed zxid=0x8005e12e8
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round), LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > >
> > > > > And then the fast leader election fails essentially forever:
> > > > >
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
> > > > > Jan 20 18:40:10 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
> > > > > Jan 20 18:40:11 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
> > > > > Jan 20 18:40:13 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200
> > > > > Jan 20 18:40:16 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400
> > > > > Jan 20 18:40:22 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800
> > > > > Jan 20 18:40:35 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600
> > > > > Jan 20 18:41:01 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200
> > > > > Jan 20 18:41:52 localhost zookeeper: INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000
> > > > >
> > > > > Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com), it doesn’t seem like it ever recognizes that it won the original election. I’m not sure whether that’s because it can’t start or something else.
> > > > >
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 5, proposed zxid=0x8005e12e8
> > > > > Jan 20 18:40:05 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800
> > > > > Jan 20 18:40:06 localhost zookeeper: INFO [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round), LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > > > Jan 20 18:40:07 localhost zookeeper: INFO [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600
> > > > >
> > > > > A few additional notes:
> > > > >
> > > > > 1. Restarting ZooKeeper on all hosts seemed to resolve the issue.
> > > > > 2. This was easily reproducible for me in three different ZooKeeper clusters in AWS on Jan 20/21. Other things, like killing follower hosts, worked fine.
> > > > > 3. I wasn’t able to reproduce this yesterday (Jan 24th), so I wonder whether something network-related has to be present for this failure to occur, while everything else keeps working.
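One more thing I'll try to grab next time it reproduces: each peer's own view of its role, since zookeeper-5 never seems to act on having won the election. Something along these lines should do it (hostnames below are placeholders for our five nodes; it assumes the four-letter-word commands are reachable on the 2181 client port):

  # Ask every ensemble member for its current mode via the 'stat' four-letter word.
  # A healthy ensemble shows one "Mode: leader" and the rest "Mode: follower";
  # a stuck peer instead reports that it is not currently serving requests.
  for h in zk-1 zk-2 zk-3 zk-4 zk-5; do
    echo "== $h =="
    echo stat | nc -w 2 "$h" 2181
  done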
