Hi, I have 5 box zookeeper quorum. 2 / 5 are unable to deserialize the message from leader and hence they are not coming up. I tried restarting few times but still seeing the same issue. This issue is observed when we were trying to do the following:
1> We have a script which brings down the zookeeper server on the leader. We are trying to test if new follower takes over once leader is down. 2> The script has been running since 2-3 days. Surprisingly we don't see any issue for the first 2 days but after that we hit the below issue wherein 2 out 5 zookeeper machines are unable to deserialize the message. zookeeper server = 3.4.5 Following are the zookeeper settings: tickTime=4000 initLimit=20 syncLimit=15 snapCount=100000 Logs from Follower --------------- Thu Mar 28 19:14:14 2013: 2013-03-28 19:14:14,158 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - WARN [QuorumPeer:/0.0.0.0:2181:Follower@82] - Exception when following the leader Thu Mar 28 19:14:41 2013: java.io.EOFException Thu Mar 28 19:14:41 2013: at java.io.DataInputStream.readInt(DataInputStream.java:392) Thu Mar 28 19:14:41 2013: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84) Thu Mar 28 19:14:41 2013: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - INFO [QuorumPeer:/0.0.0.0:2181:Follower@165] - shutdown called Thu Mar 28 19:14:41 2013: java.lang.Exception: shutdown Follower Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO [QuorumPeer:/0.0.0.0:2181:QuorumPeer@621] - LOOKING Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id = 1, Proposed zxid = 0 Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,813 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 3 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), LEADING (n.state), 4 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 5 (n.sid), FOLLOWING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [QuorumPeer:/0.0.0.0:2181:QuorumPeer@643] - FOLLOWING Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,815 - INFO [QuorumPeer:/0.0.0.0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /home/y/var/zookeeper/version-2 snapdir /home/y/var/zookeeper/version-2 Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,817 - INFO [QuorumPeer:/0.0.0.0:2181:Learner@294] - Getting a snapshot from leader Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,819 - INFO [QuorumPeer:/0.0.0.0:2181:Learner@325] - Setting leader epoch 15 Thu Mar 28 19:14:54 2013: 2013-03-28 19:14:54,208 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) Logs from Leader ---------------------------------------------------- Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,416 - ERROR [LearnerHandler-/98.139.207.73:37505:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open Thu Mar 28 19:18:14 2013: java.net.SocketTimeoutException: Read timed out Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.socketRead0(Native Method) Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.read(SocketInputStream.java:150) Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.read(SocketInputStream.java:121) Thu Mar 28 19:18:14 2013: at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) Thu Mar 28 19:18:14 2013: at java.io.BufferedInputStream.read(BufferedInputStream.java:254) Thu Mar 28 19:18:14 2013: at java.io.DataInputStream.readInt(DataInputStream.java:387) Thu Mar 28 19:18:14 2013: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) Thu Mar 28 19:18:14 2013: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) Thu Mar 28 19:18:14 2013: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) Thu Mar 28 19:18:14 2013: at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450) Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,417 - WARN [LearnerHandler-/98.139.207.73:37505:LearnerHandler@575] - ******* GOODBYE /98.139.207.73:37505 ******** Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@273] - Backward compatibility mode, server id=2 Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO [WorkerReceiver[myid=4]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x3 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LEADING (my state) Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,423 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@263] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@265d646b<mailto:org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@265d646b> Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@318] - Synchronizing with Follower sid: 2 maxCommittedLog=0x15000000d8 minCommittedLog=0x1400000001 peerLastZxid=0x0 Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - WARN [LearnerHandler-/98.139.207.73:37511:LearnerHandler@379] - Unhandled proposal scenario Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@395] - Sending SNAP Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x0 zxid of leader is 0x15000000d8sent zxid of db as 0x15000000d8 Can you please help how we can fix this issue or if this behavior is expected? ~Subramanyam
