Hi, I have 5 box zookeeper quorum. 2 / 5 are unable to deserialize the message from leader and hence they are not coming up. I tried restarting few times but still seeing the same issue. This issue is observed when we were trying to do the following:
1> We have a script which brings down the zookeeper server on the leader. We are trying to test if new follower takes over once leader is down. 2> The script has been running since 2-3 days. Surprisingly we don't see any issue for the first 2 days but after that we hit the below issue wherein 2 out 5 zookeeper machines are unable to deserialize the message. zookeeper server = 3.4.5 Following are the zookeeper settings: tickTime=4000 initLimit=20 syncLimit=15 snapCount=100000 Logs from Follower --------------- Thu Mar 28 19:14:14 2013: 2013-03-28 19:14:14,158 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - WARN [QuorumPeer:/0.0.0.0:2181:Follower@82] - Exception when following the leader Thu Mar 28 19:14:41 2013: java.io.EOFException Thu Mar 28 19:14:41 2013: at java.io.DataInputStream.readInt(DataInputStream.java:392) Thu Mar 28 19:14:41 2013: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84) Thu Mar 28 19:14:41 2013: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - INFO [QuorumPeer:/0.0.0.0:2181:Follower@165] - shutdown called Thu Mar 28 19:14:41 2013: java.lang.Exception: shutdown Follower Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165) Thu Mar 28 19:14:41 2013: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO [QuorumPeer:/0.0.0.0:2181:QuorumPeer@621] - LOOKING Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id = 1, Proposed zxid = 0 Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,813 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 3 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), LEADING (n.state), 4 (n.sid), LOOKING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 (n.zxid), 2 (n.round), FOLLOWING (n.state), 5 (n.sid), FOLLOWING (my state) Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [QuorumPeer:/0.0.0.0:2181:QuorumPeer@643] - FOLLOWING Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,815 - INFO [QuorumPeer:/0.0.0.0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /home/y/var/zookeeper/version-2 snapdir /home/y/var/zookeeper/version-2 Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,817 - INFO [QuorumPeer:/0.0.0.0:2181:Learner@294] - Getting a snapshot from leader Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,819 - INFO [QuorumPeer:/0.0.0.0:2181:Learner@325] - Setting leader epoch 15 Thu Mar 28 19:14:54 2013: 2013-03-28 19:14:54,208 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) Logs from Leader ---------------------------------------------------- Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,416 - ERROR [LearnerHandler-/98.139.207.73:37505:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open Thu Mar 28 19:18:14 2013: java.net.SocketTimeoutException: Read timed out Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.socketRead0(Native Method) Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.read(SocketInputStream.java:150) Thu Mar 28 19:18:14 2013: at java.net.SocketInputStream.read(SocketInputStream.java:121) Thu Mar 28 19:18:14 2013: at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) Thu Mar 28 19:18:14 2013: at java.io.BufferedInputStream.read(BufferedInputStream.java:254) Thu Mar 28 19:18:14 2013: at java.io.DataInputStream.readInt(DataInputStream.java:387) Thu Mar 28 19:18:14 2013: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) Thu Mar 28 19:18:14 2013: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) Thu Mar 28 19:18:14 2013: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) Thu Mar 28 19:18:14 2013: at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450) Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,417 - WARN [LearnerHandler-/98.139.207.73:37505:LearnerHandler@575] - ******* GOODBYE /98.139.207.73:37505 ******** Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@273] - Backward compatibility mode, server id=2 Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO [WorkerReceiver[myid=4]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x0 (n.zxid), 0x3 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LEADING (my state) Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,423 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@263] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@265d646b Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@318] - Synchronizing with Follower sid: 2 maxCommittedLog=0x15000000d8 minCommittedLog=0x1400000001 peerLastZxid=0x0 Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - WARN [LearnerHandler-/98.139.207.73:37511:LearnerHandler@379] - Unhandled proposal scenario Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@395] - Sending SNAP Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO [LearnerHandler-/98.139.207.73:37511:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x0 zxid of leader is 0x15000000d8sent zxid of db as 0x15000000d8 Can you please help how we can fix this issue?
