Hi Subramanyeswara, It can take quite a bit of time for the leader to communicate the znode data to the followers during startup. initLimit is the amount of time the leader will allow the followers to download the data from the leader and get to a point where they are ready to serve requests to clients. If this time is exceeded, the leader will close the connection to the follower and the quorum process will restart. Your current initLimit value of 20 sec is typically more than enough time. That said, if you have a slow connection or the data is very large it might be necessary to increase this.
The syncLimit controls the amount of time the quorum members will allow each other to communicate. If the leader doesn't hear from a follower in the syncLimit time, it will drop the follower from the quorum, and vice versa. It's important not to set this parameter too high as it's one of the ways a server detects networking issues and causes recovery to take place (i.e. the follower will drop out of the quorum and try to reconnect, all clients on that follower will disconnect and reconnect to another server, etc...). Can you try increasing both initLimit and syncLimit to 100 on all ZK servers and restarting? Note that it's not good to have syncLimit be such a high value - fixed by ZOOKEEPER-1521. Regards, Kate On Sat, Mar 30, 2013 at 4:12 PM, Subramanyeswara Bhavirisetty <[email protected]> wrote: > > Hi, > > I have 5 box zookeeper quorum. 2 / 5 are unable to deserialize the message > from leader and hence they are not coming up. I tried restarting few times > but still seeing the same issue. This issue is observed when we were trying > to do the following: > > > 1> We have a script which brings down the zookeeper server on the leader. > We are trying to test if new follower takes over once leader is down. > > 2> The script has been running since 2-3 days. Surprisingly we don't see > any issue for the first 2 days but after that we hit the below issue wherein > 2 out 5 zookeeper machines are unable to deserialize the message. > > zookeeper server = 3.4.5 > > Following are the zookeeper settings: > > tickTime=4000 > initLimit=20 > syncLimit=15 > snapCount=100000 > > Logs from Follower > > --------------- > > Thu Mar 28 19:14:14 2013: 2013-03-28 19:14:14,158 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 > (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - WARN > [QuorumPeer:/0.0.0.0:2181:Follower@82] - Exception when following the leader > Thu Mar 28 19:14:41 2013: java.io.EOFException > Thu Mar 28 19:14:41 2013: at > java.io.DataInputStream.readInt(DataInputStream.java:392) > Thu Mar 28 19:14:41 2013: at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84) > Thu Mar 28 19:14:41 2013: at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - INFO > [QuorumPeer:/0.0.0.0:2181:Follower@165] - shutdown called > Thu Mar 28 19:14:41 2013: java.lang.Exception: shutdown Follower > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165) > Thu Mar 28 19:14:41 2013: at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO > [QuorumPeer:/0.0.0.0:2181:QuorumPeer@621] - LOOKING > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO > [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id = 1, > Proposed zxid = 0 > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,813 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 3 > (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 > (n.zxid), 2 (n.round), FOLLOWING (n.state), 3 (n.sid), LOOKING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 > (n.zxid), 2 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 > (n.zxid), 2 (n.round), LEADING (n.state), 4 (n.sid), LOOKING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051 > (n.zxid), 2 (n.round), FOLLOWING (n.state), 5 (n.sid), FOLLOWING (my state) > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO > [QuorumPeer:/0.0.0.0:2181:QuorumPeer@643] - FOLLOWING > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,815 - INFO > [QuorumPeer:/0.0.0.0:2181:ZooKeeperServer@151] - Created server with tickTime > 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir > /home/y/var/zookeeper/version-2 snapdir /home/y/var/zookeeper/version-2 > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,817 - INFO > [QuorumPeer:/0.0.0.0:2181:Learner@294] - Getting a snapshot from leader > Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,819 - INFO > [QuorumPeer:/0.0.0.0:2181:Learner@325] - Setting leader epoch 15 > Thu Mar 28 19:14:54 2013: 2013-03-28 19:14:54,208 - INFO [WorkerReceiver > Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3 > (n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state) > > > Logs from Leader > > ---------------------------------------------------- > > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,416 - ERROR > [LearnerHandler-/98.139.207.73:37505:LearnerHandler@562] - Unexpected > exception > causing shutdown while sock still open > Thu Mar 28 19:18:14 2013: java.net.SocketTimeoutException: Read timed out > Thu Mar 28 19:18:14 2013: at > java.net.SocketInputStream.socketRead0(Native Method) > Thu Mar 28 19:18:14 2013: at > java.net.SocketInputStream.read(SocketInputStream.java:150) > Thu Mar 28 19:18:14 2013: at > java.net.SocketInputStream.read(SocketInputStream.java:121) > Thu Mar 28 19:18:14 2013: at > java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > Thu Mar 28 19:18:14 2013: at > java.io.BufferedInputStream.read(BufferedInputStream.java:254) > Thu Mar 28 19:18:14 2013: at > java.io.DataInputStream.readInt(DataInputStream.java:387) > Thu Mar 28 19:18:14 2013: at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > Thu Mar 28 19:18:14 2013: at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > Thu Mar 28 19:18:14 2013: at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) > Thu Mar 28 19:18:14 2013: at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450) > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,417 - WARN > [LearnerHandler-/98.139.207.73:37505:LearnerHandler@575] - ******* GOODBYE > /98.139.207.73:37505 ******** > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO > [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@273] - > Backward compatibility mode, server id=2 > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,420 - INFO > [WorkerReceiver[myid=4]:FastLeaderElection@542] - Notification: 2 (n.leader), > 0x0 (n.zxid), 0x3 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), > LEADING (my state) > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,423 - INFO > [LearnerHandler-/98.139.207.73:37511:LearnerHandler@263] - Follower sid: 2 : > info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@265d646b > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO > [LearnerHandler-/98.139.207.73:37511:LearnerHandler@318] - Synchronizing with > Follower sid: 2 maxCommittedLog=0x15000000d8 minCommittedLog=0x1400000001 > peerLastZxid=0x0 > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - WARN > [LearnerHandler-/98.139.207.73:37511:LearnerHandler@379] - Unhandled proposal > scenario > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO > [LearnerHandler-/98.139.207.73:37511:LearnerHandler@395] - Sending SNAP > Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,424 - INFO > [LearnerHandler-/98.139.207.73:37511:LearnerHandler@419] - Sending snapshot > last zxid of peer is 0x0 zxid of leader is 0x15000000d8sent zxid of db as > 0x15000000d8 > > Can you please help how we can fix this issue?
