Hello Jared, After the leader goes into this state, what does its dataDir look like? Do you see any files named snapshot.*? Would you be able to share ls output or similar showing those files and their sizes?
It seems the only way this condition could occur is if a valid snapshot could not be found. The relevant validation code is in FileSnap#findNValidSnapshots. This can return an empty list if no valid snapshots are found. Then, further up the call stack, FileSnap#findMostRecentSnapshot would map empty list to null, and ZKDatabase#calculateTxnLogSizeLimit isn't equipped to handle that null. --Chris Nauroth On 5/28/15, 3:34 PM, "Jared Cantwell" <[email protected]> wrote: >We are running a test on a 3-node Zookeeper setup where we power off one >of >the nodes for 30 minutes and then power it back on. After the power on, >the leader (who was not powered off) gets the following errors and then >loops with NullPointerException indefinitely, preventing the restarting >follower from joining the quorum. > >I could not find a Jira around this when searching for >NullPointerException >and/or ZKDatabase. It doesn't happen all the time, so we don't believe >its >a configuration problem. Has anyone seen this before? > >Thanks! >~Jared > >-------------------------------------------------------------------------- >-------------------------------------------------------- >May 25 05:43:13.501866 zookeeper - INFO [/11.46.65.86:2183 >:QuorumCnxManager$Listener@550] - Received connection request / >11.46.65.76:39661 >May 25 05:43:13.502395 zookeeper - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@738] - Interrupted while >waiting >for message on queue >May 25 05:43:13.502395 *java.lang.InterruptedException* >May 25 05:43:13.502404 localhost at >java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.repo >rtInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) >May 25 05:43:13.502422 localhost at >java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai >tNanos(AbstractQueuedSynchronizer.java:2095) >May 25 05:43:13.502437 localhost at >java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) >May 25 05:43:13.502451 localhost at >org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCn >xManager.java:890) >May 25 05:43:13.502465 localhost at >org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxMa >nager.java:60) >May 25 05:43:13.502479 localhost at >org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumC >nxManager.java:726) >May 25 05:43:13.502565 zookeeper - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@747] - Send worker leaving >thread id 1 my id = 8 >May 25 05:43:13.503007 zookeeper - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@821] - Connection broken for id >1, my id = 8, error = >May 25 05:43:13.503007 localhost java.net.SocketException: Socket closed >May 25 05:43:13.503017 localhost at >java.net.SocketInputStream.socketRead0(Native Method) >May 25 05:43:13.503048 localhost at >java.net.SocketInputStream.read(SocketInputStream.java:152) >May 25 05:43:13.503048 localhost at >java.net.SocketInputStream.read(SocketInputStream.java:122) >May 25 05:43:13.503173 localhost at >java.net.SocketInputStream.read(SocketInputStream.java:210) >May 25 05:43:13.503173 localhost at >java.io.DataInputStream.readInt(DataInputStream.java:387) >May 25 05:43:13.503179 localhost at >org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumC >nxManager.java:806) >May 25 05:43:13.503295 zookeeper - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@824] - Interrupting SendWorker >May 25 05:43:13.507255 zookeeper - INFO > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 >(n.leader), 0xb00000ba7 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) >May 25 05:43:13.509960 zookeeper - INFO > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 >(n.leader), 0xb00000ba7 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) >May 25 05:43:13.529574 zookeeper - INFO >[LearnerHandler-/11.46.65.76:51281 >:LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 >:2183:participant;11.46.65.76:2181 >May 25 05:43:13.529782 zookeeper - INFO >[LearnerHandler-/11.46.65.76:51281 >:LearnerHandler@646] - Synchronizing with Follower sid: 1 >maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd >lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 >May 25 05:43:13.529986 zookeeper - WARN >[LearnerHandler-/11.46.65.76:51281 >:LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51281 ******** >May 25 05:43:13.530513 zookeeper - ERROR >[LearnerHandler-/11.46.65.76:51281 >:NIOServerCnxnFactory$1@92] - Thread LearnerHandler >Socket[addr=/11.46.65.76,port=51281,localport=2182] >tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died >May 25 05:43:13.530513 *java.lang.NullPointerException* >May 25 05:43:13.530538 localhost at >org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase >.java:276) >May 25 05:43:13.530538 localhost at >org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHand >ler.java:716) >May 25 05:43:13.530540 localhost at >org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java: >404) >May 25 05:43:13.782834 zookeeper - INFO > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 >(n.leader), 0xb00000ba7 (n.zxid), 0x0 (n.round), LOOKING (n.state), 1 >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) >May 25 05:43:13.785988 zookeeper - INFO >[LearnerHandler-/11.46.65.76:51282 >:LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 >:2183:participant;11.46.65.76:2181 >May 25 05:43:13.786310 zookeeper - INFO >[LearnerHandler-/11.46.65.76:51282 >:LearnerHandler@646] - Synchronizing with Follower sid: 1 >maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd >lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 >May 25 05:43:13.786611 zookeeper - WARN >[LearnerHandler-/11.46.65.76:51282 >:LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51282 ******** >May 25 05:43:13.787048 zookeeper - ERROR >[LearnerHandler-/11.46.65.76:51282 >:NIOServerCnxnFactory$1@92] - Thread LearnerHandler >Socket[addr=/11.46.65.76,port=51282,localport=2182] >tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died >May 25 05:43:13.787048 java.lang.NullPointerException >May 25 05:43:13.787092 localhost at >org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase >.java:276) >May 25 05:43:13.787116 localhost at >org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHand >ler.java:716) >May 25 05:43:13.787138 localhost at >org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java: >404)
