Thanks Chris. I will collect the snapshot 'ls' and report back. We don't collect this information on failure by default, so I'll have to reproduce.
~Jared On Thu, May 28, 2015 at 5:41 PM, Chris Nauroth <[email protected]> wrote: > Hello Jared, > > After the leader goes into this state, what does its dataDir look like? > Do you see any files named snapshot.*? Would you be able to share ls > output or similar showing those files and their sizes? > > It seems the only way this condition could occur is if a valid snapshot > could not be found. The relevant validation code is in > FileSnap#findNValidSnapshots. This can return an empty list if no valid > snapshots are found. Then, further up the call stack, > FileSnap#findMostRecentSnapshot would map empty list to null, and > ZKDatabase#calculateTxnLogSizeLimit isn't equipped to handle that null. > > --Chris Nauroth > > > > > On 5/28/15, 3:34 PM, "Jared Cantwell" <[email protected]> wrote: > > >We are running a test on a 3-node Zookeeper setup where we power off one > >of > >the nodes for 30 minutes and then power it back on. After the power on, > >the leader (who was not powered off) gets the following errors and then > >loops with NullPointerException indefinitely, preventing the restarting > >follower from joining the quorum. > > > >I could not find a Jira around this when searching for > >NullPointerException > >and/or ZKDatabase. It doesn't happen all the time, so we don't believe > >its > >a configuration problem. Has anyone seen this before? > > > >Thanks! > >~Jared > > > >-------------------------------------------------------------------------- > >-------------------------------------------------------- > >May 25 05:43:13.501866 zookeeper - INFO [/11.46.65.86:2183 > >:QuorumCnxManager$Listener@550] - Received connection request / > >11.46.65.76:39661 > >May 25 05:43:13.502395 zookeeper - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@738] - Interrupted while > >waiting > >for message on queue > >May 25 05:43:13.502395 *java.lang.InterruptedException* > >May 25 05:43:13.502404 localhost at > >java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.repo > >rtInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) > >May 25 05:43:13.502422 localhost at > >java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awai > >tNanos(AbstractQueuedSynchronizer.java:2095) > >May 25 05:43:13.502437 localhost at > >java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) > >May 25 05:43:13.502451 localhost at > >org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCn > >xManager.java:890) > >May 25 05:43:13.502465 localhost at > >org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxMa > >nager.java:60) > >May 25 05:43:13.502479 localhost at > >org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumC > >nxManager.java:726) > >May 25 05:43:13.502565 zookeeper - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@747] - Send worker leaving > >thread id 1 my id = 8 > >May 25 05:43:13.503007 zookeeper - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@821] - Connection broken for > id > >1, my id = 8, error = > >May 25 05:43:13.503007 localhost java.net.SocketException: Socket closed > >May 25 05:43:13.503017 localhost at > >java.net.SocketInputStream.socketRead0(Native Method) > >May 25 05:43:13.503048 localhost at > >java.net.SocketInputStream.read(SocketInputStream.java:152) > >May 25 05:43:13.503048 localhost at > >java.net.SocketInputStream.read(SocketInputStream.java:122) > >May 25 05:43:13.503173 localhost at > >java.net.SocketInputStream.read(SocketInputStream.java:210) > >May 25 05:43:13.503173 localhost at > >java.io.DataInputStream.readInt(DataInputStream.java:387) > >May 25 05:43:13.503179 localhost at > >org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumC > >nxManager.java:806) > >May 25 05:43:13.503295 zookeeper - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@824] - Interrupting SendWorker > >May 25 05:43:13.507255 zookeeper - INFO > > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 > >(n.leader), 0xb00000ba7 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 > >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) > >May 25 05:43:13.509960 zookeeper - INFO > > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 > >(n.leader), 0xb00000ba7 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 > >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) > >May 25 05:43:13.529574 zookeeper - INFO > >[LearnerHandler-/11.46.65.76:51281 > >:LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 > >:2183:participant;11.46.65.76:2181 > >May 25 05:43:13.529782 zookeeper - INFO > >[LearnerHandler-/11.46.65.76:51281 > >:LearnerHandler@646] - Synchronizing with Follower sid: 1 > >maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd > >lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 > >May 25 05:43:13.529986 zookeeper - WARN > >[LearnerHandler-/11.46.65.76:51281 > >:LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51281 ******** > >May 25 05:43:13.530513 zookeeper - ERROR > >[LearnerHandler-/11.46.65.76:51281 > >:NIOServerCnxnFactory$1@92] - Thread LearnerHandler > >Socket[addr=/11.46.65.76,port=51281,localport=2182] > >tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died > >May 25 05:43:13.530513 *java.lang.NullPointerException* > >May 25 05:43:13.530538 localhost at > >org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase > >.java:276) > >May 25 05:43:13.530538 localhost at > >org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHand > >ler.java:716) > >May 25 05:43:13.530540 localhost at > >org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java: > >404) > >May 25 05:43:13.782834 zookeeper - INFO > > [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 > >(n.leader), 0xb00000ba7 (n.zxid), 0x0 (n.round), LOOKING (n.state), 1 > >(n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) > >May 25 05:43:13.785988 zookeeper - INFO > >[LearnerHandler-/11.46.65.76:51282 > >:LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 > >:2183:participant;11.46.65.76:2181 > >May 25 05:43:13.786310 zookeeper - INFO > >[LearnerHandler-/11.46.65.76:51282 > >:LearnerHandler@646] - Synchronizing with Follower sid: 1 > >maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd > >lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 > >May 25 05:43:13.786611 zookeeper - WARN > >[LearnerHandler-/11.46.65.76:51282 > >:LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51282 ******** > >May 25 05:43:13.787048 zookeeper - ERROR > >[LearnerHandler-/11.46.65.76:51282 > >:NIOServerCnxnFactory$1@92] - Thread LearnerHandler > >Socket[addr=/11.46.65.76,port=51282,localport=2182] > >tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died > >May 25 05:43:13.787048 java.lang.NullPointerException > >May 25 05:43:13.787092 localhost at > >org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase > >.java:276) > >May 25 05:43:13.787116 localhost at > >org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHand > >ler.java:716) > >May 25 05:43:13.787138 localhost at > >org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java: > >404) > >
