We are running a test on a 3-node Zookeeper setup where we power off one of the nodes for 30 minutes and then power it back on. After the power on, the leader (who was not powered off) gets the following errors and then loops with NullPointerException indefinitely, preventing the restarting follower from joining the quorum.
I could not find a Jira around this when searching for NullPointerException and/or ZKDatabase. It doesn't happen all the time, so we don't believe its a configuration problem. Has anyone seen this before? Thanks! ~Jared ---------------------------------------------------------------------------------------------------------------------------------- May 25 05:43:13.501866 zookeeper - INFO [/11.46.65.86:2183 :QuorumCnxManager$Listener@550] - Received connection request / 11.46.65.76:39661 May 25 05:43:13.502395 zookeeper - WARN [SendWorker:1:QuorumCnxManager$SendWorker@738] - Interrupted while waiting for message on queue May 25 05:43:13.502395 *java.lang.InterruptedException* May 25 05:43:13.502404 localhost at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) May 25 05:43:13.502422 localhost at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) May 25 05:43:13.502437 localhost at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) May 25 05:43:13.502451 localhost at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:890) May 25 05:43:13.502465 localhost at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) May 25 05:43:13.502479 localhost at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:726) May 25 05:43:13.502565 zookeeper - WARN [SendWorker:1:QuorumCnxManager$SendWorker@747] - Send worker leaving thread id 1 my id = 8 May 25 05:43:13.503007 zookeeper - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@821] - Connection broken for id 1, my id = 8, error = May 25 05:43:13.503007 localhost java.net.SocketException: Socket closed May 25 05:43:13.503017 localhost at java.net.SocketInputStream.socketRead0(Native Method) May 25 05:43:13.503048 localhost at java.net.SocketInputStream.read(SocketInputStream.java:152) May 25 05:43:13.503048 localhost at java.net.SocketInputStream.read(SocketInputStream.java:122) May 25 05:43:13.503173 localhost at java.net.SocketInputStream.read(SocketInputStream.java:210) May 25 05:43:13.503173 localhost at java.io.DataInputStream.readInt(DataInputStream.java:387) May 25 05:43:13.503179 localhost at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:806) May 25 05:43:13.503295 zookeeper - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@824] - Interrupting SendWorker May 25 05:43:13.507255 zookeeper - INFO [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 (n.leader), 0xb00000ba7 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) May 25 05:43:13.509960 zookeeper - INFO [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 (n.leader), 0xb00000ba7 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) May 25 05:43:13.529574 zookeeper - INFO [LearnerHandler-/11.46.65.76:51281 :LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 :2183:participant;11.46.65.76:2181 May 25 05:43:13.529782 zookeeper - INFO [LearnerHandler-/11.46.65.76:51281 :LearnerHandler@646] - Synchronizing with Follower sid: 1 maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 May 25 05:43:13.529986 zookeeper - WARN [LearnerHandler-/11.46.65.76:51281 :LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51281 ******** May 25 05:43:13.530513 zookeeper - ERROR [LearnerHandler-/11.46.65.76:51281 :NIOServerCnxnFactory$1@92] - Thread LearnerHandler Socket[addr=/11.46.65.76,port=51281,localport=2182] tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died May 25 05:43:13.530513 *java.lang.NullPointerException* May 25 05:43:13.530538 localhost at org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase.java:276) May 25 05:43:13.530538 localhost at org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHandler.java:716) May 25 05:43:13.530540 localhost at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:404) May 25 05:43:13.782834 zookeeper - INFO [WorkerReceiver[myid=8]:FastLeaderElection@606] - Notification: 1 (n.leader), 0xb00000ba7 (n.zxid), 0x0 (n.round), LOOKING (n.state), 1 (n.sid), 0xb (n.peerEPoch), LEADING (my state)8000004f3 (n.config version) May 25 05:43:13.785988 zookeeper - INFO [LearnerHandler-/11.46.65.76:51282 :LearnerHandler@359] - Follower sid: 1 : info : 11.46.65.76:2182 :2183:participant;11.46.65.76:2181 May 25 05:43:13.786310 zookeeper - INFO [LearnerHandler-/11.46.65.76:51282 :LearnerHandler@646] - Synchronizing with Follower sid: 1 maxCommittedLog=0xb000011b1 minCommittedLog=0xb00000fbd lastProcessedZxid=0xb000011b1 peerLastZxid=0xb00000ba7 May 25 05:43:13.786611 zookeeper - WARN [LearnerHandler-/11.46.65.76:51282 :LearnerHandler@582] - ******* GOODBYE /11.46.65.76:51282 ******** May 25 05:43:13.787048 zookeeper - ERROR [LearnerHandler-/11.46.65.76:51282 :NIOServerCnxnFactory$1@92] - Thread LearnerHandler Socket[addr=/11.46.65.76,port=51282,localport=2182] tickOfNextAckDeadline:2389 synced?:true queuedPacketLength:1 died May 25 05:43:13.787048 java.lang.NullPointerException May 25 05:43:13.787092 localhost at org.apache.zookeeper.server.ZKDatabase.calculateTxnLogSizeLimit(ZKDatabase.java:276) May 25 05:43:13.787116 localhost at org.apache.zookeeper.server.quorum.LearnerHandler.syncFollower(LearnerHandler.java:716) May 25 05:43:13.787138 localhost at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:404)
