Hi Patrick, Thanks for your reply. We use same configuration for all zookeepers:
tickTime=2000 initLimit=5 syncLimit=2 dataDir=/app/lib/zookeeper clientPort=2181 autopurge.snapRetainCount=3 autopurge.purgeInterval=1 server.1=zk1.company.com:2888:3888 server.2=zk2.company.com:2888:3888 server.3=zk3.company.com:2888:3888 Thank you. ________________________________________ From: Patrick Hunt <[email protected]> Sent: Wednesday, May 13, 2015 12:56 AM To: UserZooKeeper Subject: Re: Zookeeper cluster suddenly crashes. Zk version 3.4.6 Hi Oleksandr, I think this is the issue, from the leader: "Closing connection to peer due to transaction timeout." Can you send your configuration from the servers? It seems that you're exceeding the sync limit (see LearnerHandler@687). Perhaps you have some network or other activity at that time of night that's causing the servers to fail to communicate in a timely fashion? (within the sync limit) Patrick On Mon, May 11, 2015 at 10:36 AM, Oleksandr Sokalskyi1 < [email protected]> wrote: > Hi All, > > > > We are using Apache Storm with Apache Zookeepers. > > In our configuration we have Zookeeper cluster with 3 nodes in it. > Everything was well. But recently we faced issue that zookeepers got > restarted during night, although there are no calculations or any load > during at that period. And it continuously happening on each night. > > Zookeeper version is 3.4.6. > > > > Here are the logs for time of failure. > > > > On follower: > > 2015-05-02 08:20:52,603 [myid:3] - INFO > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started. > > 2015-05-02 08:20:53,044 [myid:3] - INFO > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed. > > 2015-05-02 08:45:03,416 [myid:3] - WARN [QuorumPeer[myid=3]/0.0.0.0:2181 > :Follower@89] - Exception when following the leader > > java.io.EOFException > > at > java.io.DataInputStream.readInt(DataInputStream.java:392) > > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > > at > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) > > at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) > > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) > > 2015-05-02 08:45:03,417 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :Follower@166] - shutdown called > > java.lang.Exception: shutdown Follower > > at > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) > > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790) > > 2015-05-02 08:45:03,418 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.15:41692 which had sessionid 0x34cef2f0370017c > > 2015-05-02 08:45:03,418 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.15:41687 which had sessionid 0x34cef2f0370017a > > 2015-05-02 08:45:03,419 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.19:40478 which had sessionid 0x34cef2f0370017d > > 2015-05-02 08:45:03,419 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.19:40474 which had sessionid 0x34cef2f0370017b > > 2015-05-02 08:45:03,420 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.17:55431 which had sessionid 0x34cef2f03700179 > > 2015-05-02 08:45:03,420 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :FollowerZooKeeperServer@139] - Shutting down > > 2015-05-02 08:45:03,420 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :ZooKeeperServer@441] - shutting down > > 2015-05-02 08:45:03,420 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :FollowerRequestProcessor@105] - Shutting down > > 2015-05-02 08:45:03,421 [myid:3] - INFO > [FollowerRequestProcessor:3:FollowerRequestProcessor@95] - > FollowerRequestProcessor exited loop! > > 2015-05-02 08:45:03,421 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :CommitProcessor@181] - Shutting down > > 2015-05-02 08:45:03,422 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :FinalRequestProcessor@415] - shutdown of request processor complete > > 2015-05-02 08:45:03,422 [myid:3] - INFO > [CommitProcessor:3:CommitProcessor@150] - CommitProcessor exited loop! > > 2015-05-02 08:45:03,422 [myid:3] - INFO [QuorumPeer[myid=3]/0.0.0.0:2181 > :SyncRequestProcessor@209] - Shutting down > > > > > > On leader: > > > > 2015-05-02 08:20:45,078 [myid:2] - INFO > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started. > > 2015-05-02 08:20:45,403 [myid:2] - INFO > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed. > > 2015-05-02 08:45:03,415 [myid:2] - WARN [QuorumPeer[myid=2]/0.0.0.0:2181 > :LearnerHandler@687] - Closing connection to peer due to transaction > timeout. > > 2015-05-02 08:45:03,415 [myid:2] - WARN [QuorumPeer[myid=2]/0.0.0.0:2181 > :LearnerHandler@687] - Closing connection to peer due to transaction > timeout. > > 2015-05-02 08:45:03,416 [myid:2] - WARN > [LearnerHandler-/148.112.48.199:43202:LearnerHandler@646] - ******* > GOODBYE /148.112.48.199:43202 ******** > > 2015-05-02 08:45:03,417 [myid:2] - WARN > [LearnerHandler-/148.112.48.199:43202:LearnerHandler@658] - Ignoring > unexpected exception > > java.lang.InterruptedException > > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > > at > org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656) > > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649) > > 2015-05-02 08:45:03,417 [myid:2] - WARN > [LearnerHandler-/148.112.48.201:33870:LearnerHandler@646] - ******* > GOODBYE /148.112.48.201:33870 ******** > > 2015-05-02 08:45:03,417 [myid:2] - WARN > [LearnerHandler-/148.112.48.201:33870:LearnerHandler@658] - Ignoring > unexpected exception > > java.lang.InterruptedException > > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > > at > org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656) > > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649) > > 2015-05-02 08:45:03,536 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket > connection from /10.49.5.16:53812 > > 2015-05-02 08:45:03,539 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew > session 0x14cef2ee4c80171 at /10.49.5.16:53812 > > 2015-05-02 08:45:03,539 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session > 0x14cef2ee4c80171 with negotiated timeout 20000 for client / > 10.49.5.16:53812 > > 2015-05-02 08:45:03,582 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket > connection from /10.49.5.19:37378 > > 2015-05-02 08:45:03,585 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew > session 0x14cef2ee4c80173 at /10.49.5.19:37378 > > 2015-05-02 08:45:03,586 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session > 0x14cef2ee4c80173 with negotiated timeout 20000 for client / > 10.49.5.19:37378 > > 2015-05-02 08:45:03,606 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket > connection from /10.49.5.17:40379 > > ............ > > 2015-05-02 08:45:04,379 [myid:2] - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@617] - Established session > 0x14cef2ee4c8016f with negotiated timeout 20000 for client / > 10.49.5.18:52703 > > 2015-05-02 08:45:04,416 [myid:2] - INFO [QuorumPeer[myid=2]/0.0.0.0:2181 > :Leader@493] - Shutting down > > 2015-05-02 08:45:04,416 [myid:2] - INFO [QuorumPeer[myid=2]/0.0.0.0:2181 > :Leader@499] - Shutdown called > > java.lang.Exception: shutdown Leader! reason: Not sufficient followers > synced, only synced with sids: [ 2 ] > > at > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499) > > at > org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:474) > > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799) > > 2015-05-02 08:45:04,417 [myid:2] - INFO [QuorumPeer[myid=2]/0.0.0.0:2181 > :NIOServerCnxn@1007] - Closed socket connection for client / > 10.49.5.17:53780 which had sessionid 0x24cef2ee4cf016a > > > > > > I don't understand why it's happening during nights, when there are no > load. And everything running ok during day. > > Can you help me to understand the root cause. > > Will increasing syncLimit parameter? resolve this issue? > > > > Thanks. > >
