Regarding "a disk write has taken too long as well": I will check on this, thanks for finding it. The ZK logs are really a bit difficult for me to understand.
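If it does turn out to be disk contention, one change I am considering (just a sketch; the log path is a placeholder and assumes we can attach a separate EBS volume to each instance) is to move the transaction log onto its own device via dataLogDir, so the write-ahead log fsync is not competing with snapshot and other OS write activity:

    # zoo.cfg - snapshots stay on the existing volume
    dataDir=/opt/zookeeper/current/data
    # hypothetical dedicated volume for the transaction (write-ahead) log
    dataLogDir=/opt/zookeeper/txnlog

I would also watch the disk during one of these windows, e.g. with "iostat -x 5" (assuming sysstat is installed on the instances), to see whether await/%util spikes around the time of the fsync warning.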
On Thu, Jan 25, 2018 at 10:19 AM, upendar devu <[email protected]> wrote:

> Thanks for sharing the analysis. The instances are running on EC2, and we
> have Kafka, ZK, Storm and ES instances as well, but we have not seen such
> errors in those components. If there were network latency, there should be
> socket errors in the other components too, since data is being processed
> every second.
>
> Let's hear from the ZooKeeper dev team; hope they will respond.
>
> On Thu, Jan 25, 2018 at 6:39 AM, Andor Molnar <[email protected]> wrote:
>
>> No, this is not the bug I was thinking of.
>>
>> Looks like the network connection is poor between the leader and the
>> follower whose logs were attached. Do you have any other network
>> monitoring tools in place, or do you see any network-related error
>> messages in your kernel logs?
>>
>> The follower lost the connection to the leader:
>>
>> 2018-01-23 07:40:21,709 [myid:3] - WARN
>> [SyncThread:3:SendAckRequestProcessor@64] - Closing connection to leader,
>> exception during packet send
>>
>> ...and took ages to recover: 944 secs!!
>>
>> 2018-01-23 07:56:05,742 [myid:3] - INFO
>> [QuorumPeer[myid=3]/XX.XX.XX:2181:Follower@63] - FOLLOWING - LEADER
>> ELECTION TOOK - 944020
>>
>> Additionally, a disk write has taken too long as well:
>>
>> 2018-01-23 07:40:21,706 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] -
>> fsync-ing the write ahead log in SyncThread:3 took 13638ms which will
>> adversely effect operation latency. See the ZooKeeper troubleshooting
>> guide
>>
>> I believe this is worth taking a closer look at. I'm not a ZooKeeper
>> expert, though; maybe somebody else can give you more insight.
>>
>> Regards,
>> Andor
>>
>>
>> On Wed, Jan 24, 2018 at 7:47 PM, upendar devu <[email protected]>
>> wrote:
>>
>> > Thanks Andor for the reply.
>> >
>> > We are using ZooKeeper version 3.4.6 with 3 instances; please see the
>> > configuration below. I believe we are using the default configuration.
>> > I have attached the ZK log; the issue occurred at First Occurrence:
>> > 01/23/2018 07:42:22, Last Occurrence: 01/23/2018 07:43:22.
>> >
>> > The issue occurs 3 to 4 times a month and gets auto-resolved in a few
>> > minutes, but it is really annoying our operations team. Please let me
>> > know if you need any additional details.
>> >
>> >
>> > # The number of milliseconds of each tick
>> > tickTime=2000
>> >
>> > # The number of ticks that the initial synchronization phase can take
>> > initLimit=10
>> >
>> > # The number of ticks that can pass between sending a request and
>> > # getting an acknowledgement
>> > syncLimit=5
>> >
>> > # The directory where the snapshot is stored.
>> > dataDir=/opt/zookeeper/current/data
>> >
>> > # The port at which the clients will connect
>> > clientPort=2181
>> >
>> > # This is the list of ZooKeeper peers:
>> > server.1=zookeeper1:2888:3888
>> > server.2=zookeeper2:2888:3888
>> > server.3=zookeeper3:2888:3888
>> >
>> > # The interface IP address(es) on which ZooKeeper will listen
>> > clientPortAddress=<IP of zk>
>> >
>> > # The number of snapshots to retain in dataDir
>> > autopurge.snapRetainCount=3
>> >
>> > # Purge task interval in hours
>> > # Set to "0" to disable the auto purge feature
>> > autopurge.purgeInterval=1
>> >
>> >
>> > On Wed, Jan 24, 2018 at 4:51 AM, Andor Molnar <[email protected]>
>> > wrote:
>> >
>> >> Hi Upendar,
>> >>
>> >> Thanks for reporting the issue.
>> >> I have a gut feeling about which existing bug you've run into, but
>> >> would you please share some more detail (version of ZK, log context,
>> >> config files, etc.) so I can be sure?
>> >>
>> >> Thanks,
>> >> Andor
>> >>
>> >>
>> >> On Wed, Jan 17, 2018 at 4:36 PM, upendar devu <[email protected]>
>> >> wrote:
>> >>
>> >> > We are getting the error below twice a month. Though it auto-resolves,
>> >> > can anyone explain why this error occurs and what needs to be done to
>> >> > prevent it? Is this a common error that can be ignored?
>> >> >
>> >> > Please suggest.
>> >> >
>> >> > 2018-01-16 20:36:17,378 [myid:2] - WARN
>> >> > [RecvWorker:3:QuorumCnxManager$RecvWorker@780] - Connection broken
>> >> > for id 3, my id = 2, error = java.net.SocketException: Socket closed
>> >> >     at java.net.SocketInputStream.socketRead0(Native Method)
>> >> >     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>> >> >     at java.net.SocketInputStream.read(SocketInputStream.java:171)
>> >> >     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>> >> >     at java.net.SocketInputStream.read(SocketInputStream.java:224)
>> >> >     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>> >> >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
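On the network monitoring question: a rough sketch of what could be cron'd on each box to watch quorum state and request latency (assuming nc is available and the standard four-letter-word commands, which 3.4.6 serves on the client port, are reachable; the hostnames are the ones from our zoo.cfg):

    # poll ensemble state and request latency for each server
    for h in zookeeper1 zookeeper2 zookeeper3; do
      echo "== $h =="
      echo ruok | nc "$h" 2181     # expect "imok" if the server is up
      echo mntr | nc "$h" 2181 | grep -E 'zk_server_state|zk_avg_latency|zk_max_latency|zk_outstanding_requests'
    done

Depending on the nc variant on the instances, a timeout flag may be needed so the connection closes after the response.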
