Try turning on verbose GC logging and look if it's GC. In my experience that's often the cause. "client application suddenly loses its Zookeeper session" - might be client GC, or it could be server GC. See if either of those shed light.
Patrick On Sat, Apr 6, 2019 at 1:10 PM Norbert Kalmar <[email protected]> wrote: > Hi Gelbana, > > max_latency tells you the time elapsed between creating the request and > FinalRequestProcessor processing it. So the cause for being that high could > be basically anything. > Turning on debug log for ZooKeeper server could help pinpoint at what point > the request get stuck for so long. > > Regards, > Norbert > > On Wed, Apr 3, 2019 at 3:07 PM Muhammad Gelbana <[email protected]> > wrote: > > > Another couple of things I found: > > > > *A couple of Zookeeper client threads are stuck at these stacktraces for > > ~30 seconds* > > "pool-2-thread-1-EventThread" #1218 daemon prio=5 os_prio=0 > > tid=0x00007ff3f5e23800 nid=0x5cd8 waiting on condition > [0x00007ff3ef803000] > > java.lang.Thread.State: WAITING (parking) > > at sun.misc.Unsafe.park(Native Method) > > - parking to wait for <0x000000018d6d8ed8> (a > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > at > > > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > > at > > > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:504) > > > > "pool-2-thread-1-SendThread(72.55.136.25:2181)" #1217 daemon prio=5 > > os_prio=0 tid=0x00007ff3f5e23000 nid=0x5cd7 runnable [0x00007ff3ef904000] > > java.lang.Thread.State: RUNNABLE > > at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) > > at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) > > at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) > > at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) > > - locked <0x000000018d68a730> (a sun.nio.ch.Util$3) > > - locked <0x000000018d68a720> (a > java.util.Collections$UnmodifiableSet) > > - locked <0x000000018d68a258> (a sun.nio.ch.EPollSelectorImpl) > > at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) > > at > > > > > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349) > > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145) > > > > *Running the mntr command returned the following stats* > > zk_version 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on > > 06/29/2018 04:05 GMT > > zk_avg_latency 0 > > zk_max_latency *17657* > > zk_min_latency 0 > > zk_packets_received 1427134 > > zk_packets_sent 1596974 > > zk_num_alive_connections 64 > > zk_outstanding_requests 0 > > zk_server_state follower > > zk_znode_count 1394 > > zk_watch_count 592 > > zk_ephemerals_count 192 > > zk_approximate_data_size 181257 > > zk_open_file_descriptor_count 94 > > zk_max_file_descriptor_count 1048576 > > zk_fsync_threshold_exceed_count 1 > > > > I find the *zk_max_latency* extremely hight. I'm wondering what kind of > > latency is that ? How can I debug the reason for this value ? > > > > Thanks, > > Gelbana > > > > > > > > On Wed, Apr 3, 2019 at 1:42 PM Muhammad Gelbana <[email protected]> > > wrote: > > > > > I'm trying to debug a problem where our client application suddenly > loses > > > its Zookeeper session. I concluded that by looking at the Zookeeper > > server > > > logs. > > > > > > I increased the logging details for the client and found the following > > log > > > messages > > > > > >> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for > sessionid: > > >> 0x3000da76fa904b6 after 0ms > > >> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse] > > >> > > > > > > I noticed that the duration between each log message is ~33 seconds > while > > > on another environment (my laptop), the duration goes down to ~1 > second. > > > What could be causing this huge difference ? I doubt that whatever is > > > causing this effect causes the delay to increase significantly at some > > > point to the extend that makes my client lose its session. > > > > > >
