Rushabh Shah created ZOOKEEPER-4673:
---------------------------------------

             Summary: Deadlock in zookeeper client.
                 Key: ZOOKEEPER-4673
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4673
             Project: ZooKeeper
          Issue Type: Bug
          Components: java client
    Affects Versions: 3.4.13
            Reporter: Rushabh Shah


We use apache curator 2.10.0 version
We use zookeeper some version of 3.4.13 (I know its very old. :( ) We also have 
cherry-picked ZOOKEEPER-2251 in our branch.

Threads in deadlock.
{noformat}

Found one Java-level deadlock:
=============================
"Curator-Framework-0":
  waiting to lock monitor 0x00007f34f2163a00 (object 0x00007f344228db00, a 
java.util.LinkedList),
  which is held by "Curator-Framework-0-SendThread(<zk server>:2181)"
"Curator-Framework-0-SendThread(<zk server>:2181)":
  waiting to lock monitor 0x00007f2109731980 (object 0x00007f2270fd02a8, a 
org.apache.curator.ConnectionState),
  which is held by "Curator-Framework-0"



"Curator-Framework-0" #1493 daemon prio=5 os_prio=0 cpu=98.73ms 
elapsed=50812.45s tid=0x00007f2128da5de0 nid=0x2889 waiting for monitor entry  
[0x00007f2053436000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1245)
        - waiting to lock <0x00007f344228db00> (a java.util.LinkedList)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanAndNotifyState(ClientCnxn.java:1174)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.access$2300(ClientCnxn.java:700)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1420)
        at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1373)
        at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:788)
        - locked <0x00007f344228cd78> (a org.apache.zookeeper.ZooKeeper)
        at org.apache.curator.HandleHolder.internalClose(HandleHolder.java:139)
        at org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
        at org.apache.curator.ConnectionState.reset(ConnectionState.java:217)
        - locked <0x00007f2270fd02a8> (a org.apache.curator.ConnectionState)
        at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
        - locked <0x00007f2270fd02a8> (a org.apache.curator.ConnectionState)
        at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
        at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
        at 
java.util.concurrent.FutureTask.run(java.base@11.0.16.1.1/FutureTask.java:264)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.16.1.1/ScheduledThreadPoolExecutor.java:304)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16.1.1/ThreadPoolExecutor.java:1128)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16.1.1/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.16.1.1/Thread.java:829)


"Curator-Framework-0-SendThread(<zkserver>:2181)" #1524807 daemon prio=5 
os_prio=0 cpu=25.58ms elapsed=101.61s tid=0x00007f1f3836d0c0 nid=0x113ad 
waiting for monitor entry  [0x00007f212767b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:176)
        - waiting to lock <0x00007f2270fd02a8> (a 
org.apache.curator.ConnectionState)
        at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
        at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:116)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.doSyncForSuspendedConnection(CuratorFrameworkImpl.java:701)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$700(CuratorFrameworkImpl.java:64)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$7.retriesExhausted(CuratorFrameworkImpl.java:697)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:716)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:516)
        at 
org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:50)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.queuePacket(ClientCnxn.java:462)
        - locked <0x00007f344228e0c0> (a 
java.util.concurrent.LinkedBlockingQueue)
        at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:632)
        at org.apache.zookeeper.ClientCnxn.conLossPacket(ClientCnxn.java:650)
        at org.apache.zookeeper.ClientCnxn.access$2100(ClientCnxn.java:94)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1246)
        - locked <0x00007f344228db00> (a java.util.LinkedList)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1159)

{noformat}

In 3.4.13 version, 
[ClientCnxn#outgoingQueue|https://github.com/apache/zookeeper/blob/branch-3.4.13/src/java/main/org/apache/zookeeper/ClientCnxn.java#L146]
 type is LinkedList NOT LinkedBlockingDeque and we have many places where we 
synchronize on outgoingQueue.

The client application was spending lot of time in garbage collection and the 
request to server were timing out.
Logline from +CuratorFramework-0+ thread before the jstack
{noformat}
 2023-02-03 08:29:41:973 #1493 Timeout error occurred for the packet 
clientPath:null serverPath:null finished:false header:: 0,-11  replyHeader:: 
0,0,0  request:: null response:: null
{noformat}


Logline from +CuratorFramework-0-SendThread+ thread before the jstack.
{noformat}
2023-02-03 08:29:26:371 #1524807 Opening socket connection to server <zk 
server>:2181. Will attempt to SASL-authenticate using Login Context section 
'Client'
2023-02-03 08:29:26.371 #1524807 Socket connection established to <zk 
server>:2181, initiating session
2023-02-03 08:29:26.388 #1524807 Unable to reconnect to ZooKeeper service, 
session 0x1002232ea037548 has expired
{noformat}

+Curator-Framework-0+ thread acquired org.apache.curator.ConnectionState lock 
and waiting for ClientCnxn#outgoingQueue lock
+Curator-Framework-0-SendThread+  acquired ClientCnxn#outgoingQueue lock and 
waiting for org.apache.curator.ConnectionState lock.

>From the above thread dump, Curator-Framework-0 is trying to call 
>SendThread#cleanAndNotifyState which is NOT correct. In my opinion it should 
>just interrupt the sendThread and let SendThread cleanup its state. 
Here is the relevant code 
[link|https://github.com/apache/zookeeper/blob/branch-3.7/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1602-L1604]

{code}
        if (r.getErr() == Code.REQUESTTIMEOUT.intValue()) {
            sendThread.cleanAndNotifyState(); ---> We should NOT call this 
directly instead call sendThread.interrupt and let it clean its state.
        }
{code}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to