sankalp kohli created CASSANDRA-13204:
-----------------------------------------

             Summary: Thread Leak in OutboundTcpConnection
                 Key: CASSANDRA-13204
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
             Project: Cassandra
          Issue Type: Bug
            Reporter: sankalp kohli


We found threads leaking from OutboundTcpConnection. The leaked threads target 
machines which are no longer part of the cluster but are still present in 
Gossip for some reason. There are two issues here; this JIRA covers the second 
one, which is the more important. 



1) The first issue is that Gossip still has information about machines which 
are not in the ring because they have been replaced. This causes Cassandra to 
try to connect to those machines, but due to internode auth it won't be able 
to connect to them at the socket level. 

2) The second issue is a race between creating a connection and closing a 
connection, which is triggered by the Gossip bug described above. Let me try 
to explain it using the code. 

In OutboundTcpConnection, calling closeSocket(true) sets isStopped=true and 
also puts a close sentinel into the backlog queue so that the thread exits. 
On the ack connection, Gossip tries to send a message, which calls connect(). 
That call blocks for 10 seconds, which is the RPC timeout, because Cassandra 
might not be running on the target or internode auth will not let it connect. 
If Gossip calls closeSocket during these 10 seconds, the close sentinel is put 
into the queue. When connect() returns after 10 seconds, we clear the backlog 
queue, which discards the close sentinel too. The thread then goes back to 
waiting on the now-empty queue and never exits, so it leaks. 
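
The race described above can be sketched with a small, self-contained Java 
model. This is not the Cassandra code: the class OutboundLeakSketch, its 
Connection thread, the latches, and the checkStoppedAfterConnect flag are all 
hypothetical names made up for illustration, and the 10 second connect 
timeout is replaced by a latch so that the close request is guaranteed to 
arrive while connect() is still blocked. The extra isStopped check is only 
one possible way to avoid the leak, shown as an assumption, not as the actual 
fix. 

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

public class OutboundLeakSketch {
    // Marker object that tells the connection thread to exit.
    static final Object CLOSE_SENTINEL = new Object();

    /** Simplified, hypothetical stand-in for an outbound connection thread. */
    static class Connection extends Thread {
        final BlockingQueue<Object> backlog = new LinkedBlockingQueue<>();
        final CountDownLatch connectStarted = new CountDownLatch(1);
        final CountDownLatch closeRequested = new CountDownLatch(1);
        final boolean checkStoppedAfterConnect;
        volatile boolean isStopped = false;

        Connection(boolean checkStoppedAfterConnect) {
            super("MessagingService-Outgoing-sketch");
            this.checkStoppedAfterConnect = checkStoppedAfterConnect;
            setDaemon(true); // a leaked thread must not keep the JVM alive
        }

        void enqueue(Object message) {
            backlog.add(message);
        }

        // Models closeSocket(true): set the flag, then queue the sentinel.
        void closeSocket() {
            isStopped = true;
            backlog.add(CLOSE_SENTINEL);
            closeRequested.countDown();
        }

        // Models a connect attempt that blocks and then fails. The latch
        // stands in for the timeout: the close request arrives meanwhile.
        boolean connect() throws InterruptedException {
            connectStarted.countDown();
            closeRequested.await();
            return false;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    Object message = backlog.take();
                    if (message == CLOSE_SENTINEL)
                        return; // normal shutdown path
                    if (!connect()) {
                        backlog.clear(); // also discards the close sentinel
                        // Assumed mitigation: re-check the flag after a failed connect.
                        if (checkStoppedAfterConnect && isStopped)
                            return;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Returns true if the thread is still alive after being closed. */
    static boolean threadLeaks(boolean checkStoppedAfterConnect) throws InterruptedException {
        Connection c = new Connection(checkStoppedAfterConnect);
        c.start();
        c.enqueue("gossip message");
        c.connectStarted.await(); // thread is now blocked inside connect()
        c.closeSocket();          // close races with the pending connect
        c.join(1000);             // give the thread time to exit if it can
        boolean leaked = c.isAlive();
        c.interrupt();            // clean up the modelled leak
        return leaked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("leaks without isStopped check: " + threadLeaks(false));
        System.out.println("leaks with isStopped check:    " + threadLeaks(true));
    }
}
```

In this model the thread without the extra check ends up parked in 
backlog.take() with isStopped=true and an empty queue, which matches the 
state of the leaked threads reported below. 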

Evidence from the heap dump of the affected machine which is leaking threads: 
1. Only the ack connection is leaking, not the command connection, which is 
not used by Gossip. 
2. The threads which have already leaked are blocked on the backlog queue, 
with isStopped=true and an empty backlog queue. 
3. A running thread was blocked in connect waiting for the timeout (10 
seconds), and its backlog queue contained the close sentinel. Once connect 
returns false, the backlog will be cleared and this thread will have leaked. 


Interesting bits from jstack: 
1282 threads named "MessagingService-Outgoing-/<IP-Address>"

Thread which is about to leak:
"MessagingService-Outgoing-/<IP Address>" 
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:454)
        at sun.nio.ch.Net.connect(Net.java:446)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
        - locked <> (a java.lang.Object)
        - locked <> (a java.lang.Object)
        - locked <> (a java.lang.Object)
        at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
        at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
        at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
        at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)

Thread already leaked:
"MessagingService-Outgoing-/<IP Address>"
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
        at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
        at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
