Benjamin Roth created CASSANDRA-12689:
-----------------------------------------

             Summary: All MutationStage threads blocked, kills server
                 Key: CASSANDRA-12689
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12689
             Project: Cassandra
          Issue Type: Bug
          Components: Local Write-Read Paths
            Reporter: Benjamin Roth
            Priority: Critical


Under heavy load (e.g. due to repair during normal operations), many 
NullPointerExceptions occur in MutationStage. Unfortunately, the log is not 
very verbose and the stack trace is missing:
2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught 
exception on thread Thread[MutationStage-1,5,main]: {}
2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null

Then, after some time, in most cases ALL threads in the MutationStage pool are 
completely blocked. Pending tasks pile up until the server runs OOM and becomes 
completely unresponsive due to GC. The threads NEVER unblock until the server 
is restarted, even if load drops completely, all hints are paused, and no 
compaction or repair is running. Only a restart helps.

I can understand that pending tasks in MutationStage may pile up under heavy 
load, but tasks should be processed and dequeued after load goes down. This is 
definitely not the case. This looks more like an unhandled exception 
leading to a stuck lock.
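
To illustrate the suspected failure mode, here is a minimal, self-contained 
sketch. It is not Cassandra code: the class StuckFutureSketch and the methods 
doWork, applyLeaky, applySafe and outcome are hypothetical names made up for 
this example. It only shows how an exception thrown before a CompletableFuture 
is completed can leave every waiter on that future blocked, and how completing 
the future exceptionally avoids that. Whether this is what actually happens 
in Mutation.apply is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureSketch {

    // Hypothetical stand-in for the work that should finish before the future completes.
    static void doWork(boolean fail) {
        if (fail) {
            throw new NullPointerException("simulated failure");
        }
    }

    // Leaky variant: if doWork throws, the future is never completed.
    static CompletableFuture<Void> applyLeaky(ExecutorService pool, boolean fail) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        pool.execute(() -> {
            doWork(fail);          // an exception here skips the next line
            done.complete(null);
        });
        return done;
    }

    // Safe variant: the future is always completed, normally or exceptionally.
    static CompletableFuture<Void> applySafe(ExecutorService pool, boolean fail) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        pool.execute(() -> {
            try {
                doWork(fail);
                done.complete(null);
            } catch (Throwable t) {
                done.completeExceptionally(t);
            }
        });
        return done;
    }

    // Waits with a timeout so the demo itself cannot hang.
    // Returns "completed", "failed", "stuck" or "interrupted".
    static String outcome(CompletableFuture<Void> future) {
        try {
            future.get(500, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed";
        } catch (TimeoutException e) {
            return "stuck";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println("leaky: " + outcome(applyLeaky(pool, true)));
            System.out.println("safe:  " + outcome(applySafe(pool, true)));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the leaky variant the uncaught exception is only reported by the worker 
thread, while the caller keeps waiting. A caller that waits without a timeout 
would never return in that case.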

Stack trace from jconsole; all threads in MutationStage show the same trace.

Name: MutationStage-48
State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
Total blocked: 137  Total waited: 138.513

Stack trace: 
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
org.apache.cassandra.hints.Hint.apply(Hint.java:96)
org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
java.lang.Thread.run(Thread.java:745)
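
As a rough alternative to clicking through every thread in jconsole, the same 
check can be sketched with the standard java.lang.management API. The class 
StageThreadStates and its methods are hypothetical names for this example, and 
the demo only inspects threads in its own JVM. Checking a running Cassandra 
node this way would need a remote JMX connection, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CompletableFuture;

public class StageThreadStates {

    // Counts live threads in this JVM whose name starts with namePrefix
    // and which are currently in the given state.
    static int countInState(String namePrefix, Thread.State state) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null
                    && info.getThreadName().startsWith(namePrefix)
                    && info.getThreadState() == state) {
                count++;
            }
        }
        return count;
    }

    // Demo only: starts n daemon threads that block on a future nobody
    // completes, polls until they are all WAITING, then releases them.
    static int demoBlockedCount(int n) {
        CompletableFuture<Void> gate = new CompletableFuture<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(gate::join, "DemoStage-" + i);
            t.setDaemon(true);
            t.start();
        }
        int seen = 0;
        try {
            for (int attempt = 0; attempt < 100 && seen < n; attempt++) {
                Thread.sleep(20);
                seen = countInState("DemoStage-", Thread.State.WAITING);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            gate.complete(null); // unblock the demo threads again
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("waiting DemoStage threads: " + demoBlockedCount(3));
    }
}
```

Against a real node, the prefix would be "MutationStage-". A count of WAITING 
threads equal to the pool size that does not drop after load is gone would 
match the behaviour described above.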



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
