Hello,
We are running Drill 1.11 (under YARN) on a 3-node cluster.
Occasionally a query remains stuck in the RUNNING state, even though the
same query has completed successfully on many other occasions. I did not
capture any information the previous times this happened, but on the
latest occurrence I collected the following:
- Full json profile
- Thread dumps on all three nodes
I can provide these if needed.
In the thread dumps, 107 threads are tagged with the query ID.
105 of them are stuck with the following stack trace:
2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
    - waiting on <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
    at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141)
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:406)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:357)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:302)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
    at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@45083904
The remaining 2 are stuck with this one:
2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
    - waiting on <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
    at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
    at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.getNext(MergingRecordBatch.java:147)
    at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.innerNext(MergingRecordBatch.java:241)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
    at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@378527f8
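
In case it helps anyone repeat the counting above on their own dumps, here is a rough Python sketch of how the fragment threads of one query could be tallied. It is not part of Drill. The function name summarize and the regular expressions are my own assumptions, based only on the header format seen in these dumps ("<query id>:frag:<major>:<minor> id=N state=S"). For each thread it records the innermost frame under org.apache.drill.exec.physical.impl, which is the operator the thread is parked in.

```python
import re
from collections import Counter

# Assumed header shape, taken from the dumps in this mail:
#   "<thread name> id=266 state=WAITING"
THREAD_HEADER = re.compile(r"^(?P<name>.+) id=\d+ state=(?P<state>[A-Z_]+)")
# Assumed fragment thread name: "<query id>:frag:<major>:<minor>"
FRAGMENT_NAME = re.compile(
    r"^(?P<query>[0-9a-f-]+):frag:(?P<major>\d+):(?P<minor>\d+)$"
)
# First frame inside Drill's physical operator package.
OPERATOR_FRAME = re.compile(
    r"^at org\.apache\.drill\.exec\.physical\.impl\.(?P<frame>[\w.$]+)\("
)


def summarize(dump_text, query_id):
    """Count fragment threads of query_id by (major fragment, state, operator frame)."""
    # Mail clients often wrap a frame as a lone "at" line followed by the
    # frame itself; rejoin such pairs before parsing.
    text = re.sub(r"(?m)^[ \t]*at[ \t]*\n[ \t]*", "at ", dump_text)

    counts = Counter()
    current = None  # (major, state) while inside a matching thread, else None
    for raw in text.splitlines():
        line = raw.strip()
        header = THREAD_HEADER.match(line)
        if header:
            name = FRAGMENT_NAME.match(header["name"])
            if name and name["query"] == query_id:
                current = (int(name["major"]), header["state"])
            else:
                current = None  # some other thread; ignore its frames
            continue
        frame = OPERATOR_FRAME.match(line)
        if current is not None and frame:
            counts[current + (frame["frame"],)] += 1
            current = None  # only the innermost operator frame is counted
    return counts


# Usage (the file name is hypothetical):
#   with open("threads-node1.txt") as f:
#       for key, n in summarize(f.read(), "2598df8d-8573-5e29-292c-fb343c99d280").items():
#           print(n, key)
```

Run against each node's dump, this gives one line per (major fragment, thread state, operator) combination, which makes it easy to see which receivers the stuck fragments are waiting in.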
Any help in figuring out what is going wrong would be appreciated.
Thanks in advance!
Thanks,
Lalit Mishra