Hi Lalit
Your profile suggests that the query is stuck in major fragment 06-xx-xx, 
which is fed data from 16-xx-xx via 11-Exchange.

Judging by the operators’ overview and the similarity with the other major 
fragments, only this one seems to be stuck completing the sort.

Could you provide the jstack output from any of the nodes that are hosting 
fragments of 06-xx-xx?
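
In case it helps to narrow a dump down to that fragment, here is a minimal 
sketch in Python. It assumes the dump is plain text and that thread headers 
look like the ones quoted in this thread 
(<query-id>:frag:<major>:<minor> id=<n> state=<STATE>). The helper name 
fragment_threads is hypothetical, not part of Drill or jstack.

```python
import re
from collections import Counter

# Thread-name header as it appears in the dumps quoted in this thread:
#   <query-id>:frag:<major>:<minor> id=<n> state=<STATE>
HEADER = re.compile(
    r"^(?P<query>[0-9a-f-]+):frag:(?P<major>\d+):(?P<minor>\d+)\s+"
    r"id=\d+\s+state=(?P<state>\w+)"
)

def fragment_threads(dump_text, query_id, major):
    """Return (minor_fragment_id, state) for each thread of one major fragment."""
    hits = []
    for line in dump_text.splitlines():
        # Drop mail quoting ("> ") and indentation so pasted excerpts also parse.
        m = HEADER.match(line.lstrip("> ").strip())
        if m and m["query"] == query_id and int(m["major"]) == major:
            hits.append((int(m["minor"]), m["state"]))
    return hits

# Two thread headers copied from the dumps in this thread.
sample = """\
2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
    at sun.misc.Unsafe.park(Native Method)
2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
    at sun.misc.Unsafe.park(Native Method)
"""
QUERY_ID = "2598df8d-8573-5e29-292c-fb343c99d280"
threads = fragment_threads(sample, QUERY_ID, 6)
print(threads)
print(Counter(state for _, state in threads))
```

Running it against the dump from each node should show which nodes host 
minor fragments of major fragment 6 and what state each of them is in.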

Thanks
Kunal

From: Lalit Mishra [mailto:[email protected]]
Sent: Thursday, January 25, 2018 4:03 AM
To: [email protected]
Subject: Re: Queries getting stuck in RUNNING state occasionally

Hello Timothy,

Please find attached the profile (it exceeded the message size limit, so I 
had to gzip it). Please excuse the length of the query; it is a long query 
unioned 5 times. I have tried to reproduce the problem with a smaller query, 
but have failed so far.

Yes, we are using MapR 6.0.

Thanks,
Lalit Mishra

On Thu, Jan 25, 2018 at 2:37 AM, Timothy Farkas <[email protected]> wrote:


On 2018/01/23 14:04:42, Lalit Mishra <[email protected]> wrote:
> Hello,
>
> We are using Drill 1.11 (under YARN) on a 3-node cluster.
> Occasionally a query remains stuck in the RUNNING state. The same
> query runs successfully on other occasions. I did not capture any
> information the previous times this occurred, but have collected the
> following on the latest occurrence -
>
>    - Full json profile
>    - Thread dumps on all three nodes
>
> I can provide these if needed.
>
> In the thread dumps there are 107 threads tagged with the query id.
> 105 of them are stuck with the following stack trace -
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
>     - waiting on <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141)
>     at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:406)
>     at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:357)
>     at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:302)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@45083904
>
>
> While 2 are stuck with -
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
>     - waiting on <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.getNext(MergingRecordBatch.java:147)
>     at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.innerNext(MergingRecordBatch.java:241)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@378527f8
>
>
> Any help with regards to figuring out what is going wrong will be
> appreciated. Thanks in advance!
>
> Thanks,
> Lalit Mishra
>
Hi Lalit,

The stack traces you provided indicate that downstream operators are waiting 
for data from upstream operators, which are themselves blocked. This could 
mean that a scan operator is blocked reading from a data source, or that an 
operator like Sort or HashAgg is getting stuck. Can you please provide 
the query you are using along with the json profile?
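
To illustrate the pattern only: the sketch below is a Python analogy, not 
Drill code, and all names in it are made up. A receiver thread takes from an 
empty blocking queue and stays parked until an upstream producer finally 
puts something in, which is the same shape as the 
LinkedBlockingDeque.take() frames in your dumps.

```python
import queue
import threading

def receiver(buffer, received):
    # Like a receiver waiting for its next batch: get() blocks until data arrives.
    received.append(buffer.get())

buffer = queue.Queue()
received = []
worker = threading.Thread(target=receiver, args=(buffer, received), daemon=True)
worker.start()

# Nothing has been sent yet, so the receiver is still parked after the timeout.
worker.join(timeout=0.5)
still_waiting = worker.is_alive()

# Once the upstream side sends a batch, the receiver wakes up and finishes.
buffer.put("batch-1")
worker.join(timeout=5)
finished = not worker.is_alive()

print(still_waiting, finished, received)
```

If the upstream side never sends, the receiver waits forever, so the 
waiting threads are a symptom and the cause has to be found further upstream.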

Also, please note that Apache Drill does not have YARN support yet; the PR is 
pending here: https://github.com/apache/drill/pull/1011
So are you using MapR's proprietary distribution of Drill?

Thanks,
Tim
