Jaume M created TEZ-4015:
Summary: Send killed diagnostics to the AM when ShuffleScheduler
calls killSelf
Key: TEZ-4015
URL: https://issues.apache.org/jira/browse/TEZ-4015
Project: Apache Tez
Issue Type: Improvement
Affects Versions: 0.9.1
Reporter: Jaume M
This can be useful for debugging. This is an example of the logs shown for a
particular vertex when it fails:
{code}
ERROR : FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer
2, vertexId=vertex_1540489363818_0021_2_03, diagnostics=[Task failed,
taskId=task_1540489363818_0021_2_03_35, diagnostics=[TaskAttempt 0 killed,
TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) :
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
error in shuffle in Fetcher_O {Map_1} #6
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:305)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch
failures and insufficient progress!failureCounts=5, pendingInputs=286,
fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1047)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:788)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:379)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:261)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:180)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:192)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:56)
... 7 more
, errorMessage=Shuffle Runner
Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
error in shuffle in Fetcher_O {Map_1} #6
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:305)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:287)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch
failures and insufficient progress!failureCounts=5, pendingInputs=286,
fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1047)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:788)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:379)
at