Dear Tez community,
a customer is running into following problem with a Tez-0.5 job.
On the client side it looks as follow:
- progress reporting shows that all tasks are complete:
(DagRunner.java:148) - Progress on 'Import job (26234): Raw History
ImportJob#import(Group by operation)#import(Group by operation)
(b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING:
- then there seems to be connectivity problems to the Application Master (AM):
INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to
server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is
…
INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to
server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- then the job appears to be failed:
INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG ‘XXX'
(application=application_1414525508751_5385) with status=FAILED
On the AM side it looks as follow:
- all tasks are completed:
2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex:
vertex_1414525508751_5385_1_01
- our implementation of org.apache.tez.runtime.api.OutputCommitter starts its
work
- the AM suddenly gets shut down:
2014-11-14 14:31:42,169 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests.
Container's idle timeout delay expired or is new. Releasing container,
containerId=container_1414525508751_5385_01_000056,
containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0,
heldContainers=1, delayedContainers=0, isNew=false
2014-11-14 14:31:42,170 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container:
container_1414525508751_5385_01_000056
2014-11-14 14:32:01,848 INFO [
Thread
-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook
invoked
2014-11-14 14:32:01,848 INFO [
Thread
-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a
signal. Signaling TaskScheduler
2014-11-14 14:32:01,848 INFO [
Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler:
TaskScheduler notified that iSignalled was : true
2014-11-14 14:32:01,849 INFO [Thread-1]
org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping
RecoveryService
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread]
org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take
interrupted. Returning
2014-11-14 14:32:01,897 INFO [
Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output
Stream for
DAG dag_1414525508751_5385_1
2014-11-14 14:32:01,905 INFO [
Thread
-1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService:
Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread]
org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue
take interrupted. Returning
2014-11-14 14:32:01,906 INFO [
Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop
request to the NM for
ContainerId: container_1414525508751_5385_01_000056
2014-11-14 14:32:01,913 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager
Thread
interrupted
2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler
Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM
Callback Handler Thread,5,main] threw an Throwable, but we are shutting down,
so ignoring this
org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
at
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
Caused by: org.apache.tez.dag.api.TezUncheckedException:
java.lang.InterruptedException
at
org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
at
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
at
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at
org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)
Any idea on what the problem could be ?
Is there a timeout for the OutputCommitter hits and the AM is killed by
resource-manager ?
Does the client kills the application via resource-manager when it can’t reach
the application master anymore ?
How big the chance is that the AM i unresponsive while its executing the
OutputCommitter ?
Any help appreciated!
Johannes