The signal to the AM could have come from the NodeManager if the AM
exceeded its memory limits. Do you have YARN memory enforcement enabled on
this cluster? You could look at the NM logs to see whether this was the case.
Otherwise, check the RM logs to figure out when and if a KILL request for the
app was sent.
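
For reference, the NM-side memory checks are usually controlled by the
standard yarn-site.xml properties sketched below (stock property names and
defaults; your distro may differ), and the aggregated container logs can be
pulled with the yarn CLI. The grep pattern is only an illustration.

    <!-- yarn-site.xml: NM kills containers that exceed their physical memory limit -->
    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>true</value>
    </property>
    <!-- NM kills containers whose virtual memory exceeds allocation * vmem-pmem-ratio -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>2.1</value>
    </property>

    # assumes log aggregation is enabled; application id taken from the AM log below
    yarn logs -applicationId application_1414525508751_5385 | grep -i kill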

On Mon, Nov 17, 2014 at 6:45 AM, Johannes Zillmann <[email protected]> wrote:

> Dear Tez community,
>
> a customer is running into the following problem with a Tez-0.5 job.
>
> On the client side it looks as follows:
>
> - progress reporting shows that all tasks are complete:
>         (DagRunner.java:148) - Progress on 'Import job (26234): Raw
> History ImportJob#import(Group by operation)#import(Group by operation)
> (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING:
> - then there seem to be connectivity problems to the Application Master
> (AM):
>         INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying
> connect to server: master/172.30.31.27:34275. Already tried 0 time(s);
> retry policy is
>         …
>         INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying
> connect to server: master/172.30.31.27:34275. Already tried 9 time(s);
> retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
> sleepTime=1000 MILLISECONDS)
> - then the job appears to have failed:
>         INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG
> 'XXX' (application=application_1414525508751_5385) with status=FAILED
>
>
> On the AM side it looks as follows:
> - all tasks are completed:
>         2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler]
> org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for
> vertex: vertex_1414525508751_5385_1_01
> - our implementation of org.apache.tez.runtime.api.OutputCommitter starts
> its work
> - the AM suddenly gets shut down:
>         2014-11-14 14:31:42,169 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests.
> Container's idle timeout delay expired or is new. Releasing container,
> containerId=container_1414525508751_5385_01_000056,
> containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=1, delayedContainers=0, isNew=false
>         2014-11-14 14:31:42,170 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused
> container: container_1414525508751_5385_01_000056
>         2014-11-14 14:32:01,848 INFO [Thread-1]
> org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook invoked
>         2014-11-14 14:32:01,848 INFO [Thread-1]
> org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal.
> Signaling TaskScheduler
>         2014-11-14 14:32:01,848 INFO [Thread-1]
> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler
> notified that iSignalled was : true
>         2014-11-14 14:32:01,849 INFO [Thread-1]
> org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
>         2014-11-14 14:32:01,849 INFO [Thread-1]
> org.apache.tez.dag.history.recovery.RecoveryService: Stopping
> RecoveryService
>         2014-11-14 14:32:01,849 INFO [Thread-1]
> org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary
> Stream
>         2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread]
> org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take
> interrupted. Returning
>         2014-11-14 14:32:01,897 INFO [Thread-1]
> org.apache.tez.dag.history.recovery.RecoveryService: Closing Output
> Stream for DAG dag_1414525508751_5385_1
>         2014-11-14 14:32:01,905 INFO [Thread-1]
> org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService:
> Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
>         2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread]
> org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService:
> EventQueue take interrupted. Returning
>         2014-11-14 14:32:01,906 INFO [Thread-1]
> org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop
> request to the NM for ContainerId: container_1414525508751_5385_01_000056
>         2014-11-14 14:32:01,913 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService:
> AllocatedContainerManager Thread interrupted
>         2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler Thread]
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
> Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are
> shutting down, so ignoring this
>
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
> org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
> Caused by: org.apache.tez.dag.api.TezUncheckedException:
> java.lang.InterruptedException
>         at
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
>         at
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
> Caused by: java.lang.InterruptedException
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)
>
> Any idea what the problem could be?
> Is there a timeout that the OutputCommitter hits, so that the AM is killed
> by the ResourceManager?
> Does the client kill the application via the ResourceManager when it can’t
> reach the ApplicationMaster anymore?
> How big is the chance that the AM is unresponsive while it's executing the
> OutputCommitter?
>
> Any help appreciated!
> Johannes
