The signal to the AM could have come from the NodeManager if the AM exceeded its memory limits. Do you have YARN memory enforcement enabled on this cluster? You could check the NM logs to see if that was the case. Otherwise, check the RM logs to figure out when and whether a KILL request for the app was sent.
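As a rough starting point, something along these lines should show whether the NM killed the container for memory or the RM killed the app. The config path, log file names, and exact log wording are guesses and vary by Hadoop version/distribution, so treat them as such:

  # Is NM memory enforcement on? (pmem/vmem checks both default to true)
  grep -A1 -E "pmem-check-enabled|vmem-check-enabled" /etc/hadoop/conf/yarn-site.xml

  # On the node that ran the AM (master/172.30.31.27 per your client log),
  # look for a memory-limit kill message in the NM log
  grep -i "beyond.*memory limits" yarn-*-nodemanager-*.log

  # In the RM log, look for kill/finish events for this application
  grep "application_1414525508751_5385" yarn-*-resourcemanager-*.log | grep -iE "kill|finish"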
On Mon, Nov 17, 2014 at 6:45 AM, Johannes Zillmann <[email protected]> wrote:
> Dear Tez community,
>
> a customer is running into the following problem with a Tez-0.5 job.
>
> On the client side it looks as follows:
>
> - progress reporting shows that all tasks are complete:
>     (DagRunner.java:148) - Progress on 'Import job (26234): Raw History ImportJob#import(Group by operation)#import(Group by operation) (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING:
> - then there seem to be connectivity problems to the Application Master (AM):
>     INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is
>     …
>     INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> - then the job appears to have failed:
>     INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG 'XXX' (application=application_1414525508751_5385) with status=FAILED
>
> On the AM side it looks as follows:
> - all tasks are completed:
>     2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex: vertex_1414525508751_5385_1_01
> - our implementation of org.apache.tez.runtime.api.OutputCommitter starts its work
> - the AM suddenly gets shut down:
>     2014-11-14 14:31:42,169 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_1414525508751_5385_01_000056, containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0, heldContainers=1, delayedContainers=0, isNew=false
>     2014-11-14 14:31:42,170 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container: container_1414525508751_5385_01_000056
>     2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook invoked
>     2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
>     2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
>     2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
>     2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping RecoveryService
>     2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
>     2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread] org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take interrupted. Returning
>     2014-11-14 14:32:01,897 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output Stream for DAG dag_1414525508751_5385_1
>     2014-11-14 14:32:01,905 INFO [Thread-1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
>     2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue take interrupted. Returning
>     2014-11-14 14:32:01,906 INFO [Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1414525508751_5385_01_000056
>     2014-11-14 14:32:01,913 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
>     2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are shutting down, so ignoring this
>     org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
>     Caused by: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
>         at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
>         at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
>     Caused by: java.lang.InterruptedException
>         at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:187)
>         at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)
>
> Any idea what the problem could be?
> Is there a timeout that the OutputCommitter hits, after which the AM is killed by the resource manager?
> Does the client kill the application via the resource manager when it can't reach the application master anymore?
> How big is the chance that the AM is unresponsive while it is executing the OutputCommitter?
>
> Any help appreciated!
> Johannes
