[
https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206616#comment-14206616
]
Jason Lowe commented on YARN-2846:
----------------------------------
Thanks for the report and patch, Junping!
Nit: If reacquireContainer is going to allow InterruptedException to be thrown
then I'd rather remove the try/catch around the Thread.sleep call and just let
the exception be thrown directly from there. We can let the code catching the
exception deal with any logging/etc as appropriate for that caller. In this
case we can move the log message to RecoveredContainerLaunch when it fields the
InterruptedException and chooses not to propagate it upwards.
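A minimal sketch of the shape suggested above, not the actual patch. The class name, the method names `reacquire` and `recoverContainer`, the `BooleanSupplier` liveness check, and the placeholder exit code are all hypothetical stand-ins for the real ContainerExecutor and RecoveredContainerLaunch code. The point is only that the polling loop has no try/catch around Thread.sleep, and that the caller decides how to log and chooses not to record an exit code.

```java
import java.util.OptionalInt;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in; not the real ContainerExecutor/RecoveredContainerLaunch.
class ReacquireSketch {

    // Polls until the process is gone. There is no try/catch around
    // Thread.sleep, so an interrupt propagates straight to the caller.
    static int reacquire(BooleanSupplier processAlive, long pollMillis)
            throws InterruptedException {
        while (processAlive.getAsBoolean()) {
            Thread.sleep(pollMillis);
        }
        return 0; // placeholder: real code would read the container's exit code file
    }

    // The caller fields the InterruptedException, logs it, and does not
    // propagate it upwards. An empty result means "record no exit code".
    static OptionalInt recoverContainer(BooleanSupplier processAlive) {
        try {
            return OptionalInt.of(reacquire(processAlive, 10L));
        } catch (InterruptedException e) {
            System.err.println("Interrupted while waiting for container to exit");
            Thread.currentThread().interrupt(); // keep the interrupt visible to the thread's owner
            return OptionalInt.empty();
        }
    }
}
```

With this split, a caller that wants different logging or wants to rethrow can do so without reacquire needing to know about it.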
I'm curious why we're not seeing a similar issue with regular ContainerLaunch
threads, as they should be interrupted as well. Are those threads silently
swallowing the interrupt? Because otherwise I would expect us to log a
container completion just like we were doing with a recovered container.
> Incorrect persist exit code for running containers in reacquireContainer()
> that interrupted by NodeManager restart.
> -------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2846
> URL: https://issues.apache.org/jira/browse/YARN-2846
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Blocker
> Attachments: YARN-2846-demo.patch
>
>
> With work-preserving NM restart enabled, a running AM container can be marked
> LOST and killed when the NM daemon is stopped. The NM log shows the following:
> {code}
> 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for
> container-id container_1415666714233_0001_01_000084: 53.8 MB of 512 MB
> physical memory used; 931.3 MB of 1.0 GB virtual memory used
> 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager
> (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
> 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped
> [email protected]:50060
> 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) -
> Applications still running : [application_1415666714233_0001]
> 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping
> server on 45454
> 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping
> IPC Server listener on 45454
> 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService
> (LogAggregationService.java:serviceStop(141)) -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
> waiting for pending aggregation during exit
> 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping
> IPC Server Responder
> 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log
> aggregation for application_1415666714233_0001
> 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for
> application application_1415666714233_0001
> 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(476)) -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch
> (RecoveredContainerLaunch.java:call(87)) - Unable to recover container
> container_1415666714233_0001_01_000001
> java.io.IOException: Interrupted while waiting for process 20001 to exit
> at
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
> ... 6 more
> {code}
> In reacquireContainer() of ContainerExecutor.java, the while loop that checks
> the container process (here the AM container) is interrupted when the NM
> stops. An IOException is thrown without an ExitCodeFile having been generated
> for the running container. The IOException is then caught by the caller
> (RecoveredContainerLaunch.call()), and the exit code, which defaults to LOST
> when it has not been set, is persisted in the NMStateStore.
> After the NM restarts, the container is recovered in the COMPLETE state with
> exit code LOST (154), which causes this (AM) container to be killed later.
> We should not record an exit code for a running container when the wait on
> its process is interrupted.