[
https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212156#comment-14212156
]
Hudson commented on YARN-2846:
------------------------------
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/])
YARN-2846. Incorrect persist exit code for running containers in
reacquireContainer() that interrupted by NodeManager restart. Contributed by
Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
> Incorrect persist exit code for running containers in reacquireContainer()
> that interrupted by NodeManager restart.
> -------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2846
> URL: https://issues.apache.org/jira/browse/YARN-2846
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Blocker
> Fix For: 2.6.0
>
> Attachments: YARN-2846-demo.patch, YARN-2846.patch
>
>
> The work-preserving NM restart feature can cause a running AM container to be
> marked LOST and killed when the NM daemon is stopped. The log below shows the
> exception:
> {code}
> 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for
> container-id container_1415666714233_0001_01_000084: 53.8 MB of 512 MB
> physical memory used; 931.3 MB of 1.0 GB virtual memory used
> 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager
> (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
> 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped
> [email protected]:50060
> 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) -
> Applications still running : [application_1415666714233_0001]
> 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping
> server on 45454
> 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping
> IPC Server listener on 45454
> 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService
> (LogAggregationService.java:serviceStop(141)) -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
> waiting for pending aggregation during exit
> 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping
> IPC Server Responder
> 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log
> aggregation for application_1415666714233_0001
> 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl
> (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for
> application application_1415666714233_0001
> 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(476)) -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch
> (RecoveredContainerLaunch.java:call(87)) - Unable to recover container
> container_1415666714233_0001_01_000001
> java.io.IOException: Interrupted while waiting for process 20001 to exit
> at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
> ... 6 more
> {code}
> In reacquireContainer() of ContainerExecutor.java, the while loop that polls
> the container process (here the AM container) is interrupted when the NM
> stops. An IOException is thrown, and no ExitCodeFile is generated for the
> still-running container. The IOException is then caught by the caller
> (RecoveredContainerLaunch.call()), and the exit code, which defaults to LOST
> when nothing else has been set, is persisted in the NMStateStore.
> After the NM restarts, this container is recovered in the COMPLETE state with
> exit code LOST (154), which causes the (AM) container to be killed later.
> We should not record an exit code for a running container when the wait on
> its process is interrupted.
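
The behavior proposed in the description can be illustrated with a small, self-contained Java sketch. It is not the committed patch: the class ReacquireSketch and the methods reacquire() and exitCodeToPersist() are hypothetical stand-ins for ContainerExecutor.reacquireContainer() and RecoveredContainerLaunch.call(), and the process check is reduced to a BooleanSupplier. The point it shows is that an interrupted wait yields no exit code at all, so nothing (in particular not LOST) would be written to the state store for a container that is still running.

```java
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names here are illustrative, not the YARN API.
class ReacquireSketch {

    // Stand-in for the polling loop in reacquireContainer(): waits until the
    // process is gone, then returns the exit code it left behind.
    // InterruptedException propagates instead of being wrapped in IOException.
    static int reacquire(BooleanSupplier processAlive, int exitCodeFromFile,
                         long pollMillis) throws InterruptedException {
        while (processAlive.getAsBoolean()) {
            Thread.sleep(pollMillis);
        }
        return exitCodeFromFile;
    }

    // Stand-in for the caller: returns the exit code to persist, or empty
    // when the wait was interrupted and nothing should be persisted.
    static Optional<Integer> exitCodeToPersist(BooleanSupplier processAlive,
                                               int exitCodeFromFile) {
        try {
            return Optional.of(reacquire(processAlive, exitCodeFromFile, 10L));
        } catch (InterruptedException e) {
            // The container may still be running; restore the interrupt flag
            // and record no exit code instead of defaulting to LOST (154).
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Simulate an NM stop: the waiting thread is interrupted while the
        // container process is still alive.
        Thread.currentThread().interrupt();
        System.out.println(exitCodeToPersist(() -> true, 0).isPresent()); // prints false

        // Clear the flag, then simulate a process that has already exited.
        Thread.interrupted();
        System.out.println(exitCodeToPersist(() -> false, 0).get()); // prints 0
    }
}
```

Returning an empty Optional keeps the "no exit code yet" case distinct from every real exit code, so a later recovery pass can still treat the container as running.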