[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212378#comment-14212378 ]
Hudson commented on YARN-2846:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/])
YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java

> Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2846
>                 URL: https://issues.apache.org/jira/browse/YARN-2846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>             Fix For: 2.6.0
>
>         Attachments: YARN-2846-demo.patch, YARN-2846.patch
>
>
> The work-preserving NM restart feature can cause a running AM container to be
> marked LOST and killed when the NM daemon is stopped.
> The exception looks like this:
> {code}
> 2014-11-11 00:48:35,214 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_000084: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used
> 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
> 2014-11-11 00:48:35,299 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060
> 2014-11-11 00:48:35,337 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001]
> 2014-11-11 00:48:35,338 INFO  ipc.Server (Server.java:stop(2437)) - Stopping server on 45454
> 2014-11-11 00:48:35,344 INFO  ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454
> 2014-11-11 00:48:35,346 INFO  logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
> 2014-11-11 00:48:35,347 INFO  ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder
> 2014-11-11 00:48:35,347 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001
> 2014-11-11 00:48:35,348 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001
> 2014-11-11 00:48:35,358 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_000001
> java.io.IOException: Interrupted while waiting for process 20001 to exit
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException: sleep interrupted
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
>         ... 6 more
> {code}
> In reacquireContainer() of ContainerExecutor.java, the while loop that checks
> the container process (here the AM container) is interrupted when the NM
> stops. An IOException is thrown, and no ExitCodeFile is generated for the
> running container. The IOException is then caught by the caller
> (RecoveredContainerLaunch.call()), and the exit code, which defaults to LOST
> when nothing has set it, is persisted in the NMStateStore.
> After the NM restarts, this container is recovered in the COMPLETE state but
> with exit code LOST (154), which causes the (AM) container to be killed later.
> We should not record an exit code for running containers when the wait on the
> process is interrupted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
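
The behavior the issue asks for can be sketched in isolation. The following is a minimal, self-contained model and not the committed patch: the class `ReacquireSketch`, the `stateStore` map, and the methods `waitForExit` and `recover` are hypothetical stand-ins for the NM state store, `ContainerExecutor.reacquireContainer()` and `RecoveredContainerLaunch.call()`. Only the LOST value of 154 is taken from the issue text. The idea illustrated is that an interrupted wait skips persisting any exit code, so the container is not later recovered as COMPLETE with LOST.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class ReacquireSketch {
    /** Exit code the issue reports for a container recovered as LOST. */
    static final int LOST = 154;

    /** Hypothetical stand-in for the NM state store: container id -> exit code. */
    static final Map<String, Integer> stateStore = new HashMap<>();

    /**
     * Hypothetical wait loop: polls until the process is no longer alive, then
     * returns its exit code. An interrupt surfaces as InterruptedException
     * instead of being wrapped in an IOException.
     */
    static int waitForExit(BooleanSupplier alive, int exitCode)
            throws InterruptedException {
        while (alive.getAsBoolean()) {
            Thread.sleep(10);
        }
        return exitCode;
    }

    /** Persists an exit code only if the wait finished without interruption. */
    static void recover(String containerId, BooleanSupplier alive, int exitCode) {
        int retCode = LOST; // default when nothing has set the exit code
        boolean interrupted = false;
        try {
            retCode = waitForExit(alive, exitCode);
        } catch (InterruptedException e) {
            interrupted = true;
            Thread.currentThread().interrupt(); // restore the flag for callers
        } finally {
            if (!interrupted) {
                stateStore.put(containerId, retCode);
            }
        }
    }

    public static void main(String[] args) {
        // Simulate NM shutdown: the thread is interrupted while the process runs.
        Thread.currentThread().interrupt();
        recover("container_running", () -> true, 0);
        Thread.interrupted(); // clear the flag before the next call

        // A process that has already exited gets its real exit code persisted.
        recover("container_exited", () -> false, 0);

        System.out.println(stateStore);
    }
}
```

In this model, after `main` runs, the map holds an entry for `container_exited` only; nothing is stored for the container whose wait was interrupted.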