[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690461#comment-13690461
 ] 

Chris Riccomini commented on YARN-864:
--------------------------------------

Hey Vinod,

So here's what I know:

1. The RM is marking the NM as lost, and telling the NM to shut down.
2. The NM is trying to shut down, but appears unable to kill all containers:

{noformat}
09:37:46,132  INFO NodeStatusUpdaterImpl:365 - Node is out of sync with 
ResourceManager, hence rebooting.
09:37:47,932  INFO NodeManager:219 - Containers still running on shutdown: 
[container_1371756883564_0001_01_000003, 
container_1371756883564_0002_01_000002, container_1371756883564_0003_01_000005, 
container_1371756883564_0004_01_000001, container_1371756883564_0004_01_000002]
09:37:47,955  INFO NMAuditLogger:89 - USER=jhoman       IP=172.18.146.129       
OPERATION=Stop Container Request        TARGET=ContainerManageImpl      
RESULT=SUCCESS  APPID=application_1371756883564_0004    
CONTAINERID=container_1371756883564_0004_01_000001
09:37:47,970  INFO NodeManager:226 - Waiting for containers to be killed
09:37:47,970  WARN Server:997 - IPC Server Responder, call 
org.apache.hadoop.yarn.api.ContainerManagerPB.stopContainer from 
172.18.146.129:46200: output error
09:37:47,973  INFO Container:835 - Container 
container_1371756883564_0004_01_000001 transitioned from RUNNING to KILLING
09:37:47,989  INFO ContainerLaunch:300 - Cleaning up container 
container_1371756883564_0004_01_000001
09:37:47,989  INFO Server:1797 - IPC Server handler 6 on 45454 caught an 
exception
java.nio.channels.ClosedChannelException
        at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2200)
        at org.apache.hadoop.ipc.Server.access$2000(Server.java:113)
        at 
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:946)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1012)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1790)
09:37:48,837  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
29897 for container-id container_1371756883564_0003_01_000005: 4.2 GB of 6.1 GB 
physical memory used; 5.4 GB of 49 GB virtual memory used
09:37:48,863  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
28967 for container-id container_1371756883564_0004_01_000002: 16.4 GB of 19.3 
GB physical memory used; 17.0 GB of 154 GB virtual memory used
09:37:48,904  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
27555 for container-id container_1371756883564_0001_01_000003: 601.9 MB of 768 
MB physical memory used; 1.1 GB of 6 GB virtual memory used
09:37:48,928  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
27559 for container-id container_1371756883564_0002_01_000002: 731.7 MB of 1.3 
GB physical memory used; 1.6 GB of 10 GB virtual memory used
09:37:48,953  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
28890 for container-id container_1371756883564_0004_01_000001: 146.5 MB of 512 
MB physical memory used; 726.6 MB of 4 GB virtual memory used
09:37:51,995  INFO NodeManager:242 - Done waiting for containers to be killed. 
Still alive: [container_1371756883564_0001_01_000003, 
container_1371756883564_0002_01_000002, container_1371756883564_0003_01_000005, 
container_1371756883564_0004_01_000001, container_1371756883564_0004_01_000002]
09:37:52,054  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
29897 for container-id container_1371756883564_0003_01_000005: 4.2 GB of 6.1 GB 
physical memory used; 5.4 GB of 49 GB virtual memory used
09:37:52,991  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is 
stopped.
09:37:53,095  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
28967 for container-id container_1371756883564_0004_01_000002: 16.4 GB of 19.3 
GB physical memory used; 17.0 GB of 154 GB virtual memory used
09:37:53,184  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
27555 for container-id container_1371756883564_0001_01_000003: 602.0 MB of 768 
MB physical memory used; 1.1 GB of 6 GB virtual memory used
09:37:53,271  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
27559 for container-id container_1371756883564_0002_01_000002: 731.7 MB of 1.3 
GB physical memory used; 1.6 GB of 10 GB virtual memory used
09:37:53,310  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1143)
        at java.lang.Thread.join(Thread.java:1196)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
        at 
org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at 
org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
        at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
        at java.lang.Thread.run(Thread.java:619)
09:37:53,318  INFO AbstractService:113 - Service:Dispatcher is stopped.
09:37:53,759  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
28890 for container-id container_1371756883564_0004_01_000001: 146.5 MB of 512 
MB physical memory used; 726.6 MB of 4 GB virtual memory used
09:37:54,148  INFO log:67 - Stopped [email protected]:9999
09:37:54,250  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
09:37:54,250  INFO Server:2060 - Stopping server on 45454
09:37:54,452  INFO Server:654 - Stopping IPC Server listener on 45454
09:37:54,493  INFO Server:796 - Stopping IPC Server Responder
09:37:55,179  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
 is stopped.
09:37:55,626  WARN LinuxContainerExecutor:245 - Exit code from container is : 
137
09:37:55,987  INFO AbstractService:113 - Service:Dispatcher is stopped.
09:37:55,988  WARN ContainersMonitorImpl:463 - 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 is interrupted. Exiting.
09:37:55,988  INFO AbstractService:113 - Service:containers-monitor is stopped.
09:37:55,988  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices 
is stopped.
09:37:55,989  INFO AbstractService:113 - Service:containers-launcher is stopped.
09:37:55,989  INFO Server:2060 - Stopping server on 4344
09:37:56,028  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: 
/cgroup/cpu/hadoop-yarn/container_1371756883564_0004_01_000002
09:37:56,028  WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
09:37:56,107  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: 
/cgroup/cpu/hadoop-yarn/container_1371756883564_0003_01_000005
09:37:56,169  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: 
/cgroup/cpu/hadoop-yarn/container_1371756883564_0001_01_000003
09:37:56,169  WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
09:37:56,107  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: 
/cgroup/cpu/hadoop-yarn/container_1371756883564_0002_01_000002
09:37:56,169  WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
09:37:56,170  WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
        at org.apache.hadoop.util.Shell.run(Shell.java:129)
        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
        at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
09:37:56,176  INFO Server:654 - Stopping IPC Server listener on 4344
09:37:56,199  INFO Server:796 - Stopping IPC Server Responder
09:37:56,327  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
 is stopped.
09:37:56,327  INFO ResourceLocalizationService:689 - Public cache exiting
09:37:56,328  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
 is stopped.
09:37:56,328  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
 is stopped.
09:37:56,328  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is 
stopped.
09:37:56,328  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService is 
stopped.
09:37:56,328  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService is 
stopped.
09:37:56,440  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.DeletionService is stopped.
09:37:56,440  INFO AbstractService:113 - 
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is stopped.
09:37:56,441  INFO MetricsSystemImpl:200 - Stopping NodeManager metrics 
system...
09:37:56,441  INFO MetricsSystemImpl:206 - NodeManager metrics system stopped.
09:37:56,442  INFO MetricsSystemImpl:572 - NodeManager metrics system shutdown 
complete.
{noformat}

3. The NM reboots itself:

09:37:56,441  INFO MetricsSystemImpl:206 - NodeManager metrics system stopped.
09:37:56,442  INFO MetricsSystemImpl:572 - NodeManager metrics system shutdown 
complete.
09:37:56,442  INFO NodeManager:304 - Rebooting the node manager.


                
> YARN NM leaking containers with CGroups
> ---------------------------------------
>
>                 Key: YARN-864
>                 URL: https://issues.apache.org/jira/browse/YARN-864
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>            Reporter: Chris Riccomini
>         Attachments: rm-log
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> mis-understanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_000002 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1143)
>         at java.lang.Thread.join(Thread.java:1196)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
>         at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
>         at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> And, if I look on the machine that's running 
> container_1371141151815_0008_03_000002, I see:
> {noformat}
> $ ps -ef | grep container_1371141151815_0008_03_000002
> criccomi  5365 27915 38 Jun18 ?        21:35:05 
> /export/apps/jdk/JDK-1_6_0_21/bin/java -cp 
> /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
> {noformat}
> The same holds true for container_1371141151815_0006_01_001598. When I look 
> in the container logs, it's just happily running. No kill signal appears to 
> be sent, and no error appears.
> Lastly, the RM logs show no major events around the time of the leak 
> (5:35am). I am able to reproduce this simply by waiting about 12 hours, or 
> so, and it seems to have started happening after I switched over to CGroups 
> and LCE, and turned on stateful RM (using file system).
> Any ideas what's going on?
> Thanks!
> Chris

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to