[
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691294#comment-13691294
]
Chris Riccomini commented on YARN-864:
--------------------------------------
Hey Guys,
Container leaking still seems to be happening. [~ojoshi], here are the logs you
asked for:
{noformat}
10:28:38,753 INFO NodeStatusUpdaterImpl:365 - Node is out of sync with
ResourceManager, hence rebooting.
10:28:40,306 INFO NMAuditLogger:89 - USER=criccomi IP=172.18.146.129
OPERATION=Stop Container Request TARGET=ContainerManageImpl
RESULT=SUCCESS APPID=application_1371849977601_0001
CONTAINERID=container_1371849977601_0001_02_000001
10:28:40,345 INFO NodeManager:229 - Containers still running on shutdown:
[container_1371849977601_0001_02_000001,
container_1371849977601_0001_02_000003, container_1371849977601_0002_02_000003,
container_1371849977601_0003_01_000004, container_1371849977601_0004_01_000002]
10:28:40,355 INFO Container:835 - Container
container_1371849977601_0001_02_000001 transitioned from RUNNING to KILLING
10:28:40,375 INFO ContainerLaunch:300 - Cleaning up container
container_1371849977601_0001_02_000001
10:28:40,376 INFO NodeManager:236 - Waiting for containers to be killed
10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state:
C_RUNNING, diagnostics: "Container killed by the ApplicationMaster.\n",
exit_status: -1000,
10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 2,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 3,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 4,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state:
C_RUNNING, diagnostics: "Container killed by the ApplicationMaster.\n",
exit_status: -1000,
10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 2,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 3,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:41,379 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 4,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:41,555 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 4230
for container-id container_1371849977601_0001_02_000001: 161.0 MB of 512 MB
physical memory used; 726.2 MB of 4 GB virtual memory used
10:28:41,802 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 4324
for container-id container_1371849977601_0001_02_000003: 522.9 MB of 768 MB
physical memory used; 1.1 GB of 6 GB virtual memory used
10:28:41,844 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 5717
for container-id container_1371849977601_0002_02_000003: 608.3 MB of 1.3 GB
physical memory used; 1.6 GB of 10 GB virtual memory used
10:28:41,869 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 26908 for container-id
container_1371849977601_0004_01_000002: 16.4 GB of 19.3 GB physical memory
used; 17.0 GB of 154 GB virtual memory used
10:28:41,896 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree
27868 for container-id container_1371849977601_0003_01_000004: 4.2 GB of 6.1 GB
physical memory used; 5.4 GB of 49 GB virtual memory used
10:28:42,186 WARN LinuxContainerExecutor:245 - Exit code from container is :
137
10:28:42,382 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state:
C_RUNNING, diagnostics: "Container killed by the ApplicationMaster.\nContainer
killed on request. Exit code is 137\n", exit_status: -1000,
10:28:42,383 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:42,383 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 2,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:42,383 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 3,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:42,383 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 4,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:43,389 INFO Container:835 - Container
container_1371849977601_0001_02_000001 transitioned from KILLING to
CONTAINER_CLEANEDUP_AFTER_KILL
10:28:43,390 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state:
C_RUNNING, diagnostics: "Container killed by the ApplicationMaster.\nContainer killed on
request. Exit code is 137\n", exit_status: 137,
10:28:43,390 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 1,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:43,390 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 2,
cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:43,390 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 3,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:43,390 INFO NodeStatusUpdaterImpl:265 - Sending out status for
container: container_id {, app_attempt_id {, application_id {, id: 4,
cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state:
C_RUNNING, diagnostics: "", exit_status: -1000,
10:28:43,448 INFO NMAuditLogger:89 - USER=criccomi OPERATION=Container
Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS
APPID=application_1371849977601_0001
CONTAINERID=container_1371849977601_0001_02_000001
10:28:43,468 INFO LinuxContainerExecutor:308 - Deleting absolute path :
/path/to/yarn-data/usercache/criccomi/appcache/application_1371849977601_0001/container_1371849977601_0001_02_000001
10:28:43,481 INFO LinuxContainerExecutor:318 - -- DEBUG -- deleteAsUser:
[/path/to/yarn/i001/bin/container-executor, criccomi, 3,
/path/to/yarn-data/usercache/criccomi/appcache/application_1371849977601_0001/container_1371849977601_0001_02_000001]
10:28:43,556 INFO Container:835 - Container
container_1371849977601_0001_02_000001 transitioned from
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
10:28:43,666 INFO Application:321 - Removing
container_1371849977601_0001_02_000001 from application
application_1371849977601_0001
10:28:44,391 INFO NodeManager:253 - Done waiting for containers to be killed.
Still alive: [container_1371849977601_0001_02_000001,
container_1371849977601_0001_02_000003, container_1371849977601_0002_02_000003,
container_1371849977601_0003_01_000004, container_1371849977601_0004_01_000002]
10:28:44,559 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is
stopped.
10:28:44,861 WARN AsyncDispatcher:109 - Interrupted Exception while stopping
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1143)
at java.lang.Thread.join(Thread.java:1196)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
at
org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
at
org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:219)
at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:347)
at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:619)
10:28:45,162 INFO AbstractService:113 - Service:Dispatcher is stopped.
10:28:44,913 INFO ContainersMonitorImpl:347 - Stopping resource-monitoring for
container_1371849977601_0001_02_000001
10:28:45,196 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 4324
for container-id container_1371849977601_0001_02_000003: 522.9 MB of 768 MB
physical memory used; 1.1 GB of 6 GB virtual memory used
10:28:45,230 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 5717
for container-id container_1371849977601_0002_02_000003: 608.3 MB of 1.3 GB
physical memory used; 1.6 GB of 10 GB virtual memory used
10:28:45,266 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree
26908 for container-id container_1371849977601_0004_01_000002: 16.4 GB of 19.3
GB physical memory used; 17.0 GB of 154 GB virtual memory used
10:28:45,291 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree
27868 for container-id container_1371849977601_0003_01_000004: 4.2 GB of 6.1 GB
physical memory used; 5.4 GB of 49 GB virtual memory used
10:28:45,608 INFO log:67 - Stopped [email protected]:9999
10:28:45,709 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
10:28:45,709 INFO Server:2060 - Stopping server on 45454
10:28:45,718 INFO Server:654 - Stopping IPC Server listener on 45454
10:28:45,759 INFO Server:796 - Stopping IPC Server Responder
10:28:45,977 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
is stopped.
10:28:45,978 INFO AbstractService:113 - Service:Dispatcher is stopped.
10:28:45,978 WARN ContainersMonitorImpl:463 -
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
is interrupted. Exiting.
10:28:45,978 INFO AbstractService:113 - Service:containers-monitor is stopped.
10:28:45,978 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices
is stopped.
10:28:45,979 INFO AbstractService:113 - Service:containers-launcher is stopped.
10:28:45,979 INFO Server:2060 - Stopping server on 4344
10:28:45,995 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at:
/cgroup/cpu/hadoop-yarn/container_1371849977601_0002_02_000003
10:28:45,996 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
at org.apache.hadoop.util.Shell.run(Shell.java:129)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
10:28:46,076 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at:
/cgroup/cpu/hadoop-yarn/container_1371849977601_0003_01_000004
10:28:46,076 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
at org.apache.hadoop.util.Shell.run(Shell.java:129)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
10:28:46,076 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at:
/cgroup/cpu/hadoop-yarn/container_1371849977601_0004_01_000002
10:28:46,076 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
at org.apache.hadoop.util.Shell.run(Shell.java:129)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
10:28:46,109 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at:
/cgroup/cpu/hadoop-yarn/container_1371849977601_0001_02_000003
10:28:46,109 WARN ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
at org.apache.hadoop.util.Shell.run(Shell.java:129)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
10:28:46,125 INFO Server:796 - Stopping IPC Server Responder
10:28:46,125 INFO Server:654 - Stopping IPC Server listener on 4344
10:28:46,383 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
is stopped.
10:28:46,383 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
is stopped.
10:28:46,383 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
is stopped.
10:28:46,383 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is
stopped.
10:28:46,383 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService is
stopped.
10:28:46,384 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService is
stopped.
10:28:46,384 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.DeletionService is stopped.
10:28:46,384 INFO AbstractService:113 -
Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is stopped.
10:28:46,384 INFO MetricsSystemImpl:200 - Stopping NodeManager metrics
system...
10:28:46,385 INFO MetricsSystemImpl:206 - NodeManager metrics system stopped.
10:28:46,385 INFO MetricsSystemImpl:572 - NodeManager metrics system shutdown
complete.
10:28:46,385 INFO NodeManager:315 - Rebooting the node manager.
{noformat}
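In case it's useful, here's roughly how I'm checking which of the "still running on shutdown" containers actually leak on the box after the NM reboots (the container IDs and the cgroup mount point below are just the ones from the log above):
{noformat}
#!/bin/bash
# Quick manual check for leaked containers after an NM reboot: for each
# container the NM listed as still running on shutdown, see whether a
# process and a cpu cgroup directory are still present on this node.
CGROUP_CPU=/cgroup/cpu/hadoop-yarn   # cgroup mount from the warnings above

for cid in container_1371849977601_0001_02_000003 \
           container_1371849977601_0002_02_000003 \
           container_1371849977601_0003_01_000004 \
           container_1371849977601_0004_01_000002; do
  echo "== $cid"
  ps -ef | grep "$cid" | grep -v grep       # leaked container JVM, if any
  ls -d "$CGROUP_CPU/$cid" 2>/dev/null      # cgroup dir the NM failed to delete
done
{noformat}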
Cheers,
Chris
> YARN NM leaking containers with CGroups
> ---------------------------------------
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and
> YARN-600.
> Reporter: Chris Riccomini
> Attachments: rm-log
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm
> seeing containers getting leaked by the NMs. I'm not quite sure what's going
> on -- has anyone seen this before? I'm concerned that maybe it's a
> misunderstanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100.
> This means that container container_1371141151815_0008_03_000002 was killed
> by YARN, either due to being released by the application master or being
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container
> container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314 WARN ContainersMonitorImpl:463 -
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> is interrupted. Exiting.
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> {noformat}
> And, if I look on the machine that's running
> container_1371141151815_0008_03_000002, I see:
> {noformat}
> $ ps -ef | grep container_1371141151815_0008_03_000002
> criccomi 5365 27915 38 Jun18 ? 21:35:05
> /export/apps/jdk/JDK-1_6_0_21/bin/java -cp
> /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
> {noformat}
> The same holds true for container_1371141151815_0006_01_001598. When I look
> in the container logs, it's just happily running. No kill signal appears to
> be sent, and no error appears.
> Lastly, the RM logs show no major events around the time of the leak
> (5:35am). I am able to reproduce this simply by waiting about 12 hours, or
> so, and it seems to have started happening after I switched over to CGroups
> and LCE, and turned on stateful RM (using file system).
> Any ideas what's going on?
> Thanks!
> Chris
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira