[ https://issues.apache.org/jira/browse/YARN-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219536#comment-16219536 ]
Eric Badger commented on YARN-7395:
-----------------------------------
Here are the relevant lines from the NM log:
{noformat}
2017-10-25 20:03:07,549 [Container Monitor] WARN monitor.ContainersMonitorImpl:
Process tree for container: container_e126_1508911755032_0004_02_000001 has
processes older than 1 iteration running over the configured limit.
Limit=536870912, current usage = 585281536
2017-10-25 20:03:07,551 [Container Monitor] WARN monitor.ContainersMonitorImpl:
Container [pid=29030,containerID=container_e126_1508911755032_0004_02_000001]
is running beyond physical memory limits. Current usage: 558.2 MB of 512 MB
physical memory used; 2.8 GB of 1.0 GB virtual memory used. Killing container.
Dump of the process-tree for container_e126_1508911755032_0004_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 29065 29030 29030 29030 (java) 6022 290 2962636800 142606 /bin/java
-Djava.io.tmpdir=/tmp/yarn-local/usercache/ebadger/appcache/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
-Dhadoop.root.logfile=syslog
-XX:ErrorFile=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/hs_err_pid%p.log
-XX:GCTimeLimit=50 -XX:ParallelGCThreads=4 -XX:NewRatio=8
-Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/gc.log
-Xmx1024m -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true
org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 29030 29014 29030 29030 (bash) 3 2 9474048 285 /bin/bash -c
/bin/java
-Djava.io.tmpdir=/tmp/yarn-local/usercache/ebadger/appcache/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
-Dhadoop.root.logfile=syslog
-XX:ErrorFile=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/hs_err_pid%p.log
-XX:GCTimeLimit=50 -XX:ParallelGCThreads=4 -XX:NewRatio=8
-Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/gc.log
-Xmx1024m -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true
org.apache.hadoop.mapreduce.v2.app.MRAppMaster
1>/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/stdout
2>/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/stderr
2017-10-25 20:03:07,551 [Container Monitor] INFO monitor.ContainersMonitorImpl:
Removed ProcessTree with root 29030
2017-10-25 20:03:07,551 [AsyncDispatcher event handler] INFO
container.ContainerImpl: Container container_e126_1508911755032_0004_02_000001
transitioned from RUNNING to KILLING
2017-10-25 20:03:07,552 [AsyncDispatcher event handler] INFO
launcher.ContainerLaunch: Cleaning up container
container_e126_1508911755032_0004_02_000001
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] WARN
nodemanager.LinuxContainerExecutor: Error in signalling container 29030 with
SIGTERM; exit = 1
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
Signal container failed
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:615)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:510)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:473)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:140)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:56)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:745)
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] INFO
nodemanager.ContainerExecutor: Using command stop
'container_e126_1508911755032_0004_02_000001'
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] WARN
launcher.ContainerLaunch: Exception when trying to cleanup container
container_e126_1508911755032_0004_02_000001: java.io.IOException: Problem
signalling container 29030 with SIGTERM; output: Using command stop
'container_e126_1508911755032_0004_02_000001'
and exitCode: 1
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:521)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:473)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:140)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:56)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
Signal container failed
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:615)
at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:510)
... 6 more
{noformat}
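The numbers in the first WARN line can be cross-checked against the process-tree dump. The following is only a rough sketch, not the NM's actual accounting code: it sums the RSSMEM_USAGE(PAGES) column for the two processes in the tree and converts pages to bytes. The 4 KiB page size is an assumption about this host, and the names (RSS_PAGES, tree_rss_bytes) are made up for illustration.

```python
# RSS page counts copied from the RSSMEM_USAGE(PAGES) column of the dump above.
RSS_PAGES = {
    29065: 142606,  # java (MRAppMaster)
    29030: 285,     # bash wrapper, root of the process tree
}

PAGE_SIZE = 4096         # assumption: 4 KiB pages on this host
LIMIT_BYTES = 536870912  # "Limit=" value from the first WARN line (512 MB)


def tree_rss_bytes(rss_pages, page_size=PAGE_SIZE):
    """Sum resident memory over every process in the container's tree."""
    return sum(rss_pages.values()) * page_size


usage = tree_rss_bytes(RSS_PAGES)
print(usage)                    # 585281536, matches "current usage" in the log
print(round(usage / 2**20, 1))  # 558.2, matches "558.2 MB of 512 MB"
print(usage > LIMIT_BYTES)      # True, so the monitor decides to kill
```

Under that page-size assumption the totals agree with the log exactly, so the over-limit detection looks correct; the failure is in the signalling step that follows it.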
> NM fails to successfully kill tasks that run over their memory limit
> --------------------------------------------------------------------
>
> Key: YARN-7395
> URL: https://issues.apache.org/jira/browse/YARN-7395
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Eric Badger
>
> The NM correctly notes that the container is over its configured limit, but
> then fails to kill the process. As a result, the Docker container running the
> AM stays around and the job keeps running.