kyungwan nam created YARN-9929:
----------------------------------
Summary: NodeManager OOM because of stuck DeletionService
Key: YARN-9929
URL: https://issues.apache.org/jira/browse/YARN-9929
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.1.2
Reporter: kyungwan nam
Assignee: kyungwan nam
NMs go through frequent Full GC due to a lack of heap memory.
We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in the heap dump (screenshot is attached).
After analyzing the thread dump, we can see that _DeletionService_ gets stuck in _executeStatusCommand_, which runs 'docker inspect'.
{code:java}
"DeletionService #0" - Thread t@41
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <3e45c938> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <3e45c938> (a java.io.InputStreamReader)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
at org.apache.hadoop.util.Shell.run(Shell.java:902)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
Also, we found 'docker inspect' processes that have been running for a long time, as follows.
{code:java}
root 95637 0.0 0.0 2650984 35776 ? Sl Aug23 5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_000030
root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455
root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_000024
root 95643 0.0 0.0 2642532 34428 ? Sl Aug23 5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657{code}
I think this has occurred since the docker daemon was restarted.
The 'docker inspect' commands that were started while the docker daemon was restarting stopped working, and they were never terminated.
This can be considered a docker issue, but it could happen whenever 'docker inspect' hangs, whether due to a docker daemon restart or a docker bug.
It would be good to set a timeout for 'docker inspect' to avoid this issue.
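A minimal sketch of what such a timeout might look like, using the timeout support already present in org.apache.hadoop.util.Shell$ShellCommandExecutor. The class name, the DOCKER_INSPECT_TIMEOUT_MS constant, and the surrounding wiring are assumptions for illustration only; the real fix would likely be plumbed through PrivilegedOperationExecutor / DockerCommandExecutor.
{code:java}
import java.io.IOException;
import org.apache.hadoop.util.Shell;

// Sketch only: run 'docker inspect' with a bounded execution time so a hung
// docker daemon cannot block DeletionService threads forever.
public class DockerInspectWithTimeout {

  // Hypothetical constant, not an existing YARN config key.
  private static final long DOCKER_INSPECT_TIMEOUT_MS = 10_000L;

  public static String inspectContainerStatus(String containerId) throws IOException {
    String[] command = new String[] {
        "/usr/bin/docker", "inspect", "--format={{.State.Status}}", containerId};

    // ShellCommandExecutor accepts a timeout in milliseconds; when it fires,
    // the child process is destroyed instead of being left to hang.
    Shell.ShellCommandExecutor executor =
        new Shell.ShellCommandExecutor(command, null, null, DOCKER_INSPECT_TIMEOUT_MS);
    try {
      executor.execute();
    } catch (IOException e) {
      if (executor.isTimedOut()) {
        // Treat a timed-out inspect as "status unknown" instead of blocking the deletion task.
        return null;
      }
      throw e;
    }
    return executor.getOutput().trim();
  }
}
{code}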