[
https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kyungwan nam updated YARN-9929:
-------------------------------
Attachment: nm_heapdump.png
> NodeManager OOM because of stuck DeletionService
> ------------------------------------------------
>
> Key: YARN-9929
> URL: https://issues.apache.org/jira/browse/YARN-9929
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: kyungwan nam
> Assignee: kyungwan nam
> Priority: Major
> Attachments: nm_heapdump.png
>
>
> NMs go through frequent full GCs due to a lack of heap memory.
> The heap dump contains a large number of FileDeletionTask and
> DockerContainerDeletionTask objects (screenshot attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets stuck
> in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
> java.lang.Thread.State: RUNNABLE
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:255)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> - locked <3e45c938> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.read1(BufferedReader.java:212)
> at java.io.BufferedReader.read(BufferedReader.java:286)
> - locked <3e45c938> (a java.io.InputStreamReader)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
> - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
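> To make the failure mode concrete, here is a minimal standalone sketch (not NodeManager code) of why the thread above is stuck: reading the stdout of a child process blocks until the process writes output or exits, so a 'docker inspect' that hangs after a docker daemon restart pins the DeletionService thread indefinitely.
> {code:java}
> // Minimal sketch (not NodeManager code). 'cat' with no input never writes
> // and never exits, standing in for a hung 'docker inspect'.
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
>
> public class BlockingReadDemo {
>   public static void main(String[] args) throws IOException {
>     Process p = new ProcessBuilder("cat").start();
>     BufferedReader out =
>         new BufferedReader(new InputStreamReader(p.getInputStream()));
>     // Blocks forever, just like Shell.parseExecResult in the stack trace above.
>     System.out.println(out.readLine());
>   }
> }
> {code}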
> Also, we found that 'docker inspect' processes have been running for a long
> time, as follows.
> {code:java}
> root 95637 0.0 0.0 2650984 35776 ? Sl Aug23 5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_000030
> root 95638 0.0 0.0 2773860 33908 ? Sl Aug23 5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455
> root 95641 0.0 0.0 2445924 34204 ? Sl Aug23 5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_000024
> root 95643 0.0 0.0 2642532 34428 ? Sl Aug23 5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657{code}
>
> I think this started after the docker daemon was restarted.
> A 'docker inspect' that was running while the docker daemon restarted stopped
> making progress and was never terminated.
> This can be considered a docker issue, but it could happen whenever
> 'docker inspect' hangs, whether due to a docker daemon restart or a docker bug.
> It would be good to set a timeout for 'docker inspect' to avoid this issue,
> as sketched below.
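> For example, Hadoop's _Shell.ShellCommandExecutor_ already accepts a timeout in its constructor. The sketch below only illustrates bounding the 'docker inspect' call; the class name, timeout value, and fallback status are hypothetical, and the real NodeManager code path goes through _PrivilegedOperationExecutor_ rather than calling the shell directly.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.util.Shell.ShellCommandExecutor;
>
> public class DockerInspectWithTimeout {
>   // Hypothetical timeout value; a real fix would read it from NM configuration.
>   private static final long INSPECT_TIMEOUT_MS = 60_000L;
>
>   public static String inspectStatus(String containerId) throws IOException {
>     ShellCommandExecutor shexec = new ShellCommandExecutor(
>         new String[] {"/usr/bin/docker", "inspect",
>             "--format={{.State.Status}}", containerId},
>         null, null, INSPECT_TIMEOUT_MS);  // working dir, env, timeout in ms
>     try {
>       shexec.execute();                   // Shell destroys the child on timeout
>     } catch (IOException e) {
>       if (shexec.isTimedOut()) {
>         // A hung 'docker inspect' (e.g. during a docker daemon restart) no
>         // longer blocks the caller; report an unknown status instead.
>         return "unknown";
>       }
>       throw e;
>     }
>     return shexec.getOutput().trim();
>   }
> }
> {code}
> When the timeout fires, Shell kills the child process, so the DeletionService thread is released and the deletion task can be retried or dropped instead of piling up on the heap.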
>