[ https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kyungwan nam updated YARN-9929:
-------------------------------
    Attachment: nm_heapdump.png

> NodeManager OOM because of stuck DeletionService
> ------------------------------------------------
>
>                 Key: YARN-9929
>                 URL: https://issues.apache.org/jira/browse/YARN-9929
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: kyungwan nam
>            Assignee: kyungwan nam
>            Priority: Major
>         Attachments: nm_heapdump.png
>
>
> NMs go through frequent Full GCs due to a lack of heap memory.
> The heap dump contains a large number of FileDeletionTask and DockerContainerDeletionTask objects (screenshot attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets stuck in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
>    java.lang.Thread.State: RUNNABLE
>       at java.io.FileInputStream.readBytes(Native Method)
>       at java.io.FileInputStream.read(FileInputStream.java:255)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>       - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>       at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>       at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>       at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>       - locked <3e45c938> (a java.io.InputStreamReader)
>       at java.io.InputStreamReader.read(InputStreamReader.java:184)
>       at java.io.BufferedReader.fill(BufferedReader.java:161)
>       at java.io.BufferedReader.read1(BufferedReader.java:212)
>       at java.io.BufferedReader.read(BufferedReader.java:286)
>       - locked <3e45c938> (a java.io.InputStreamReader)
>       at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
>       at org.apache.hadoop.util.Shell.run(Shell.java:902)
>       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
>       at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>       - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker) 
> {code}
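> For illustration only, here is a minimal, hypothetical sketch (plain JDK ProcessBuilder, not the actual Shell/DockerCommandExecutor code path) of the blocking pattern in the stack trace above: the stdout of 'docker inspect' is read with no timeout, so if the docker daemon hangs, the reading thread blocks exactly like the DeletionService thread.
> {code:java}
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> 
> public class BlockingInspectSketch {
>   public static void main(String[] args) throws IOException {
>     // Hypothetical container id, for illustration only.
>     String containerId = "container_e30_1555419799458_0014_01_000030";
>     Process p = new ProcessBuilder(
>         "/usr/bin/docker", "inspect", "--format={{.State.Status}}", containerId)
>         .redirectErrorStream(true)
>         .start();
>     try (BufferedReader out =
>              new BufferedReader(new InputStreamReader(p.getInputStream()))) {
>       // If the docker daemon is restarting and 'docker inspect' never writes
>       // its output or exits, this read blocks indefinitely -- the same state
>       // the "DeletionService #0" thread is stuck in.
>       String status = out.readLine();
>       System.out.println("container status: " + status);
>     }
>   }
> }
> {code}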
> Also, we found 'docker inspect' processes that have been running for a long time, as shown below.
> {code:java}
> root      95637  0.0  0.0 2650984 35776 ?       Sl   Aug23   5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_000030
> root      95638  0.0  0.0 2773860 33908 ?       Sl   Aug23   5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455
> root      95641  0.0  0.0 2445924 34204 ?       Sl   Aug23   5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_000024
> root      95643  0.0  0.0 2642532 34428 ?       Sl   Aug23   5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657
> {code}
>  
> I think this has occurred since the docker daemon was restarted.
> A 'docker inspect' that was started while the docker daemon was restarting did not work, and it was never terminated.
> This can be considered a Docker issue, but it can happen whenever 'docker inspect' does not respond, whether because the docker daemon is restarting or because of a Docker bug.
> It would be good to set a timeout for 'docker inspect' to avoid this issue.
>  
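> A hedged sketch of what such a timeout could look like, using the timeout argument that org.apache.hadoop.util.Shell.ShellCommandExecutor already accepts. This only illustrates the idea with hypothetical names; an actual fix would need to plumb the timeout through PrivilegedOperationExecutor / DockerCommandExecutor rather than invoke the shell directly.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.util.Shell.ShellCommandExecutor;
> 
> public class DockerInspectTimeoutSketch {
>   // Hypothetical helper: run 'docker inspect', but give up after timeoutMs
>   // instead of blocking the DeletionService thread forever.
>   static String inspectStatus(String containerId, long timeoutMs)
>       throws IOException {
>     ShellCommandExecutor shexec = new ShellCommandExecutor(
>         new String[] {"/usr/bin/docker", "inspect",
>             "--format={{.State.Status}}", containerId},
>         null, null, timeoutMs);
>     try {
>       shexec.execute();
>     } catch (IOException e) {
>       if (shexec.isTimedOut()) {
>         // docker daemon did not respond within timeoutMs; report unknown
>         // instead of hanging the DeletionService worker.
>         return "unknown";
>       }
>       throw e;
>     }
>     return shexec.getOutput().trim();
>   }
> 
>   public static void main(String[] args) throws IOException {
>     // Hypothetical container id, for illustration only.
>     System.out.println(
>         inspectStatus("container_e30_1555419799458_0014_01_000030", 10000L));
>   }
> }
> {code}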



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to