[
https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hu Ziqian updated YARN-8382:
----------------------------
Attachment: YARN-8382-branch-2.8.3.001.patch
YARN-8382.001.patch
> cgroup file leak in NM
> ----------------------
>
> Key: YARN-8382
> URL: https://issues.apache.org/jira/browse/YARN-8382
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Environment: we run a container with a shutdownHook that contains a
> piece of code like "while(true) sleep(100)". When
> yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms <
> yarn.nodemanager.sleep-delay-before-sigkill.ms, the cgroup file leak happens;
> when yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms >
> yarn.nodemanager.sleep-delay-before-sigkill.ms, the cgroup file is deleted
> successfully.
> Reporter: Hu Ziqian
> Assignee: Hu Ziqian
> Priority: Major
> Attachments: YARN-8382-branch-2.8.3.001.patch, YARN-8382.001.patch
>
>
> As Jiandan said in YARN-6525, the NM may time out while deleting a
> container's cgroup file, with logs like
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler:
> Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to
> delete for 1000ms
>
> We found that one such situation is when
> *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set larger than
> yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms: the
> cgroup file leak happens.
>
> A container's process tree looks like the following:
> bash(16097)───java(16099)─┬─{java}(16100)
>                           ├─{java}(16101)
>                           ├─{java}(16102)
>
> When the NM kills a container, it sends kill -15 -pid to kill the container's
> process group. The bash process exits when it receives SIGTERM, but the java
> process may still do some work (a shutdownHook, etc.) and may not exit until
> it receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler
> begins trying to delete the cgroup. So when
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*
> expires, the java processes may still be running and cgroup/tasks may still
> be non-empty, which causes a cgroup file leak.
>
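
The race described above can be illustrated with a toy timeline model. This is only a sketch under stated assumptions, not NM code: the class and method names are hypothetical, the polling interval is an illustrative parameter, deletion is assumed to start at the same moment SIGTERM is sent, and the java process is assumed to exit only when SIGKILL arrives (the worst case from the report).

```java
// Toy model of the timing race; not part of Hadoop.
final class CgroupLeakTimeline {

    /**
     * Returns true if the delete loop sees an empty tasks file before it
     * gives up, false if the timeout expires first (the cgroup leaks).
     * Times are in milliseconds, measured from the moment SIGTERM is sent.
     */
    static boolean cgroupDeleted(long deleteTimeoutMs, long sigkillDelayMs, long pollMs) {
        if (pollMs <= 0) {
            throw new IllegalArgumentException("pollMs must be positive");
        }
        for (long t = 0; t <= deleteTimeoutMs; t += pollMs) {
            // Assumption: the java process survives SIGTERM and only dies
            // once SIGKILL is delivered at sigkillDelayMs.
            boolean tasksEmpty = t >= sigkillDelayMs;
            if (tasksEmpty) {
                return true; // directory can be removed
            }
        }
        return false; // timed out while tasks was still non-empty
    }
}
```

In this model, cgroupDeleted(1000, 250, 20) succeeds, while cgroupDeleted(1000, 2000, 20) gives up before SIGKILL is ever sent.
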
> We add a condition that
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be
> greater than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this
> problem.
>
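
A minimal sketch of what such a condition could look like is given here. It is not the content of the attached patches: the class name, method name, and the use of a plain Map in place of the Hadoop Configuration object are assumptions made to keep the example self-contained. Only the two property names come from the issue, and both are assumed to be set explicitly.

```java
import java.util.Map;

// Hypothetical standalone check; the real patch may be structured differently.
final class CgroupTimeoutCheck {
    static final String DELETE_TIMEOUT_KEY =
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms";
    static final String SIGKILL_DELAY_KEY =
        "yarn.nodemanager.sleep-delay-before-sigkill.ms";

    /**
     * Rejects configurations where the cgroup delete timeout is not strictly
     * greater than the SIGTERM-to-SIGKILL delay. Both keys must be present.
     */
    static void validate(Map<String, String> conf) {
        long deleteTimeoutMs = Long.parseLong(conf.get(DELETE_TIMEOUT_KEY));
        long sigkillDelayMs = Long.parseLong(conf.get(SIGKILL_DELAY_KEY));
        if (deleteTimeoutMs <= sigkillDelayMs) {
            throw new IllegalArgumentException(
                DELETE_TIMEOUT_KEY + " (" + deleteTimeoutMs + ") must be greater than "
                    + SIGKILL_DELAY_KEY + " (" + sigkillDelayMs + ")");
        }
    }
}
```

Failing fast at startup is one option; logging a warning instead would be a softer alternative if existing deployments must keep starting.
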
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)