Hu Ziqian created YARN-8382:
-------------------------------

             Summary: cgroup file leak in NM
                 Key: YARN-8382
                 URL: https://issues.apache.org/jira/browse/YARN-8382
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
         Environment: We run a container with a shutdownHook that contains 
code like "while(true) sleep(100)". When 
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms < 
yarn.nodemanager.sleep-delay-before-sigkill.ms, the cgroup file leak happens; 
when yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms > 
yarn.nodemanager.sleep-delay-before-sigkill.ms, the cgroup file is deleted 
successfully.
            Reporter: Hu Ziqian
            Assignee: Hu Ziqian


As Jiandan said in YARN-6525, the NM may time out while deleting a container's 
cgroup file, with logs like:

org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: 
Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to 
delete for 1000ms

 

We found that one situation in which this happens is when 
yarn.nodemanager.sleep-delay-before-sigkill.ms is set larger than 
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms. In that 
case the cgroup file leak occurs.

 

A container's process tree looks like the following:

{{bash(16097)───java(16099)─┬─\{java}(16100)}}
{{                          ├─\{java}(16101)}}
{{                          ├─\{java}(16102)}}

 

When the NM kills a container, it sends kill -15 -pid to the container's 
process group. The bash process exits as soon as it receives SIGTERM, but the 
java process may still do some work (a shutdownHook, etc.) and may not exit 
until it receives SIGKILL. As soon as the bash process exits, 
CgroupsLCEResourcesHandler begins trying to delete the cgroup. So when 
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms expires, 
the java processes may still be running, cgroup/tasks is still not empty, and 
the cgroup file is leaked.
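
The timing above can be illustrated with a small model. This is only a sketch 
and not the NodeManager's code: the class name, the method name and the retry 
interval parameter are hypothetical, and it assumes the worst case where the 
java process ignores SIGTERM and only dies once SIGKILL is sent.

```java
/** Simplified model of the race described above; not NodeManager code. */
final class CgroupDeleteRaceModel {

    /**
     * Returns true if some delete attempt happens after the java process is gone.
     *
     * @param deleteTimeoutMs how long the NM keeps retrying the cgroup delete
     * @param sigkillDelayMs  delay between SIGTERM and SIGKILL
     * @param retryIntervalMs hypothetical pause between delete attempts
     */
    static boolean cgroupDeleted(long deleteTimeoutMs, long sigkillDelayMs,
                                 long retryIntervalMs) {
        if (retryIntervalMs <= 0) {
            throw new IllegalArgumentException("retryIntervalMs must be positive");
        }
        // Assumption: bash exits at t=0 and deletion attempts start at t=0,
        // while the java process survives until SIGKILL arrives.
        long processGoneAtMs = sigkillDelayMs;
        for (long nowMs = 0; nowMs <= deleteTimeoutMs; nowMs += retryIntervalMs) {
            // The cgroup can only be removed once its tasks file is empty,
            // which we model as strictly after the SIGKILL was sent.
            boolean tasksEmpty = nowMs > processGoneAtMs;
            if (tasksEmpty) {
                return true;
            }
        }
        return false; // timeout expired while tasks was still non-empty: leak
    }

    public static void main(String[] args) {
        System.out.println("timeout=1000, sigkill delay=5000 -> deleted: "
                + cgroupDeleted(1000, 5000, 20));
        System.out.println("timeout=6000, sigkill delay=5000 -> deleted: "
                + cgroupDeleted(6000, 5000, 20));
    }
}
```

In this model the delete only succeeds when the delete timeout is larger than 
the SIGKILL delay, which matches what we observed.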

 

We add a condition that 
yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms must be 
bigger than yarn.nodemanager.sleep-delay-before-sigkill.ms to solve this 
problem.
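
A check of this kind could look roughly like the sketch below. It is not the 
actual patch: it reads the two properties from a plain Map instead of Hadoop's 
Configuration, and the class and method names are hypothetical. Only the two 
property names come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone check; not the actual YARN-8382 patch. */
final class CgroupTimeoutCheck {

    static final String DELETE_TIMEOUT_KEY =
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms";
    static final String SIGKILL_DELAY_KEY =
        "yarn.nodemanager.sleep-delay-before-sigkill.ms";

    /** Throws IllegalArgumentException unless delete timeout > SIGKILL delay. */
    static void validate(Map<String, String> conf) {
        long deleteTimeoutMs = requiredLong(conf, DELETE_TIMEOUT_KEY);
        long sigkillDelayMs = requiredLong(conf, SIGKILL_DELAY_KEY);
        if (deleteTimeoutMs <= sigkillDelayMs) {
            throw new IllegalArgumentException(
                DELETE_TIMEOUT_KEY + " (" + deleteTimeoutMs + ") must be bigger than "
                + SIGKILL_DELAY_KEY + " (" + sigkillDelayMs + ")");
        }
    }

    /** Reads a mandatory long value; this sketch assumes no defaults. */
    static long requiredLong(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DELETE_TIMEOUT_KEY, "1000");
        conf.put(SIGKILL_DELAY_KEY, "5000");
        try {
            validate(conf);
            System.out.println("configuration accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("configuration rejected: " + e.getMessage());
        }
    }
}
```

Whether such a check should fail NM startup or only log a warning is a design 
choice; the sketch simply rejects the configuration.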

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
