[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576916#comment-16576916
 ] 

Jim Brennan commented on YARN-8648:
-----------------------------------

One proposal to fix the leaking cgroups is to have docker put its containers 
directly under the 
{{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}} directory. For 
example, instead of using {{cgroup-parent=/hadoop-yarn/container_id-}}, we use 
{{cgroup-parent=/hadoop-yarn}}. This does cause docker to create a 
{{hadoop-yarn}} cgroup under each resource type, and it does not clean those 
up, but that is just one unused cgroup per resource type vs hundreds of 
thousands.
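As a sketch of the difference (scratch directories, a made-up container id, and an illustrative controller list stand in for the real {{/sys/fs/cgroup}} tree):

```shell
# Simulate the two --cgroup-parent choices in a temp dir, not /sys/fs/cgroup.
# Controller names and container ids here are illustrative only.
current=$(mktemp -d)
proposed=$(mktemp -d)
controllers="cpu blkio memory cpuset"

# Current behavior: --cgroup-parent=/hadoop-yarn/container_id- makes docker
# create a per-container intermediate cgroup under EVERY controller.
for c in $controllers; do
  mkdir -p "$current/$c/hadoop-yarn/container_e01_0001/docker_id"
  mkdir -p "$current/$c/hadoop-yarn/container_e01_0002/docker_id"
done

# Proposed behavior: --cgroup-parent=/hadoop-yarn puts docker containers
# directly under hadoop-yarn, so the only unused cgroup per controller is
# hadoop-yarn itself.
for c in $controllers; do
  mkdir -p "$proposed/$c/hadoop-yarn/docker_id_1"
  mkdir -p "$proposed/$c/hadoop-yarn/docker_id_2"
done

# Intermediate per-container dirs (depth 3) are what docker leaves behind
# once it removes only the leaf cgroup.
current_leaks=$(find "$current" -mindepth 3 -maxdepth 3 -type d | wc -l | tr -d ' ')
proposed_leaks=0   # nothing per-container is created under the parent
echo "current=$current_leaks proposed=$proposed_leaks"
```

With the current scheme the per-container count grows with every launched container; with the proposed scheme it stays at zero.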

This can be done by passing an empty string to 
{{DockerLinuxContainerRuntime.addCGroupParentIfRequired()}}, or otherwise 
changing it to ignore the containerIdStr. Doing this and removing the code that 
cherry-picks the PID in container-executor does work, but the NM still creates 
the per-container cgroups as well - they are just not used. The other issue 
with this approach is that cpu.shares is still updated (to reflect the 
requested vcores allotment) in the per-container cgroup, where it now has no 
effect. In our code, we addressed this by passing the cpu.shares value via the 
{{docker run --cpu-shares}} command line argument.
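For illustration (the image name is made up, and the 1024-shares-per-vcore factor is an assumption based on the conventional cgroup cpu weighting, not taken from the patch), the workaround amounts to computing the shares from the requested vcores and putting them on the docker command line:

```shell
# Hedged sketch: derive cpu.shares from the requested vcores and pass it via
# `docker run --cpu-shares` rather than writing it into the unused
# per-container cgroup. 1024 shares per vcore mirrors the usual cgroup cpu
# weighting; "myimage" is a hypothetical image name.
vcores=2
shares=$((vcores * 1024))
echo "docker run --cgroup-parent=/hadoop-yarn --cpu-shares=$shares myimage"
```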

I'm still thinking about the best way to address this. Currently, most of the 
resource-handler processing happens at the {{LinuxContainerExecutor}} level, 
but there is clearly a difference in how cgroups need to be handled for the 
docker and linux cases. In the docker case, we should arguably use docker 
command line arguments instead of setting up cgroups directly.

One option would be to provide a runtime interface method, 
{{useResourceHandlers()}}, which would return false for Docker. We could then 
disable all of the resource-handler processing that happens in the container 
executor, and add the necessary interfaces to the docker runtime to handle 
cgroup parameters.

Another option would be to move the resource handler processing down into the 
runtime. This is a bigger change, but may be cleaner. The docker runtime may 
still just ignore those handlers, but that detail would be hidden at the 
container executor level.

cc: [~ebadger] [~jlowe] [~eyang] [[email protected]] [~billie.rinaldi]

 

> Container cgroups are leaked when using docker
> ----------------------------------------------
>
>                 Key: YARN-8648
>                 URL: https://issues.apache.org/jira/browse/YARN-8648
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>              Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is 
> good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.    So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.
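For reference, the leak pattern described above can be reproduced safely in a scratch tree (controller names and container ids below are illustrative): a leaked cgroup is an intermediate {{container_*}} directory whose docker leaf has already been removed:

```shell
# Simulate the leak in a temp dir instead of /sys/fs/cgroup.
cg=$(mktemp -d)
for c in blkio cpuset devices freezer memory; do
  # docker removed the docker_container_id leaf under these controllers,
  # but nothing removes the parent container_id cgroup
  mkdir -p "$cg/$c/hadoop-yarn/container_e01_000001"
done
# the cpu hierarchy still has a live leaf, so it is not leaked
mkdir -p "$cg/cpu/hadoop-yarn/container_e01_000002/docker_leaf"

# leaked cgroups: container_* dirs left behind with no child cgroup
leaked=$(find "$cg" -mindepth 3 -maxdepth 3 -type d -name 'container_*' -empty \
          | wc -l | tr -d ' ')
echo "leaked=$leaked"
```

On a real node the same {{find}} over {{/sys/fs/cgroup}} would surface the stale {{container_id}} directories per controller.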



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
