[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599213#comment-16599213 ]

Jim Brennan commented on YARN-8648:
-----------------------------------

[~jlowe], thanks for the review!
{quote}Why was the postComplete call moved in reapContainer to before the 
container is removed via docker? Shouldn't docker first remove its cgroups for 
the container before we remove ours?
{quote}
I was trying to preserve the order of operations. Normally postComplete is 
called immediately after launchContainer() returns, and then reapContainer() is 
called as part of cleanupContainer() processing, so the resource handlers 
usually get a chance to clean up their cgroups before we clean up the 
container. If we do the docker cleanup first, it will delete the cgroups before 
the resource handler postComplete processing runs - the docker cleanup doesn't 
know which cgroups are handled by resource handlers, so it just deletes them 
all. Since both are really just deleting the cgroups, I don't think the order 
matters much, so I will move it back if you think that is better.
{quote}Is there a reason to separate removing docker cgroups from removing the 
docker container? This seems like a natural extension to cleaning up after a 
container run by docker, and that's already covered by the reap command. The 
patch would remain a docker-only change but without needing to modify the 
container-executor interface.
{quote}
It is currently being done as part of the reap processing, but as a separate 
privileged operation. We definitely could fold this processing into the 
remove-docker-container processing in container-executor, but it would require 
adding the yarn-hierarchy as an additional argument to DockerRmCommand. It 
would also require changing DockerContainerDeletionTask() to store the 
yarn-hierarchy String along with the ContainerId. Despite the addition to the 
container-executor interface, I think the current approach is less code and 
simpler, but I'm definitely willing to rework it if you think the other way is 
a better solution.
{quote}Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased. 
Similar for CGROUP_MOUNT.
{quote}
I will fix these.
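For example, something like this (the values shown are the conventional Linux 
locations, given here just for illustration):
{code:c}
/* Promote the path constants to macros instead of mixed-case variables. */
#define PROC_MOUNT_PATH "/proc/mounts"
#define CGROUP_MOUNT "/sys/fs/cgroup/"
{code}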
{quote}The snprintf result should be checked for truncation in addition to 
output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we 
formulate an incomplete path targeted for deletion if that somehow occurs. 
Alternatively the code could use make_string or asprintf to allocate an 
appropriately sized buffer for each entry rather than trying to reuse a 
manually sized buffer.
{quote}
I will fix this. I forgot about make_string().
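Something along these lines, as a sketch - the names here are illustrative, 
and make_string() is the existing container-executor helper that allocates an 
appropriately sized buffer:
{code:c}
#include <limits.h> /* PATH_MAX */
#include <stdio.h>  /* snprintf */

/* Build the cgroup path for one controller mount, rejecting truncation.
 * snprintf returns the length it would have written, so a result >= len
 * means the path was cut short and must not be used for deletion. */
static int build_cgroup_path(char *buf, size_t len, const char *mount_point,
                             const char *hierarchy, const char *container_id) {
  int ret = snprintf(buf, len, "%s/%s/%s", mount_point, hierarchy, container_id);
  if (ret < 0 || (size_t) ret >= len) {
    return -1; /* output error or truncated result */
  }
  return 0;
}
{code}
Or, avoiding the fixed-size buffer entirely: 
{{path = make_string("%s/%s/%s", mount_point, hierarchy, container_id)}} and 
free it after the rmdir.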
{quote}Is there any point in logging to the error file that a path we want to 
delete has already been deleted? This seems like it will just be noise, 
especially if systemd or something else is periodically cleaning some of these 
empty cgroups.
{quote}
I'll remove it - it was nice while debugging, but it's not needed.
{quote}Related to the previous comment, the rmdir result should be checked for 
ENOENT and treat that as success.
{quote}
I explicitly check that the directory exists before calling rmdir, so I'm not 
sure this is necessary, but the directory could still disappear between the 
check and the rmdir, so I will add it anyway.
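That is, something like this (an illustrative sketch, not the patch code):
{code:c}
#include <errno.h>
#include <unistd.h>

/* Remove one (empty) cgroup directory, treating "already gone" as success.
 * Even with an existence check first, the directory can disappear between
 * the check and the rmdir, so ENOENT is not a real failure. */
static int remove_cgroup_dir(const char *path) {
  if (rmdir(path) != 0 && errno != ENOENT) {
    return -1; /* a real failure, e.g. EBUSY or EACCES */
  }
  return 0;
}
{code}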
{quote}Nit: I think lineptr should be freed in the cleanup label in case 
someone later adds a fatal error that jumps to cleanup.
{quote}
Will do.
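Roughly this shape - a condensed sketch of the getline() loop with lineptr 
freed in the cleanup label (function and variable names are illustrative):
{code:c}
#include <stdio.h>
#include <stdlib.h>

static int cleanup_cgroup_mounts(const char *mounts_path) {
  int rc = -1;
  char *lineptr = NULL;
  size_t linelen = 0;
  FILE *fp = fopen(mounts_path, "r");
  if (fp == NULL) {
    goto cleanup;
  }
  while (getline(&lineptr, &linelen, fp) != -1) {
    /* ... parse the mount entry and remove the per-container cgroup ... */
  }
  rc = 0;
cleanup:
  /* Freed here so a future fatal error that jumps to cleanup cannot leak it;
   * free(NULL) is safe if getline never allocated. */
  free(lineptr);
  if (fp != NULL) {
    fclose(fp);
  }
  return rc;
}
{code}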

Thanks again for the review!

> Container cgroups are leaked when using docker
> ----------------------------------------------
>
>                 Key: YARN-8648
>                 URL: https://issues.apache.org/jira/browse/YARN-8648
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>              Labels: Docker
>         Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.  Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.


