[
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599213#comment-16599213
]
Jim Brennan commented on YARN-8648:
-----------------------------------
[~jlowe] thanks for the review!
{quote}Why was the postComplete call moved in reapContainer to before the
container is removed via docker? Shouldn't docker first remove its cgroups for
the container before we remove ours?
{quote}
I was trying to preserve the order of operations. Normally postComplete() is
called immediately after launchContainer() returns, and then reapContainer()
is called as part of cleanupContainer() processing. So the resource handlers
usually get a chance to clean up their cgroups before we clean up the
container. If we do the docker cleanup first, it will delete the cgroups
before the resource handler postComplete() processing; the docker cleanup
doesn't know which cgroups are handled by resource handlers, so it just
deletes them all. Since both are really just deleting the cgroups, I don't
think the order matters that much, so I will move it back if you think that
is better.
{quote}Is there a reason to separate removing docker cgroups from removing the
docker container? This seems like a natural extension to cleaning up after a
container run by docker, and that's already covered by the reap command. The
patch would remain a docker-only change but without needing to modify the
container-executor interface.
{quote}
It is currently being done as part of the reap processing, but as a separate
privileged operation. We definitely could add this processing to the
remove-docker-container handling in container-executor instead, but that
would require adding the yarn-hierarchy as an additional argument to the
DockerRmCommand. It would also require changing DockerContainerDeletionTask
to store the yarn-hierarchy String along with the ContainerId. Despite the
additional container-executor interface change, I think the current approach
is simpler and less code, but I'm definitely willing to rework it if you
think that is a better solution.
{quote}Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased.
Similar for CGROUP_MOUNT.
{quote}
I will fix these.
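For example (the names are from the review comment; the values here are
illustrative, not necessarily what the patch will use):
{code:c}
/* Constants rather than upper-cased locals; values are illustrative. */
#define PROC_MOUNT_PATH "/proc/mounts"
#define CGROUP_MOUNT    "/sys/fs/cgroup/"
{code}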
{quote}The snprintf result should be checked for truncation in addition to
output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we
formulate an incomplete path targeted for deletion if that somehow occurs.
Alternatively the code could use make_string or asprintf to allocate an
appropriately sized buffer for each entry rather than trying to reuse a
manually sized buffer.
{quote}
I will fix this. I forgot about make_string().
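Roughly one of these two, inside the loop over the cgroup mount entries
(make_string() is the existing helper in util.c; the other names here are
illustrative):
{code:c}
/* Needs <limits.h> and <stdio.h>. Never act on a truncated path. */
char path[PATH_MAX];
int len = snprintf(path, sizeof(path), "%s/%s/%s",
                   mount_point, yarn_hierarchy, container_id);
if (len < 0 || len >= (int) sizeof(path)) {
  continue;  /* output error or truncation: skip this entry */
}

/* Alternative: make_string() allocates an appropriately sized buffer. */
char *alloc_path = make_string("%s/%s/%s",
                               mount_point, yarn_hierarchy, container_id);
if (alloc_path != NULL) {
  /* ... remove the directory ... */
  free(alloc_path);
}
{code}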
{quote}Is there any point in logging to the error file that a path we want to
delete has already been deleted? This seems like it will just be noise,
especially if systemd or something else is periodically cleaning some of these
empty cgroups.
{quote}
I'll remove it; it was nice while debugging, but it's not needed.
{quote}Related to the previous comment, the rmdir result should be checked for
ENOENT and treat that as success.
{quote}
I explicitly check that the directory exists before calling rmdir(), so I'm
not sure this is necessary, but the directory could still go away between
the check and the rmdir() call, so I will add it anyway.
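Something like this (assuming the usual ERRORFILE stream; needs <errno.h>,
<string.h>, and <unistd.h>):
{code:c}
/* Treat a directory that is already gone as success. */
if (rmdir(path) != 0 && errno != ENOENT) {
  fprintf(ERRORFILE, "Failed to rmdir %s - %s\n", path, strerror(errno));
}
{code}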
{quote}Nit: I think lineptr should be freed in the cleanup label in case
someone later adds a fatal error that jumps to cleanup.
{quote}
Will do.
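i.e. roughly this shape (only lineptr is from the patch; the rest of the
names are illustrative):
{code:c}
#include <stdio.h>
#include <stdlib.h>

static int scan_cgroup_mounts(void) {
  int rc = -1;
  char *lineptr = NULL;
  size_t n = 0;
  FILE *fp = fopen("/proc/mounts", "r");
  if (fp == NULL) {
    goto cleanup;
  }
  while (getline(&lineptr, &n, fp) != -1) {
    /* ... examine each mount entry ... */
  }
  rc = 0;
cleanup:
  free(lineptr);   /* free(NULL) is a no-op, so this is always safe */
  if (fp != NULL) {
    fclose(fp);
  }
  return rc;
}
{code}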
Thanks again for the review!
> Container cgroups are leaked when using docker
> ----------------------------------------------
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups
> for all resources on the system, not just for cpu. For instance, if the
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}},
> the nodemanager will create a cgroup for each container under
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path
> via the {{--cgroup-parent}} command line argument. Docker then creates a
> cgroup for the docker container under that, for instance:
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}}
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is
> good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every
> other resource controller under {{/sys/fs/cgroup}}. On the rhel7 system I
> am using, these
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio,
> perf_event, and systemd. So for instance, docker creates
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but
> it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up
> the {{container_id}} cgroups for these other resources. On one of our busy
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to
> reproduce it with current hadoop.