[
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584479#comment-16584479
]
Jim Brennan commented on YARN-8648:
-----------------------------------
I have been experimenting with the following incomplete approach:
* CGroupsHandler
** Add missing controllers to the list of supported controllers
** Add initializeAllCGroupControllers()
*** Initializes all of the cgroup controllers that were not already
initialized by a ResourceHandler - this mainly means creating the hierarchy
cgroup (hadoop-yarn) or verifying that it exists and is writable.
** Add createCGroupAllControllers(containerId)
*** Creates the containerId cgroup under all cgroup controllers
** Add deleteCGroupAllControllers(containerId)
*** Deletes the containerId cgroup under all cgroup controllers
* ResourceHandlerModule
** Add wrappers to call the above methods.
* LinuxContainerExecutor
** Add calls to above methods if the runtime is Docker (would probably be
better to move these to the runtime)
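To make the above concrete, here is a sketch of the path logic behind the proposed create/delete-all-controllers methods. Python is used only for illustration (the real change would be Java in CGroupsHandler), and the controller list and the /sys/fs/cgroup mount root are assumptions based on the RHEL 7 layout described below:

```python
import os

# Controllers present on a typical RHEL 7 host (assumed list; the real
# code would take it from the supported-controller enum or a config
# property).
CONTROLLERS = ["cpu", "cpuacct", "blkio", "cpuset", "devices", "freezer",
               "hugetlb", "memory", "net_cls", "net_prio", "perf_event"]

def container_cgroup_paths(container_id, hierarchy="hadoop-yarn",
                           cgroup_root="/sys/fs/cgroup",
                           controllers=CONTROLLERS):
    """Paths that createCGroupAllControllers(containerId) would create
    and deleteCGroupAllControllers(containerId) would later remove."""
    return [os.path.join(cgroup_root, c, hierarchy, container_id)
            for c in controllers]

paths = container_cgroup_paths("container_01")
# paths[0] == "/sys/fs/cgroup/cpu/hadoop-yarn/container_01"
```

initializeAllCGroupControllers() would do the same walk one level up, creating or verifying each {{<controller>/hadoop-yarn}} directory.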
So far I have been testing with pre-mounted cgroup hierarchies. That is, I
manually created the hadoop-yarn cgroup under each controller.
I've run into several problems experimenting with this approach on RHEL 7:
* The hadoop-yarn cgroup under the following controllers is being deleted by
the system (when I let it sit idle for a while): blkio, devices, memory, pids
** I got around this for now by not adding pids to the list and by skipping
the others in the new methods. We are not leaking cgroups for these
controllers.
* I am still leaking cgroups under /sys/fs/cgroup/systemd
** Even if I add "systemd" as one of the supported controllers, our mount-tab
parsing code does not find it because it's not really a controller.
* This feels pretty hacky - it might be better to just add a new
dockerCGroupResourceHandler (as I mentioned above) to do effectively the same
thing - we'd have to supply the list of controllers in a config property and
deal with systemd. The way things are right now, we would still have to add
these to the list of supported controllers, because most of the interfaces are
based on a controller enum. But even moving it to a separate ResourceHandler
still seems hacky.
* I haven't tested the mount-cgroup path yet, but I believe we would need to
configure all of the controllers that we need to mount in
container-executor.cfg.
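The systemd problem in particular is visible in the mount table: the systemd hierarchy is mounted with fstype cgroup but is identified only by a {{name=systemd}} option, so parsing that keys off controller names in the options never finds it. A sketch (the /proc/mounts lines follow the RHEL 7 format but are illustrative, not captured from a real host):

```python
# Two example /proc/mounts lines in the RHEL 7 style (illustrative):
MOUNTS = (
    "cgroup /sys/fs/cgroup/memory cgroup "
    "rw,nosuid,nodev,noexec,relatime,memory 0 0\n"
    "cgroup /sys/fs/cgroup/systemd cgroup "
    "rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0\n"
)

def controller_mounts(mounts_text, known_controllers):
    """Map controller name -> mount point, mimicking mount-tab parsing
    that keys off controller names appearing in the mount options."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        mount_point, fstype, opts = fields[1], fields[2], fields[3]
        if fstype != "cgroup":
            continue
        for opt in opts.split(","):
            if opt in known_controllers:
                found[opt] = mount_point
    return found

found = controller_mounts(MOUNTS, {"memory", "systemd"})
# found == {"memory": "/sys/fs/cgroup/memory"} -- the systemd mount is
# missed because its options say name=systemd, never plain "systemd"
```

Handling systemd would therefore need a special case keyed on the {{name=}} option rather than the controller list.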
The main advantage of something along these lines is that it preserves the
existing cgroups hierarchy, and no additional code is needed to deal with
cgroup parameters. The other advantage is that we are pre-creating the
hadoop-yarn cgroups with the correct owner/permissions - docker creates them
as root.
At this point, I'm not sure if I should proceed with this approach and I'm
looking for opinions.
The options I am considering are:
# The approach I've been experimenting with, cleaned up
# The minimal, just-fix-the-leak approach, which would be to add a
cleanupCGroups() method to the runtime.
** We would call it after calling postComplete() on the ResourceHandlers in
LCE.
** Docker would be the only runtime that implements it.
** We'd need to add a container-executor function to handle it.
** It could search for the containerId cgroup under all mounted cgroups and
delete any that it finds
*** Would not delete any that still have processes
*** Security concerns?
# The let-docker-be-docker approach
** This is the change-the-cgroup-parent approach. Instead of passing
/hadoop-yarn/containerId, we would just use /hadoop-yarn and let docker create
its dockerContainerId cgroups under there.
** Solves the leak by just letting docker handle it - no intermediate
containerId cgroups are created, so they don't need to be deleted by NM.
** To do this, I think we'd need to change every Cgroups ResourceHandler to do
something different for Docker. The main ones are for blkio and cpu.
*** Don't create the containerId cgroups
*** Don't modify cgroup params directly.
*** Return the /hadoop-yarn/tasks path for the ADD_PID_TO_CGROUP operation so
we set the cgroup parent correctly.
*** Would likely need to add new PrivilegedOperations for each cgroup
parameter to pass them through (these are returned by
ResourceHandler.preStart()).
*** Add code to add each new cgroup parameter to docker run.
*** Would need to support updating params via docker update command to support
the ResourceHandler.updateContainer() method.
*** [~billie.rinaldi], I've thought a bit more about the docker-in-docker
case, which we thought would be a problem with this approach. I think it is
solvable though - you can obtain the name of the docker cgroup from
/proc/self/cgroup. I'm not sure whether this is workable for your use case,
though.
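The cleanup sweep in option 2 could look roughly like the following sketch (Python for illustration; the real version would be a container-executor function, and the hierarchy name is an assumption). The demo builds a throwaway stand-in tree, since deleting real cgroups needs root; note that on real cgroupfs an empty cgroup is removed with a bare rmdir:

```python
import os, tempfile

def sweep_container_cgroups(cgroup_root, hierarchy, container_id):
    """Delete the container_id cgroup under every controller mounted
    beneath cgroup_root, skipping any that still have live tasks.
    Returns the list of paths removed."""
    removed = []
    for controller in sorted(os.listdir(cgroup_root)):
        path = os.path.join(cgroup_root, controller, hierarchy, container_id)
        if not os.path.isdir(path):
            continue
        procs = os.path.join(path, "cgroup.procs")
        with open(procs) as f:
            if f.read().strip():          # still has processes: leave it
                continue
        # On real cgroupfs a bare rmdir removes an empty cgroup; this demo
        # runs on a normal filesystem, so drop the stand-in file first.
        os.remove(procs)
        os.rmdir(path)
        removed.append(path)
    return removed

# Throwaway stand-in for /sys/fs/cgroup (deleting real cgroups needs root).
root = tempfile.mkdtemp()
for controller, pids in [("cpu", ""), ("memory", "12345\n")]:
    d = os.path.join(root, controller, "hadoop-yarn", "container_01")
    os.makedirs(d)
    with open(os.path.join(d, "cgroup.procs"), "w") as f:
        f.write(pids)

removed = sweep_container_cgroups(root, "hadoop-yarn", "container_01")
# cpu (no tasks) is removed; memory (task 12345) is kept
```

Skipping cgroups that still contain tasks covers the "would not delete any that still have processes" requirement; the security question of what the setuid container-executor may remove remains open.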
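On the docker-in-docker point in option 3: recovering the docker cgroup name from /proc/self/cgroup could be sketched as below. The sample file contents and the container id are invented for illustration; the line format (hierarchy-ID:controller-list:cgroup-path) is the documented cgroup v1 format from cgroups(7):

```python
def docker_cgroup_path(proc_self_cgroup, controller="cpu"):
    """Extract this process's cgroup path for one controller from the
    contents of /proc/self/cgroup (cgroup v1 line format:
    hierarchy-ID:controller-list:cgroup-path)."""
    for line in proc_self_cgroup.splitlines():
        _hid, controllers, path = line.split(":", 2)
        if controller in controllers.split(","):
            return path
    return None

# Invented sample: what a process inside a docker container launched with
# --cgroup-parent=/hadoop-yarn might see (container id is made up).
SAMPLE = (
    "4:cpu,cpuacct:/hadoop-yarn/9d4c6b1e0f72\n"
    "1:name=systemd:/hadoop-yarn/9d4c6b1e0f72\n"
)
path = docker_cgroup_path(SAMPLE)
# path == "/hadoop-yarn/9d4c6b1e0f72"; the last component is the docker
# container's cgroup name
```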
Comments? Concerns? Alternatives?
cc:[~jlowe], [~ebadger], [[email protected]], [~billie.rinaldi], [~eyang]
> Container cgroups are leaked when using docker
> ----------------------------------------------
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups
> for all resources on the system, not just for cpu. For instance, if the
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}},
> the nodemanager will create a cgroup for each container under
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path
> via the {{--cgroup-parent}} command line argument. Docker then creates a
> cgroup for the docker container under that, for instance:
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}}
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is
> good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every
> resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio,
> perf_event, and systemd. So for instance, docker creates
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but
> it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up
> the {{container_id}} cgroups for these other resources. On one of our busy
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to
> repro with current hadoop.