[
https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458806#comment-16458806
]
Miklos Szegedi commented on YARN-4599:
--------------------------------------
I will provide the patch shortly. Here are the design points of the patch:
* The basic idea is what was discussed above: the patch disables the OOM killer
on the hadoop-yarn cgroup. When the containers together exceed the node limit,
they are all paused instead of being killed by the kernel.
* YARN is notified by an executable that listens to the cgroups OOM Linux
event. This should be very fast. The executable is oom-listener, not
container-executor, because it does not need to run as root. I avoided JNI to
be more defensive on security, and it also makes the executable easier to test.
(See the listener sketch after this list.)
* When YARN receives the notification, it runs a pluggable OOM handler to
resolve the situation. YARN itself is outside the hadoop-yarn cgroup, so it can
run freely while all containers are frozen at this point. Different users may
have different preferences, thus the handler is pluggable.
* The default OOM handler picks the latest container that exceeded its request.
This ensures that it kills a container that has not cost much so far, while
keeping guaranteed containers that play by the rules and use memory within
their limits. It repeats the process until the OOM is resolved. Based on my
experiments the kernel updates the under_oom flag almost instantaneously, so it
kills only as many containers as necessary. (See the handler sketch after this
list.)
* If the default OOM handler cannot pick a container with the logic above, it
kills the latest container, repeating until the OOM is resolved.
* If we are still in OOM with no containers left, an exception is thrown and
the node is brought down. This can be the case if containers leaked processes,
had processes running as another user that cannot be killed by the container
user, or someone put a process into the root hadoop-yarn cgroup.
* The killer is not the normal container cleanup code. The standard behaviour
is to send a SIGTERM to the container PGID and, if it does not respond within
250 milliseconds, to follow up with a SIGKILL. In our case, however, all the
processes are frozen by cgroups, so they cannot respond to a SIGTERM. Because
of this it uses the standard container executor code to send a SIGKILL to the
PGID right away as the container user; the kernel OOM killer would do the same.
This works pretty fast. It walks through all the thread/process IDs in the
tasks file, so that all active PGIDs in the container are found (see the kill
sketch after this list). The current code does not delete standalone processes
that are not a process group leader; if they are not part of one of the
container-local process groups, they may be leaked. It also cannot handle
processes in the container's cgroup that run as a different user than the
container user. This should be rare.
* The code adds a watchdog to measure the time to resolve an OOM situation.
Based on my experiments, resolving an OOM takes 10-160 milliseconds.
* The patch contains documentation to set up and troubleshoot the feature.
* I was able to test it manually, but I have not done large scale and long-haul
tests yet.
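
To make the listener mechanism concrete, here is a minimal sketch on cgroups v1,
assuming the memory controller is mounted at /sys/fs/cgroup/memory and the YARN
hierarchy is named hadoop-yarn. It is not the oom-listener source from the
patch, just the idea: disable the kernel OOM killer on the cgroup, register an
eventfd through cgroup.event_control, and block until the kernel reports an OOM
event.

{code:c}
/* Minimal listener sketch for cgroups v1 (not the patch's oom-listener). */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/memory/hadoop-yarn" /* assumed mount point */

int main(void) {
    /* 1. Disable the kernel OOM killer: processes hitting the limit are
     *    paused instead of killed. */
    int oom_control = open(CG "/memory.oom_control", O_RDWR);
    if (oom_control < 0 || write(oom_control, "1", 1) < 0) {
        perror("memory.oom_control");
        return 1;
    }

    /* 2. Register an eventfd for OOM notifications on this cgroup. */
    int efd = eventfd(0, 0);
    int event_control = open(CG "/cgroup.event_control", O_WRONLY);
    char registration[64];
    snprintf(registration, sizeof(registration), "%d %d", efd, oom_control);
    if (efd < 0 || event_control < 0 ||
        write(event_control, registration, strlen(registration)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    /* 3. Block until the kernel signals an OOM event, then tell the node
     *    manager (here we simply print the event). */
    uint64_t counter;
    while (read(efd, &counter, sizeof(counter)) == sizeof(counter)) {
        printf("OOM event in %s (count %" PRIu64 ")\n", CG, counter);
        fflush(stdout);
    }
    return 0;
}
{code}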
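
The selection policy of the default OOM handler could look like the sketch
below. It is kept in C only for symmetry with the listener sketch; the actual
handler lives inside the node manager in Java. The container struct, the
kill_container() stub and the resolve_oom() entry point are hypothetical names;
only the under_oom flag in memory.oom_control is real.

{code:c}
/* Sketch of the default handler policy: while under_oom, kill the latest
 * container over its request, else the latest container overall. */
#include <stdio.h>
#include <string.h>

struct container {              /* hypothetical bookkeeping record */
    const char *id;
    long long launch_time;      /* larger value == launched later  */
    long long usage_bytes;      /* current memory usage            */
    long long request_bytes;    /* memory the container requested  */
};

/* Hypothetical stub; the real kill walks the container cgroup (see the
 * kill sketch below). */
static void kill_container(const struct container *c) {
    printf("killing container %s\n", c->id);
}

/* Returns 1 while the hadoop-yarn cgroup still reports under_oom 1. */
static int under_oom(void) {
    FILE *f = fopen("/sys/fs/cgroup/memory/hadoop-yarn/memory.oom_control", "r");
    char key[32];
    long long value = 0;
    int result = 0;
    if (!f)
        return 0;
    while (fscanf(f, "%31s %lld", key, &value) == 2)
        if (strcmp(key, "under_oom") == 0)
            result = value != 0;
    fclose(f);
    return result;
}

static void resolve_oom(struct container *containers, int n) {
    while (under_oom()) {
        struct container *victim = NULL;
        /* Prefer the latest container that exceeded its own request. */
        for (int i = 0; i < n; ++i)
            if (containers[i].usage_bytes > containers[i].request_bytes &&
                (!victim || containers[i].launch_time > victim->launch_time))
                victim = &containers[i];
        /* Otherwise fall back to the latest container overall. */
        if (!victim)
            for (int i = 0; i < n; ++i)
                if (!victim || containers[i].launch_time > victim->launch_time)
                    victim = &containers[i];
        if (!victim)
            break;                    /* no containers left: give up     */
        kill_container(victim);
        *victim = containers[--n];    /* forget the victim, re-check OOM */
    }
}

int main(void) {
    struct container demo[] = {
        { "container_01", 100, 4LL << 30, 2LL << 30 }, /* over its request */
        { "container_02", 200, 1LL << 30, 2LL << 30 },
    };
    resolve_oom(demo, 2);
    return 0;
}
{code}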
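
Finally, a sketch of the kill step: walk the container cgroup's tasks file,
collect the distinct process groups and SIGKILL each one, because a SIGTERM
cannot be handled by frozen processes. The real patch performs the kill through
container-executor as the container user; signalling directly and the
kill_frozen_container() name are simplifications here.

{code:c}
/* Sketch: SIGKILL every process group found in a container's cgroup. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_PGIDS 1024

static int kill_frozen_container(const char *container_cgroup) {
    char path[4096];
    snprintf(path, sizeof(path), "%s/tasks", container_cgroup);

    FILE *tasks = fopen(path, "r");
    if (!tasks) {
        perror(path);
        return -1;
    }

    /* Walk every thread/process ID in the tasks file and record the
     * distinct process groups they belong to. */
    pid_t pgids[MAX_PGIDS];
    int count = 0;
    int tid;
    while (fscanf(tasks, "%d", &tid) == 1) {
        pid_t pgid = getpgid((pid_t)tid);
        if (pgid <= 0)
            continue;                 /* the task may already be gone */
        int seen = 0;
        for (int i = 0; i < count; ++i)
            if (pgids[i] == pgid)
                seen = 1;
        if (!seen && count < MAX_PGIDS)
            pgids[count++] = pgid;
    }
    fclose(tasks);

    /* SIGTERM is pointless: the processes are frozen by the cgroup and
     * cannot run a handler, so go straight to SIGKILL, like the kernel
     * OOM killer would. */
    for (int i = 0; i < count; ++i)
        kill(-pgids[i], SIGKILL);
    return count;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <container cgroup path>\n", argv[0]);
        return 1;
    }
    int killed = kill_frozen_container(argv[1]);
    printf("signalled %d process group(s)\n", killed);
    return killed < 0;
}
{code}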
> Set OOM control for memory cgroups
> ----------------------------------
>
> Key: YARN-4599
> URL: https://issues.apache.org/jira/browse/YARN-4599
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.9.0
> Reporter: Karthik Kambatla
> Assignee: Miklos Szegedi
> Priority: Major
> Labels: oct16-medium
> Attachments: YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly
> set OOM control so that containers are not killed as soon as they go over
> their usage. Today, one could set the swappiness to control this, but
> clusters with swap turned off exist.