[ https://issues.apache.org/jira/browse/YARN-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458806#comment-16458806 ]

Miklos Szegedi commented on YARN-4599:
--------------------------------------

I will provide the patch shortly. Here are the design decisions behind it:
 * The basic idea is what was discussed above. It disables the OOM killer on 
the hadoop-yarn cgroup, so all containers are paused when their combined usage 
exceeds the node limit.
 * YARN is notified by an executable that listens for the Linux cgroups OOM 
event. This should be very fast. The executable is oom-listener, not 
container-executor, because it does not need to run as root. I avoided JNI to 
be more defensive on security, and it also makes the executable easier to 
test.
 * When YARN receives the notification, it runs a pluggable OOM handler to 
resolve the situation. YARN itself is outside the hadoop-yarn cgroup, so it 
can run freely while all containers are frozen. Different users may have 
different preferences, thus the handler is pluggable.
 * The default OOM handler picks the most recently started container that has 
exceeded its request. This ensures that it kills a container that did not cost 
much so far, while keeping guaranteed containers that play by the rules and 
use memory within their limits. It repeats the process until the OOM is 
resolved. Based on my experiments the kernel updates the OOM flag almost 
instantaneously, so it kills only as many containers as necessary.
 * If the default OOM handler cannot pick a container with the logic above, it 
kills the most recently started container, repeating until the OOM is 
resolved.
 * If we are still in OOM without any containers left, an exception is thrown 
and the node is brought down. This can happen if containers leaked processes, 
had processes running as another user that cannot be killed as the container 
user, or if someone put a process into the root hadoop-yarn cgroup.
 * The killer is not the normal container cleanup code. The standard behaviour 
is to send a SIGTERM to the container PGID and, if it does not respond within 
250 milliseconds, follow up with a SIGKILL. In our case, however, all the 
processes are frozen by cgroups, so they cannot respond to a SIGTERM. Because 
of this, the handler uses the standard container executor code to send a 
SIGKILL to the PGID right away as the container user. The kernel OOM killer 
would do the same. This works pretty fast. It walks through all the 
thread/process IDs in the tasks file, so that all active PGIDs in the 
container are found. The current code does not kill standalone processes that 
are not process group leaders; if they are not part of one of the container's 
local process groups, they may be leaked. It also cannot handle processes in 
the container's cgroup that run as a different user than the container user. 
This should be rare.
 * The code adds a watchdog to measure the time to resolve an OOM situation. 
Based on my experiments, resolution takes 10-160 milliseconds.
 * The patch contains documentation to set up and troubleshoot the feature.
 * I was able to test it manually, but I have not done large-scale and 
long-haul tests yet.

> Set OOM control for memory cgroups
> ----------------------------------
>
>                 Key: YARN-4599
>                 URL: https://issues.apache.org/jira/browse/YARN-4599
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.9.0
>            Reporter: Karthik Kambatla
>            Assignee: Miklos Szegedi
>            Priority: Major
>              Labels: oct16-medium
>         Attachments: YARN-4599.sandflee.patch, yarn-4599-not-so-useful.patch
>
>
> YARN-1856 adds memory cgroups enforcing support. We should also explicitly 
> set OOM control so that containers are not killed as soon as they go over 
> their usage. Today, one could set the swappiness to control this, but 
> clusters with swap turned off exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
