[ https://issues.apache.org/jira/browse/YARN-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659526#comment-16659526 ]
Haibo Chen commented on YARN-8929:
----------------------------------
The patch does three things:
1) To kill a container, read its cgroup.procs file instead of its tasks file to
find the processes that belong to the given container. The tasks file also
contains every thread ID, whereas cgroup.procs lists only process IDs, which
are what we need.
2) Avoid killing a container that is not running yet, because killing a
non-running container won't release any memory to relieve the OOM condition.
Note that from the NM's perspective, a container is running as soon as it has
been sent to the container executor to launch, so the NM may consider a
container running even though its process has not started yet.
3) Always make sure one container is killed successfully. Killing a running
container may not succeed (one reason is given in 2)), so DefaultOOMHandler now
keeps trying to kill containers until one is killed successfully, ensuring that
some memory is released.
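The three changes can be sketched together as follows. This is a minimal Python illustration, not the Java code in YARN-8929.00.patch: it assumes a cgroup v1 hierarchy in which each container has its own directory, it assumes the candidate list is already ordered, and the helper names (`parse_cgroup_procs`, `read_cgroup_procs`, `send_sigkill`, `kill_one_container`) are hypothetical. It also assumes a container counts as "killed successfully" when at least one of its processes received SIGKILL; the real patch may define success differently.

```python
import os
import signal
from typing import Callable, List, Optional, Sequence

# Hypothetical helpers for illustration; not the DefaultOOMHandler API.


def parse_cgroup_procs(text: str) -> List[int]:
    """Parse the contents of a cgroup v1 cgroup.procs file (one PID per line)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def read_cgroup_procs(cgroup_dir: str) -> List[int]:
    """Read process IDs from <cgroup_dir>/cgroup.procs.

    The sibling 'tasks' file would also list every thread ID, which is
    not what we want to signal.
    """
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return parse_cgroup_procs(f.read())


def send_sigkill(pid: int) -> bool:
    """Send SIGKILL; return False if the process no longer exists."""
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except ProcessLookupError:
        return False


def kill_one_container(
    cgroup_dirs: Sequence[str],
    read_pids: Callable[[str], List[int]] = read_cgroup_procs,
    kill: Callable[[int], bool] = send_sigkill,
) -> Optional[str]:
    """Try candidates in order until one container is actually killed.

    A container the NM considers "running" may have no process yet, so a
    missing or empty cgroup.procs means nothing was freed; move on to the
    next candidate instead of giving up.
    """
    for cgroup_dir in cgroup_dirs:
        try:
            pids = read_pids(cgroup_dir)
        except FileNotFoundError:
            continue  # cgroup not created yet: process never launched
        # Signal every process; do not stop at the first failure.
        results = [kill(pid) for pid in pids]
        if any(results):
            return cgroup_dir  # assumed success criterion
    return None  # no candidate could be killed
```

Injecting `read_pids` and `kill` keeps the retry loop testable without a real cgroup hierarchy or real processes.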
> DefaultOOMHandler should only pick running containers to kill upon oom events
> -----------------------------------------------------------------------------
>
> Key: YARN-8929
> URL: https://issues.apache.org/jira/browse/YARN-8929
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.2.0
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Major
> Attachments: YARN-8929.00.patch
>
>
> DefaultOOMHandler currently sorts all known containers based on their
> execution type primarily and their start time secondarily.
> However, it does not check whether a container is running. Killing a
> non-running container will not release any memory, and hence won't get us out
> of the under-OOM state.
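The candidate ordering described in the issue, with the missing running check added, might look like this. It is a hypothetical Python illustration, not NodeManager code: `ContainerInfo`, its `process_started` flag, and the choice to rank OPPORTUNISTIC containers ahead of GUARANTEED ones and newer containers ahead of older ones are assumptions, since the issue only says that execution type is the primary key and start time the secondary key.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ExecutionType(Enum):
    # Assumed ranking: a lower value is killed first.
    OPPORTUNISTIC = 0
    GUARANTEED = 1


@dataclass
class ContainerInfo:
    # Hypothetical stand-in for what the NM tracks per container.
    container_id: str
    execution_type: ExecutionType
    start_time_ms: int
    process_started: bool  # True only once the container's process exists


def kill_candidates(containers: List[ContainerInfo]) -> List[ContainerInfo]:
    """Order containers for OOM killing.

    Drops containers whose process has not started (killing them frees no
    memory), then sorts by execution type and, within a type, by start
    time with the most recently started container first (assumed order).
    """
    running = [c for c in containers if c.process_started]
    return sorted(running, key=lambda c: (c.execution_type.value, -c.start_time_ms))
```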
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)