[
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296729#comment-15296729
]
Jason Lowe commented on YARN-4459:
----------------------------------
Sorry to arrive at this late. I agree that we should be killing the session
and not the pid. It's not a perfect solution, but it _drastically_ reduces the
likelihood of the wrong process getting killed. This could be improved upon by
adding a just-before-kill check of some sort and/or proactive cancelling of the
timer when we see the child process exit before the SIGKILL is sent. However,
rather than holding up this significant improvement waiting for those things to
be added, I propose we add this now and further iterate on it in a subsequent
JIRA.
+1 for the patch. Will commit this in a couple of days if there are no
objections.
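
To make the group-versus-pid point concrete, here is a minimal sketch in
Python. It is not the container-executor C code and not the attached patch.
The helper name signal_process_group is hypothetical, and the sketch assumes
the container was started as a session leader, so its process group id equals
the pid of its first process. It signals the whole group and, if the group is
already gone, reports that instead of falling back to the bare pid.

```python
import os
import signal


def signal_process_group(pgid: int, sig: int = signal.SIGKILL) -> bool:
    """Send `sig` to every process in process group `pgid`.

    Returns True if the signal was delivered, False if the group no longer
    exists. There is deliberately no fallback to os.kill(pgid, sig): once
    the group is gone the container has finished, and the same number may
    since have been reused as the pid of an unrelated process.
    """
    if pgid <= 1:
        # Refuse 0 (the caller's own group) and 1 (init's group).
        raise ValueError("refusing to signal process group %d" % pgid)
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        # ESRCH: no such process group, so there is nothing left to kill.
        return False
    return True
```

A pid can still be reused as a new group leader between the decision to kill
and the signal itself, so this only narrows the window; it does not close it.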
> container-executor might kill process wrongly
> ---------------------------------------------
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When 'signal_container_as_user' is called in container-executor, it first
> checks whether the process group exists. If it does not, it kills the
> process itself (if the process exists). This is not reasonable: if the
> process group does not exist, the corresponding container has already
> finished, so killing the process itself just kills the wrong process.
> We saw this happen in our cluster many times. We used the same account to
> start the NM and to submit the app, and container-executor sometimes killed
> the NM (the wrongly killed process might just have been a newly started
> thread that was the NM's child process).