[
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551854#comment-14551854
]
Varun Saxena commented on YARN-3678:
------------------------------------
[~vinodkv], as this issue occurred in one of our customer deployments, I will
explain what we saw. The NM was being killed at random at one of the sites
where the Hadoop distribution is deployed. In the logs, we could see the NM
being killed immediately after {{signalContainer}} is called. The sequence we
suspect is as follows:
# LCE sends a SIGTERM to the container and waits for 250 ms.
# Probably within this 250 ms period, the container processes the signal and
exits gracefully.
# The pid that was assigned to the container may then be taken by some other
process or thread (threads run as lightweight processes in Linux).
# When LCE then sends a SIGKILL to the same pid, it might actually be sending
it to that other process or thread.
# As we could not find any other reason for the NM going down at random, we
suspect that a new NM thread took this pid and the SIGKILL was delivered to
it, which crashed the NM. This is based more on suspicion than on foolproof
analysis, though. I am not sure how to verify whether this is what actually
happened.
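To make the timing in the steps above concrete, here is a minimal Python sketch of a SIGTERM, wait, SIGKILL sequence that skips the SIGKILL when the target has already exited. It is not the LCE code: {{terminate_then_kill}}, {{demo}} and the sleeping child are hypothetical stand-ins, and it only works this simply because the signaller is the direct parent, so the pid stays reserved until it is reaped. The signal path described in this comment only has a pid number to go on, which is where the reuse risk comes from.

```python
import signal
import subprocess
import sys
import time

GRACE_SECONDS = 0.25  # the 250 ms wait described in the steps above


def terminate_then_kill(child: subprocess.Popen) -> str:
    """Send SIGTERM, wait, and send SIGKILL only if the child is still running."""
    child.send_signal(signal.SIGTERM)
    time.sleep(GRACE_SECONDS)
    # poll() reaps the child if it has exited. A non-None result means the
    # process is gone, so signalling its old pid by number would be unsafe.
    if child.poll() is not None:
        return "exited_after_sigterm"
    child.kill()
    child.wait()
    return "killed"


def demo() -> str:
    # Hypothetical stand-in for a container: a child process that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    time.sleep(0.5)  # give the child time to start
    return terminate_then_kill(child)
```

In the demo the child exits on SIGTERM well within the 250 ms window, which is exactly the case in step 2 where a later SIGKILL by pid number would hit whatever owns that pid by then.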
Please note that {{pid_max}} in the deployment was {{32768}}.
I am not sure which user owned the process, though. [~gu chi] can probably
shed some light on that.
IMHO, an additional check can be done here.
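One possible shape for such a check, as a sketch only: record the process start time when the pid is first written, and compare it again just before the second signal. On Linux the start time is field 22 of {{/proc/<pid>/stat}}. The function names below are hypothetical and this is not what any patch for this issue does. It narrows the window but does not close it, since the pid can still be reused between the check and the kill.

```python
from typing import Optional


def parse_start_time(stat_line: str) -> int:
    """Return field 22 (starttime, in clock ticks since boot) of a /proc/<pid>/stat line."""
    # The comm field (field 2) is wrapped in parentheses and may itself contain
    # spaces or ')', so take everything after the last ')'.
    after_comm = stat_line[stat_line.rindex(")") + 2:]
    fields = after_comm.split()
    # fields[0] is field 3 (state), so field 22 sits at index 19.
    return int(fields[19])


def read_start_time(pid: int) -> Optional[int]:
    """Read the start time of pid, or None if no such process exists."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_start_time(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def same_process(pid: int, recorded_start_time: int) -> bool:
    """True only if pid is still the process whose start time was recorded."""
    return read_start_time(pid) == recorded_start_time
```

A caller would store the value from {{read_start_time}} next to the pid and send the SIGKILL only when {{same_process}} still returns true.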
> DelayedProcessKiller may kill other process other than container
> ----------------------------------------------------------------
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.6.0
> Reporter: gu-chi
> Priority: Critical
>
> Suppose a container has finished and is being cleaned up. The PID file still
> exists and will trigger one more signalContainer call, which kills the
> process with the pid in the PID file. But as the container has already
> finished, this PID may be occupied by another process, which may cause
> serious issues.
> As far as I know, my NM was killed unexpectedly, and what I described could
> be the cause, even if it occurs rarely.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)