Varun Saxena commented on YARN-3678:

[~vinodkv], since this issue occurred in one of our customer deployments, let me 
explain what we saw. The NM was being killed at random at one of the sites where 
our Hadoop distribution is deployed. In the logs, we could see the NM being 
killed immediately after {{signalContainer}} was called. The sequence we suspect 
is as follows: 
# LCE sends a SIGTERM to the container and waits for 250 ms.
# Probably within this 250 ms window, the container processes the signal and exits.
# The pid that was assigned to the container can now be taken by some other 
process or thread (threads run as lightweight processes in Linux).
# When LCE then sends a SIGKILL to the same pid, it may actually be sending it 
to another process or thread.
# As we could not find any other reason for the NM going down at random, we 
suspect that a new NM thread took this pid and the SIGKILL was delivered to it, 
which brought the NM down. This is based on suspicion rather than foolproof 
analysis, though. I am not sure how to verify whether this is what actually 
happened.
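
To make the suspected sequence concrete, here is a toy simulation in Python. It is only a sketch: {{ToyPidTable}} and {{simulate_race}} are hypothetical names, and the model simply assumes pids are handed out sequentially and wrap around at {{pid_max}}. It is not the kernel's real allocator and not LCE code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPidTable:
    """Toy model: sequential pid allocation with wrap-around (an assumption)."""
    pid_max: int = 32768
    last_pid: int = 299
    owners: dict = field(default_factory=dict)  # pid -> owner name

    def allocate(self, owner):
        # Scan forward from the last allocated pid, wrapping at pid_max.
        for _ in range(self.pid_max):
            self.last_pid += 1
            if self.last_pid >= self.pid_max:
                self.last_pid = 300  # toy choice: restart above the low pids
            if self.last_pid not in self.owners:
                self.owners[self.last_pid] = owner
                return self.last_pid
        raise RuntimeError("no free pid")

    def exit(self, pid):
        self.owners.pop(pid, None)

    def signal_target(self, pid):
        """Return the owner a signal to `pid` would hit, or None if unused."""
        return self.owners.get(pid)


def simulate_race(pid_max=1000, churn=None):
    table = ToyPidTable(pid_max=pid_max)
    container_pid = table.allocate("container")
    # Steps 1-2: SIGTERM is sent and the container exits inside the wait window.
    table.exit(container_pid)
    # Step 3: other tasks come and go; enough churn wraps the counter around.
    churn = pid_max if churn is None else churn
    for i in range(churn):
        pid = table.allocate("nm-thread-%d" % i)
        if pid == container_pid:
            break  # a new thread now owns the container's old pid
        table.exit(pid)
    # Step 4: the delayed SIGKILL goes to whoever owns the pid now.
    return container_pid, table.signal_target(container_pid)
```

In this model the delayed signal only hits a stranger when enough tasks are created during the wait for the counter to wrap around; with little churn the pid is simply unused and the signal hits nothing.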

Please note that {{pid_max}} in the deployment was {{32768}}.
I am not sure which user owned the process, though. Probably [~gu chi] can shed 
some light on that.
IMHO, an additional check could be added before the kill.
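
One possible form of such a check, sketched in Python purely for illustration (the function names are hypothetical and this is not how LCE or the NM is implemented): record the process start time from {{/proc/<pid>/stat}} when the first signal is sent, and send the second signal only if the pid still reports the same start time.

```python
import os
import signal


def parse_start_time(stat_line):
    """Return field 22 (starttime, in clock ticks) of a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace alone.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is field 3 (state), so field 22 is rest[19]


def read_start_time(pid):
    """Return the start time of `pid`, or None if it cannot be read."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_start_time(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None  # process is gone (or /proc is not available)


def kill_if_same_process(pid, expected_start_time, sig=signal.SIGKILL):
    """Send `sig` only if `pid` still refers to the process seen earlier."""
    current = read_start_time(pid)
    if current is None or current != expected_start_time:
        return False  # process exited, or the pid was reused by something else
    os.kill(pid, sig)
    return True
```

The caller would store {{read_start_time(pid)}} before the SIGTERM and pass it in after the delay. Note that this narrows the window rather than closing it, since the pid could still be reused between the check and the {{os.kill}} call.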

> DelayedProcessKiller may kill other process other than container
> ----------------------------------------------------------------
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: gu-chi
>            Priority: Critical
> Suppose a container has finished and its cleanup runs while the PID file still 
> exists. This triggers signalContainer once more, which kills the process whose 
> pid is in the PID file. As the container has already finished, this PID may by 
> then be occupied by another process, which can cause serious problems.
> As far as I know, my NM was killed unexpectedly, and what I described could be 
> the cause, even if it occurs rarely.

This message was sent by Atlassian JIRA
