Hong Zhiguo commented on YARN-3678:

The event sequence is:
call "SEND SIGTERM"  ->  pid recycle  ->  call "SEND SIGKILL"  ->  check 
process live time (based on current time)

The time between [call "SEND SIGTERM"] and [call "SEND SIGKILL"] is 250ms.
The time between [pid recycle] and [check process live time] may be shorter or 
longer than 250ms. When it is longer than 250ms, there is a chance that we make 
a false positive judgement: the process now holding the recycled pid looks old 
enough to be the container.
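
To make the timing argument concrete, here is a minimal sketch in Java. It is 
not the NodeManager code: the class name, the method looksLikeContainer and the 
timestamps are hypothetical, and the only value taken from the discussion is 
the 250ms delay. It assumes the check is "the process has lived longer than 
the kill delay, so it must be the container".

```java
public class PidRecycleCheck {
    // Delay between SIGTERM and SIGKILL, as stated in the comment (250ms).
    static final long KILL_DELAY_MS = 250;

    // Hypothetical live-time heuristic: a process that has lived longer than
    // the kill delay is assumed to be the original container.
    static boolean looksLikeContainer(long processStartMs, long checkTimeMs) {
        long liveTimeMs = checkTimeMs - processStartMs;
        return liveTimeMs > KILL_DELAY_MS;
    }

    public static void main(String[] args) {
        // Illustrative timeline in milliseconds; SIGTERM is sent at t=0.
        long recycleMs = 100;     // an unrelated process receives the pid
        long promptCheckMs = 260; // check runs shortly after the SIGKILL call
        long lateCheckMs = 400;   // check runs late, e.g. on a loaded node

        // Live time 160ms: the recycled pid is correctly not treated as the container.
        System.out.println(looksLikeContainer(recycleMs, promptCheckMs)); // false

        // Live time 300ms: the unrelated process is misjudged as the container.
        System.out.println(looksLikeContainer(recycleMs, lateCheckMs));   // true
    }
}
```

The second call is the false positive described above: nothing bounds the gap 
between pid recycle and the check, so a live-time threshold measured against 
the current time cannot reliably tell the two processes apart.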

> DelayedProcessKiller may kill other process other than container
> ----------------------------------------------------------------
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: gu-chi
>            Priority: Critical
> Suppose a container finishes and then does its cleanup. The PID file still 
> exists and will trigger signalContainer once, which kills the process with 
> the pid in the PID file. But as the container has already finished, this PID 
> may have been taken by another process, which may cause a serious issue.
> As far as I know, my NM was killed unexpectedly, and what I described could 
> be the cause, even though it occurs rarely.

This message was sent by Atlassian JIRA
