Jun Gong commented on YARN-4536:

[~gu chi] Thanks for explaining it. Yes, we also came across this problem and 
have applied the patch in YARN-4459; it works well now. I explained more in 
that issue's comments. Maybe you could help review it and try it out. Thanks.

> DelayedProcessKiller may not work under heavy workload
> ------------------------------------------------------
>                 Key: YARN-4536
>                 URL: https://issues.apache.org/jira/browse/YARN-4536
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: gu-chi
> I am now facing orphaned container processes. Here is the scenario:
> Under heavy task load, CPU usage on the NM machine can reach almost 100%. When 
> a container receives a kill event, it gets {{SIGTERM}}, and then the 
> parent process exits, leaving the container process to the OS. This container 
> process needs to handle some shutdown events or other logic, but it can hardly 
> get any CPU. We would expect to see a {{SIGKILL}}, since there is a 
> {{DelayedProcessKiller}}, but the parent process, whose pid was persisted as 
> the container pid, no longer exists, so the kill command cannot reach the 
> container process. This is how the orphan container process comes about.
> The orphan process does exit after some time, but the period can be very long 
> and makes the OS state worse. As I observed, the period can be several 
> hours.
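
The scenario described above can be reproduced in miniature with a short Python sketch. This is not YARN code and does not show how DelayedProcessKiller or the patch in YARN-4459 is implemented; the function name, the result keys, and the use of `sleep` as a stand-in for a slow container process are all assumptions made for illustration. It assumes a POSIX system. The point it tries to show is that, once the parent has exited, a signal addressed to the recorded parent pid finds nothing, while a signal addressed to the process group still reaches the surviving child.

```python
import os
import signal
import subprocess
import sys

# Hypothetical launcher: starts a child that ignores SIGTERM (SIG_IGN is
# inherited across exec), reports the child's pid, then just waits.
PARENT_SRC = """
import signal, subprocess, time
child = subprocess.Popen(
    ["sleep", "30"],
    preexec_fn=lambda: signal.signal(signal.SIGTERM, signal.SIG_IGN),
)
print(child.pid, flush=True)
time.sleep(30)
"""


def demo_orphan_kill():
    """Illustrative only: compare a pid-targeted and a group-targeted SIGKILL."""
    # start_new_session=True makes the launcher the leader of a new process group.
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    child_pid = int(parent.stdout.readline())
    pgid = os.getpgid(parent.pid)

    # Step 1: SIGTERM to the whole group. The launcher dies, the child ignores it.
    os.killpg(pgid, signal.SIGTERM)
    parent.wait(timeout=10)

    # The child is now an orphan; signal 0 only checks that the pid exists.
    os.kill(child_pid, 0)
    child_alive = True

    # Step 2a: delayed SIGKILL aimed at the recorded parent pid misses.
    try:
        os.kill(parent.pid, signal.SIGKILL)
        pid_kill_reached = True
    except ProcessLookupError:
        pid_kill_reached = False

    # Step 2b: delayed SIGKILL aimed at the process group still finds the child.
    try:
        os.killpg(pgid, signal.SIGKILL)
        group_kill_delivered = True
    except ProcessLookupError:
        group_kill_delivered = False

    parent.stdout.close()
    return {
        "child_alive_after_parent_exit": child_alive,
        "pid_kill_reached_process": pid_kill_reached,
        "group_kill_delivered": group_kill_delivered,
    }
```

Calling `demo_orphan_kill()` returns a small dictionary with the three observations. The sketch signals the group by its id rather than by the leader's pid because the group id stays valid as long as any member is alive, which is the property the pid-based kill lacks in the scenario above.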

This message was sent by Atlassian JIRA
