gu-chi created YARN-4536:
----------------------------

             Summary: DelayedProcessKiller may not work under heavy workload
                 Key: YARN-4536
                 URL: https://issues.apache.org/jira/browse/YARN-4536
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.1
            Reporter: gu-chi


I am seeing orphaned container processes. Here is the scenario:
Under heavy task load, CPU usage on the NM machine can reach almost 100%. When a 
container receives a kill event, it is sent {{SIGTERM}}, and the parent 
process then exits, leaving the container process to the OS. The container 
process still needs to handle some shutdown events or other logic, but it can 
hardly get any CPU time. We would expect a {{SIGKILL}} to follow, since there is 
a {{DelayedProcessKiller}}, but the parent process whose pid was persisted as 
the container pid no longer exists, so the kill command cannot reach the 
container process. This is how the orphaned container process arises.
The orphaned process does exit eventually, but this can take a very long time 
and makes the state of the OS worse. In my observation, it can take several 
hours.
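
To make the sequence above concrete, here is a minimal toy model in Python. It is only a sketch, not the NodeManager code: the names {{ProcessTable}}, {{delayed_kill_by_pid}} and {{delayed_kill_by_group}} are hypothetical, and it assumes that the container processes share one process group whose id equals the recorded parent pid. It simulates a process table rather than sending real signals.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    pgid: int
    alive: bool = True


class ProcessTable:
    """Toy stand-in for the OS process table (no real signals are sent)."""

    def __init__(self):
        self.procs = {}

    def spawn(self, pid, pgid):
        self.procs[pid] = Proc(pid, pgid)

    def exit(self, pid):
        self.procs[pid].alive = False

    def is_alive(self, pid):
        proc = self.procs.get(pid)
        return proc is not None and proc.alive

    def kill_group(self, pgid):
        # Simulated SIGKILL to every live member of the process group.
        killed = [p.pid for p in self.procs.values() if p.alive and p.pgid == pgid]
        for pid in killed:
            self.procs[pid].alive = False
        return killed


def delayed_kill_by_pid(table, recorded_pid):
    """Models the reported behavior: if the recorded pid is gone, nothing is killed."""
    if not table.is_alive(recorded_pid):
        return []
    return table.kill_group(table.procs[recorded_pid].pgid)


def delayed_kill_by_group(table, recorded_pid):
    """Hypothetical alternative: signal the group even if its leader has exited."""
    return table.kill_group(recorded_pid)


def scenario():
    table = ProcessTable()
    table.spawn(100, 100)  # parent process; its pid is recorded as the container pid
    table.spawn(101, 100)  # container process, starved of CPU during shutdown
    table.exit(100)        # parent exits on SIGTERM before the delayed SIGKILL
    return table


if __name__ == "__main__":
    t = scenario()
    print("kill by pid reached:", delayed_kill_by_pid(t, 100))
    print("kill by group reached:", delayed_kill_by_group(t, 100))
```

In this model the pid-based kill reaches nothing once the parent has exited, which matches the orphan described above, while the group-based variant still reaches the surviving process. Whether that variant applies to the real NM depends on how the container is actually launched.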



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
