Jason Lowe commented on YARN-4051:

If I understand this correctly, we're saying that the problem described in 
YARN-4050 is holding up the main event dispatcher and the NM is semi-hung, yet 
we want to hurry and register with the ResourceManager before containers have 
recovered?  Seems to me we need to address the problem described in YARN-4050 
if possible (e.g.: skip HDFS operations if we recovered at least one container 
in the running or completed states since we know it must have done HDFS init in 
the previous NM instance).  Otherwise we are hacking around the fact that we 
registered too soon and aren't able to properly handle the out-of-order events. 
I'd much rather deal with the root cause if possible than patch all the 
separate symptoms.
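
To make the suggestion above concrete, here is a minimal sketch of the check, 
assuming recovery can see the state each container was last recorded in.  All 
names here (RecoverySketch, RecoveredState, canSkipHdfsInit) are hypothetical 
stand-ins for illustration, not the actual NodeManager classes, and the real 
fix may well look different.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: names are illustrative, not the NodeManager's real API.
final class RecoverySketch {

    /** Simplified stand-in for the state a container was recovered in. */
    enum RecoveredState { NEW, RUNNING, COMPLETED }

    // A container recovered in one of these states implies the previous NM
    // instance had already completed HDFS init for its application.
    private static final Set<RecoveredState> PROVES_HDFS_INIT =
        EnumSet.of(RecoveredState.RUNNING, RecoveredState.COMPLETED);

    /** Returns true if at least one recovered container proves HDFS init was done. */
    static boolean canSkipHdfsInit(List<RecoveredState> recoveredContainers) {
        for (RecoveredState state : recoveredContainers) {
            if (PROVES_HDFS_INIT.contains(state)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One running container recovered: HDFS init can be skipped.
        System.out.println(canSkipHdfsInit(
            Arrays.asList(RecoveredState.NEW, RecoveredState.RUNNING)));
        // Nothing recovered: HDFS init is still required.
        System.out.println(canSkipHdfsInit(
            Collections.<RecoveredState>emptyList()));
    }
}
```

With a check like this, recovery would not block the dispatcher on HDFS for 
applications that had already been initialized before the restart, so 
registering with the RM would not need to race ahead of container recovery.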

> ContainerKillEvent is lost when container is  In New State and is recovering
> ----------------------------------------------------------------------------
>                 Key: YARN-4051
>                 URL: https://issues.apache.org/jira/browse/YARN-4051
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: sandflee
>            Assignee: sandflee
>            Priority: Critical
>         Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch
> As in YARN-4050, the NM event dispatcher is blocked and the container is in 
> the New state. When we finish the application, the container stays alive even 
> after the NM event dispatcher is unblocked.

This message was sent by Atlassian JIRA
