[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996687#comment-14996687 ]
Jason Lowe commented on YARN-4051:
----------------------------------
If I understand this correctly, we're saying that the problem described in
YARN-4050 is holding up the main event dispatcher and the NM is semi-hung, yet
we want to hurry and register with the ResourceManager before containers have
recovered? Seems to me we need to address the problem described in YARN-4050
if possible (e.g., skip HDFS operations if we recovered at least one container
in the running or completed states, since we know it must have done HDFS init in
the previous NM instance). Otherwise we are hacking around the fact that we
registered too soon and aren't able to properly handle the out-of-order events.
I'd much rather deal with the root cause if possible than patch all the
separate symptoms.
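
To make the suggested check concrete, here is a rough sketch of the idea, not
actual NodeManager code. The class, enum, and method names are all hypothetical
and the container states are simplified; it only shows the rule "skip HDFS init
when at least one recovered container is already running or completed".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names exist in the NodeManager code base.
final class HdfsInitSkipSketch {

    // Simplified stand-in for the state in which a container was recovered.
    enum RecoveredState { NEW, RUNNING, COMPLETED }

    // Returns true if any recovered container already reached the running or
    // completed state, which implies HDFS init was done by the previous NM
    // instance and need not be repeated during recovery.
    static boolean canSkipHdfsInit(List<RecoveredState> recoveredContainers) {
        for (RecoveredState state : recoveredContainers) {
            if (state == RecoveredState.RUNNING
                    || state == RecoveredState.COMPLETED) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // One container was already running: init can be skipped.
        System.out.println(canSkipHdfsInit(
                Arrays.asList(RecoveredState.NEW, RecoveredState.RUNNING)));
        // Only containers in the new state: init cannot be assumed done.
        System.out.println(canSkipHdfsInit(
                Arrays.asList(RecoveredState.NEW)));
    }
}
```

With a check like this the dispatcher would not block on HDFS during recovery
in the common case, so registration would not need to be moved earlier.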
> ContainerKillEvent is lost when container is In New State and is recovering
> ----------------------------------------------------------------------------
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: sandflee
> Assignee: sandflee
> Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch,
> YARN-4051.03.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked while the container is in
> the New state. When we finish the application, the container stays alive even
> after the NM event dispatcher is unblocked.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)