[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941233#comment-14941233 ]
Jason Lowe commented on YARN-4051:
----------------------------------
Thanks for the patch! Sorry for the delay, as I missed this when it was
originally filed.
I'm lukewarm on an event-buffering approach: we would have to track the
buffered event and remember to propagate it at all the appropriate times,
which is a maintenance burden. Would it be simpler to prevent kill requests
from arriving before we're done recovering containers? We could return a
"try again" response, similar to the one we give for container start requests
while still recovering, and postpone finish-application processing until after
containers are recovered.
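
A minimal sketch of that idea, not the actual NodeManager code. The names
RecoveryGate and NotYetRecoveredException are hypothetical and exist only for
illustration. Kill requests are rejected with a retriable error while recovery
is in progress, and application-finish requests are held until recovery
completes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical "try again" error returned to callers during recovery. */
class NotYetRecoveredException extends Exception {
    NotYetRecoveredException(String message) {
        super(message);
    }
}

/** Hypothetical gate in front of kill and finish-application handling. */
class RecoveryGate {
    private boolean recovering = true;
    private final Queue<String> postponedAppFinishes = new ArrayDeque<>();
    private final List<String> killedContainers = new ArrayList<>();
    private final List<String> finishedApps = new ArrayList<>();

    /** Rejects the kill while containers are still being recovered. */
    synchronized void stopContainer(String containerId)
            throws NotYetRecoveredException {
        if (recovering) {
            throw new NotYetRecoveredException(
                "Still recovering containers, try again: " + containerId);
        }
        killedContainers.add(containerId);
    }

    /** Postpones finish-application processing until recovery is done. */
    synchronized void finishApplication(String appId) {
        if (recovering) {
            postponedAppFinishes.add(appId);
            return;
        }
        finishedApps.add(appId);
    }

    /** Called once all containers are recovered; drains postponed work. */
    synchronized void recoveryComplete() {
        recovering = false;
        while (!postponedAppFinishes.isEmpty()) {
            finishedApps.add(postponedAppFinishes.poll());
        }
    }

    synchronized List<String> killedContainers() {
        return new ArrayList<>(killedContainers);
    }

    synchronized List<String> finishedApps() {
        return new ArrayList<>(finishedApps);
    }
}
```

In this sketch the caller is expected to retry stopContainer after a
NotYetRecoveredException, so no kill event has to be buffered and replayed on
the container's behalf.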
However we decide to fix this, there should be a unit test to cover the
scenario.
> ContainerKillEvent is lost when container is In New State and is recovering
> ----------------------------------------------------------------------------
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: sandflee
> Assignee: sandflee
> Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch,
> YARN-4051.03.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in
> the New state. When we finish the application, the container is still alive
> even after the NM event dispatcher is unblocked.