[ https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941233#comment-14941233 ]
Jason Lowe commented on YARN-4051:
----------------------------------

Thanks for the patch! Sorry for the delay, as I missed this when it was originally filed.

I'm lukewarm on an event buffering approach since we have to track it and remember to propagate it at all the appropriate times which is a maintenance burden. Would it be simpler if we simply prevented the kill request from coming in too soon? Seems like another way to fix this would be to prevent kill requests from arriving before we're done recovering containers. We could do a similar "try again" response as we do for container start requests while still recovering, and we can postpone finish application processing until after containers are recovered.

However we decide to fix this, there should be a unit test to cover the scenario.

> ContainerKillEvent is lost when container is In New State and is recovering
> ----------------------------------------------------------------------------
>
>                 Key: YARN-4051
>                 URL: https://issues.apache.org/jira/browse/YARN-4051
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: sandflee
>            Assignee: sandflee
>            Priority: Critical
>         Attachments: YARN-4051.01.patch, YARN-4051.02.patch, YARN-4051.03.patch
>
>
> As in YARN-4050, NM event dispatcher is blocked, and container is in New
> state, when we finish application, the container still alive even after NM
> event dispatcher is unblocked.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
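
The alternative suggested in the comment (reject kill requests with a "try again" response while containers are still recovering, and postpone finish-application processing until recovery is done) might look roughly like the sketch below. This is only an illustration and not the NodeManager code or any attached patch: `RecoveryGate`, `NotReadyException`, and every method name here are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the YARN code base.
public class RecoveryGate {

    /** Hypothetical retriable error telling the caller to try again later. */
    public static class NotReadyException extends RuntimeException {
        public NotReadyException(String msg) {
            super(msg);
        }
    }

    private boolean recovering = true;
    private final Set<String> stoppedContainers = new HashSet<>();
    private final List<String> deferredFinishedApps = new ArrayList<>();
    private final List<String> finishedApps = new ArrayList<>();

    /** Kill requests are refused, not buffered, until recovery is done. */
    public synchronized void stopContainer(String containerId) {
        if (recovering) {
            throw new NotReadyException(
                "Still recovering containers, retry stop of " + containerId);
        }
        stoppedContainers.add(containerId);
    }

    /** Finish-application requests are held back while recovery is running. */
    public synchronized void finishApplication(String appId) {
        if (recovering) {
            deferredFinishedApps.add(appId);
            return;
        }
        finishedApps.add(appId);
    }

    /** Called once all containers are recovered; releases deferred work. */
    public synchronized void recoveryCompleted() {
        recovering = false;
        finishedApps.addAll(deferredFinishedApps);
        deferredFinishedApps.clear();
    }

    public synchronized boolean isStopped(String containerId) {
        return stoppedContainers.contains(containerId);
    }

    public synchronized boolean isFinished(String appId) {
        return finishedApps.contains(appId);
    }
}
```

A unit test for the scenario could then assert that a stop request issued before `recoveryCompleted()` is rejected, and that the same request succeeds once it is retried afterwards.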