[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503540#comment-14503540
 ] 

Jason Lowe commented on YARN-3464:
----------------------------------

We shouldn't leave a ContainerLocalizer lingering around if the container is 
ready to be launched, as that wastes node resources and adds extra localizer 
heartbeat processing on the NM that we don't need.  One exception would be if 
we wanted to support localizing new resources while a container is already 
running, but last I checked we don't support that.

IMHO it makes sense to kill the localizer when the container is ready to be 
launched.  If it's not ready to be launched then we may need to (re)localize 
some resource, so it is useful for the localizer to keep running.  So I'd look 
into changing the logic from "kill me when there's no more work in my queue" 
to "kill me when my container is ready to be launched."
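
To make the proposed change concrete, here is a rough sketch of the two 
policies side by side.  This is not the actual NodeManager code: the class, 
the method names, the containerReadyToLaunch flag, and the local 
LocalizerAction enum are all stand-ins I made up for illustration; only the 
pending.isEmpty() check and LocalizerAction.DIE come from the issue 
description.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch only; none of these names are the real NodeManager API.
final class LocalizerPolicySketch {
    // Local stand-in for the action returned to a localizer heartbeat.
    enum LocalizerAction { LIVE, DIE }

    // Current policy from the issue description: die when the queue is empty.
    static LocalizerAction dieWhenQueueEmpty(Queue<String> pending) {
        return pending.isEmpty() ? LocalizerAction.DIE : LocalizerAction.LIVE;
    }

    // Proposed policy: die only once the container is ready to be launched.
    // How "ready" is determined is an assumption left outside this sketch.
    static LocalizerAction dieWhenContainerReady(boolean containerReadyToLaunch) {
        return containerReadyToLaunch ? LocalizerAction.DIE : LocalizerAction.LIVE;
    }

    public static void main(String[] args) {
        // The queue is momentarily empty, but the container still needs a resource.
        Queue<String> pending = new ArrayDeque<>();
        System.out.println(dieWhenQueueEmpty(pending));    // prints DIE
        System.out.println(dieWhenContainerReady(false));  // prints LIVE
    }
}
```

With the second policy an empty queue no longer kills the localizer, so a 
request that arrives late can still be picked up on a later heartbeat.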



> Race condition in LocalizerRunner causes container localization timeout.
> ------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently LocalizerRunner will kill the ContainerLocalizer when the pending 
> list of LocalizerResourceRequestEvents is empty.
> {code}
>       } else if (pending.isEmpty()) {
>         action = LocalizerAction.DIE;
>       }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the 
> ContainerLocalizer due to an empty pending list, this 
> LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the 
> AM due to TASK_TIMEOUT.
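
The sequence described above can be replayed deterministically in a toy 
model.  This is only a sketch under stated assumptions: LocalizerRaceSketch 
and its methods are hypothetical, the pending list is modeled as a queue of 
strings, and a single boolean stands in for the ContainerLocalizer process.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical toy model of the race; not the real LocalizerRunner.
final class LocalizerRaceSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean localizerAlive = true;

    // Stand-in for heartbeat handling: returns false if no localizer is left
    // to heartbeat, which means pending requests are never looked at again.
    boolean heartbeat() {
        if (!localizerAlive) {
            return false;
        }
        if (pending.isEmpty()) {
            localizerAlive = false;  // the "DIE on empty pending list" rule
        } else {
            pending.poll();          // hand one request to the localizer
        }
        return true;
    }

    void addRequest(String resource) {
        pending.add(resource);
    }

    // Requests that can no longer be served because the localizer is gone.
    int strandedRequests() {
        return localizerAlive ? 0 : pending.size();
    }

    public static void main(String[] args) {
        LocalizerRaceSketch runner = new LocalizerRaceSketch();
        runner.heartbeat();               // queue empty, localizer told to die
        runner.addRequest("late-resource");
        runner.heartbeat();               // no localizer, nothing happens
        System.out.println(runner.strandedRequests());  // prints 1
    }
}
```

In this model the late request stays in the queue forever, which corresponds 
to the container remaining in LOCALIZING until the timeout.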



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
