zhihai xu commented on YARN-3464:

[~kasha], thanks for the information. I just looked at YARN-3024, and yes, it 
will make this issue happen more frequently.
Before YARN-3024, private resources were localized one at a time: the next 
one would not start until the current one had finished localizing, 
so private resource localization took longer.
With YARN-3024, localization is done in parallel, so multiple files can 
be localized at the same time.
This increases the chance that the ContainerLocalizer is killed when the last 
two PRIVATE LocalizerResourceRequestEvents are added.
Yes, your suggestion is also what I had in mind.
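
As a rough illustration of the race, here is a simplified, single-threaded sketch. It is not the actual NodeManager code: the class LocalizerRaceSketch and the methods addResource, heartbeat, and stranded are made up for this example, and the real heartbeat handling does more than pop one entry from the list. It only replays the problematic ordering: the pending list is seen empty, DIE is returned, and a request added afterwards has nobody left to pick it up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the race; not the NodeManager implementation.
class LocalizerRaceSketch {
    enum LocalizerAction { LIVE, DIE }

    // Stand-in for LocalizerRunner's pending list of PRIVATE resource requests.
    private final List<String> pending = new ArrayList<>();
    // Stand-in for whether the ContainerLocalizer process is still running.
    private boolean localizerAlive = true;

    // Stand-in for adding a LocalizerResourceRequestEvent to the runner.
    synchronized void addResource(String resource) {
        pending.add(resource);
    }

    // Stand-in for handling one localizer heartbeat: an empty list means DIE.
    synchronized LocalizerAction heartbeat() {
        if (!localizerAlive) {
            throw new IllegalStateException("localizer already told to die");
        }
        if (pending.isEmpty()) {
            localizerAlive = false;
            return LocalizerAction.DIE;
        }
        pending.remove(0); // hand the next resource to the localizer
        return LocalizerAction.LIVE;
    }

    // Requests that will never be handled because the localizer is gone.
    synchronized int stranded() {
        return localizerAlive ? 0 : pending.size();
    }

    public static void main(String[] args) {
        LocalizerRaceSketch runner = new LocalizerRaceSketch();
        runner.addResource("a.jar");
        System.out.println(runner.heartbeat()); // LIVE: a.jar handed out
        System.out.println(runner.heartbeat()); // DIE: list looked empty
        runner.addResource("b.jar");            // arrives after the DIE decision
        System.out.println(runner.stranded());  // 1: b.jar is never localized
    }
}
```

With parallel localization the window between the "list is empty" check and the arrival of the last requests is easier to hit, which matches the observation above.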

> Race condition in LocalizerRunner causes container localization timeout.
> ------------------------------------------------------------------------
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
> Race condition in LocalizerRunner causes container localization timeout.
> Currently LocalizerRunner will kill the ContainerLocalizer when the pending 
> list for LocalizerResourceRequestEvent is empty.
> {code}
>       } else if (pending.isEmpty()) {
>         action = LocalizerAction.DIE;
>       }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the 
> ContainerLocalizer due to an empty pending list, this 
> LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until the container is killed by 
