Chengbing Liu commented on YARN-3024:

[~xgong] Thanks for reviewing. 
on the latest patch, looks like you change the logic for
Yes, the logic for giving out resources to be localized has actually changed.

Previously, {{LocalizerRunner}} did not give the next resource to 
{{ContainerLocalizer}} until the previous one had been downloaded.

In this patch, {{LocalizerRunner}} no longer waits for the previous resource to 
be downloaded. {{ContainerLocalizer}} handles this by submitting each 
download task to its CompletionService, which queues the tasks until they can 
be executed. The download thread pool of the CompletionService 
remains a single-thread executor, so downloads still run one at a time.
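
The queueing behaviour described above can be illustrated with a minimal, standalone sketch. It is not the actual {{ContainerLocalizer}} code: the class name, the {{download}} stand-in, and the resource names are hypothetical. It only shows that a CompletionService backed by a single-thread executor accepts all tasks immediately but still runs them one after another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerialDownloadSketch {
    // Hypothetical stand-in for a real download; returns the resource name when done.
    static String download(String resource) {
        return resource;
    }

    // Submits every resource up front, then collects results as they complete.
    static List<String> localize(List<String> resources) throws Exception {
        // A single worker thread: queued tasks run serially, in submission order.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        try {
            for (String r : resources) {
                cs.submit(() -> download(r)); // returns immediately; the task is queued
            }
            List<String> done = new ArrayList<>();
            for (int i = 0; i < resources.size(); i++) {
                done.add(cs.take().get()); // blocks until the next task has finished
            }
            return done;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(localize(Arrays.asList("big.tar.gz", "a.txt", "b.txt")));
    }
}
```

Because there is only one worker thread, the completion order matches the submission order, while the caller is free to hand over all resources without waiting for each download.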

Therefore, it is possible for {{ContainerLocalizer}} to send multiple 
{{LocalResourceStatus}} entries to {{LocalizerRunner}} in one heartbeat. In this case, 
I think we should try to find the next resources to be localized even when 

I have tested it on a real cluster. I specified a large archive which should 
take a long time to be localized. The result shows the resources were still 
localized serially, and one heartbeat contained the statuses of multiple small 
files (thus reducing the number of heartbeats).

Could you fix this format
My bad, I will fix this.

> LocalizerRunner should give DIE action when all resources are localized
> -----------------------------------------------------------------------
>                 Key: YARN-3024
>                 URL: https://issues.apache.org/jira/browse/YARN-3024
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>         Attachments: YARN-3024.01.patch, YARN-3024.02.patch, 
> YARN-3024.03.patch
> We have observed that {{LocalizerRunner}} always gives a LIVE action at the 
> end of the localization process.
> The problem is that {{findNextResource()}} can return null even when {{pending}} 
> was not empty prior to the call. This method removes localized resources from 
> {{pending}}, therefore we should check the return value and give a DIE action 
> when it returns null.
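
The check proposed in the issue description can be sketched as follows. This is a simplified illustration, not the NodeManager code: the {{Resource}} class, the {{Action}} enum, and the method bodies are hypothetical stand-ins. The only assumption taken from the description is that {{findNextResource()}} removes already-localized entries from {{pending}} and may return null.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class HeartbeatActionSketch {
    enum Action { LIVE, DIE }

    // Hypothetical stand-in for a pending resource request.
    static final class Resource {
        final String name;
        final boolean alreadyLocalized;

        Resource(String name, boolean alreadyLocalized) {
            this.name = name;
            this.alreadyLocalized = alreadyLocalized;
        }
    }

    // Drops entries that are already localized; returns the next one that
    // still needs downloading, or null if none is left.
    static Resource findNextResource(Queue<Resource> pending) {
        while (!pending.isEmpty()) {
            Resource r = pending.poll();
            if (!r.alreadyLocalized) {
                return r;
            }
        }
        return null;
    }

    // Decides on the return value instead of on pending.isEmpty() before the
    // call, since pending may hold only entries that are already localized.
    static Action nextAction(Queue<Resource> pending) {
        Resource next = findNextResource(pending);
        return next == null ? Action.DIE : Action.LIVE;
    }

    public static void main(String[] args) {
        Queue<Resource> pending = new ArrayDeque<>();
        pending.add(new Resource("already-done.jar", true));
        // pending is non-empty here, yet nothing is left to download.
        System.out.println(nextAction(pending));
    }
}
```

In this sketch a non-empty {{pending}} that contains only localized entries yields DIE, which is the case the description says was previously answered with LIVE.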
