[ 
https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276470#comment-14276470
 ] 

Chengbing Liu commented on YARN-3024:
-------------------------------------

[~xgong] Thanks for reviewing. 
{quote}
on the latest patch, looks like you change the logic for
{quote}
Yes, the logic for handing out resources to be localized has indeed changed.

Previously, {{LocalizerRunner}} did not give the next resource to
{{ContainerLocalizer}} until the previous one had been downloaded.

With this patch, {{LocalizerRunner}} no longer waits for the previous resource
to be downloaded. {{ContainerLocalizer}} handles that by submitting each
download task to its CompletionService, which queues the tasks until they are
executed. The download thread pool behind the CompletionService remains a
single-thread executor.
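
To illustrate the queuing behaviour, here is a minimal standalone sketch. It is
not the code from the patch: the class name {{QueuedDownloadSketch}}, the method
{{downloadAll}} and the "fetched:" placeholder are made up for the example. It
only shows that a CompletionService backed by a single-thread executor accepts
all tasks immediately and still runs them one at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueuedDownloadSketch {

    /** Submits every resource up front; one worker thread processes them serially. */
    static List<String> downloadAll(List<String> resources) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        try {
            for (String r : resources) {
                // submit() returns at once; the task waits in the executor's queue.
                cs.submit(() -> "fetched:" + r); // placeholder for a real download
            }
            List<String> done = new ArrayList<>();
            for (int i = 0; i < resources.size(); i++) {
                // take() blocks until the next task has completed.
                done.add(cs.take().get());
            }
            return done;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(downloadAll(Arrays.asList("a.jar", "b.zip", "c.txt")));
    }
}
```

Because there is only one worker thread, the tasks complete in submission
order, so the caller never has to hold back a resource to keep downloads serial.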

Therefore, {{ContainerLocalizer}} may send multiple {{LocalResourceStatus}}
entries to {{LocalizerRunner}} in one heartbeat. In that case, I think we
should try to find the next resources to localize even when we receive
FETCH_PENDING.

I have tested it on a real cluster. I specified a large archive, which takes a
long time to localize. The results show that the resources were still
localized serially, and that a single heartbeat carried the statuses of
multiple small files (thus reducing the number of heartbeats).

{quote}
Could you fix this format
{quote}
My bad, I will fix this.

> LocalizerRunner should give DIE action when all resources are localized
> -----------------------------------------------------------------------
>
>                 Key: YARN-3024
>                 URL: https://issues.apache.org/jira/browse/YARN-3024
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>         Attachments: YARN-3024.01.patch, YARN-3024.02.patch, 
> YARN-3024.03.patch
>
>
> We have observed that {{LocalizerRunner}} always gives a LIVE action at the
> end of the localization process.
> The problem is that {{findNextResource()}} can return null even when
> {{pending}} was not empty prior to the call. This method removes localized
> resources from {{pending}}, so we should check the return value and give a
> DIE action when it returns null.
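
The check described in the issue can be sketched roughly as follows. This is a
simplified model, not the NodeManager code: {{DieActionSketch}}, the
{{Pending}} entry type and {{nextAction}} are hypothetical, and the stand-in
{{findNextResource}} only mimics the behaviour stated above (it drops entries
that are already localized and returns null when nothing is left to fetch).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DieActionSketch {
    enum Action { LIVE, DIE }

    /** Hypothetical pending entry: a resource name plus whether it is already localized. */
    static final class Pending {
        final String name;
        final boolean localized;
        Pending(String name, boolean localized) {
            this.name = name;
            this.localized = localized;
        }
    }

    /** Stand-in: removes localized entries, returns the first one still needed, or null. */
    static String findNextResource(List<Pending> pending) {
        for (Iterator<Pending> it = pending.iterator(); it.hasNext();) {
            Pending p = it.next();
            if (p.localized) {
                it.remove();
                continue;
            }
            return p.name;
        }
        return null;
    }

    /** Decide from the return value, not from pending.isEmpty() before the call. */
    static Action nextAction(List<Pending> pending) {
        return findNextResource(pending) == null ? Action.DIE : Action.LIVE;
    }

    public static void main(String[] args) {
        List<Pending> pending = new ArrayList<>(Arrays.asList(
                new Pending("a.jar", true), new Pending("b.zip", true)));
        System.out.println(pending.isEmpty());   // false: an isEmpty() check would answer LIVE
        System.out.println(nextAction(pending)); // DIE
    }
}
```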



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
