zhihai xu commented on YARN-3491:

I uploaded a new patch, YARN-3491.001.patch, for review.
After thinking about this a bit more, I found that the old patch may add a 
large delay if multiple containers are submitted at the same time.
For example, the following log shows 4 containers submitted within a few 
milliseconds of each other:
2015-04-07 21:42:22,071 INFO 
Container container_e30_1426628374875_110648_01_078264 transitioned from NEW to 
2015-04-07 21:42:22,074 INFO 
Container container_e30_1426628374875_110652_01_093777 transitioned from NEW to 
2015-04-07 21:42:22,076 INFO 
Container container_e30_1426628374875_110668_01_049049 transitioned from NEW to 
2015-04-07 21:42:22,078 INFO 
Container container_e30_1426628374875_110668_01_085183 transitioned from NEW to 
The new patch can overlap the delay with the public localization of the 
previous container, which is a little better and more consistent with the 
behavior of the old code.
It is also better for a container that has only private resources and no 
public resources: in that case, no delay is added to the Dispatcher thread.
Finally, the change in the new patch is a little smaller than in the first patch.

> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3491.000.patch, YARN-3491.001.patch
> Based on profiling, the bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs, which calls checkLocalDir.
> checkLocalDir is very slow, taking 10+ ms per call.
> The total delay is approximately (number of local dirs) * 10+ ms.
> This delay is added to each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of running in parallel (multithreaded), public resource 
> localization is serialized most of the time.
> PublicLocalizer#addResource also runs in the Dispatcher thread, so the 
> Dispatcher thread is blocked by PublicLocalizer#addResource for a long 
> time.
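
To make the arithmetic in the description concrete, here is a rough, self-contained Java sketch. It is not the NodeManager code and not either patch: the class name, method names, and the sample values (12 local dirs, 4 resources) are hypothetical, and only the 10 ms per-check cost comes from the description, as a lower bound. The second part only shows which thread pays for a slow check, depending on whether it runs before the task is submitted or inside the pooled task.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DispatcherDelaySketch {

    // Delay added per public resource: one directory check per local dir.
    static long perResourceDelayMs(int numLocalDirs, long checkCostMs) {
        return (long) numLocalDirs * checkCostMs;
    }

    // Time the submitting thread is blocked when it runs the checks itself
    // for several public resources, one after another.
    static long dispatcherBlockedMs(int numResources, int numLocalDirs, long checkCostMs) {
        return numResources * perResourceDelayMs(numLocalDirs, checkCostMs);
    }

    // Hypothetical stand-in for a slow directory check. It only reports the
    // name of the thread that executed it.
    static String slowDirCheck() {
        return Thread.currentThread().getName();
    }

    // Runs the check once on the calling thread and once inside a pooled task.
    // Returns { thread used for the caller-side check, thread used in the pool }.
    static String[] runPlacementDemo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Caller-side check: the submitting thread pays the cost itself.
            String onCaller = slowDirCheck();
            // Check inside the task: a pool worker pays the cost instead.
            Callable<String> task = DispatcherDelaySketch::slowDirCheck;
            String onWorker = pool.submit(task).get();
            return new String[] { onCaller, onWorker };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumed sample values: 12 local dirs, 10 ms per check, 4 resources.
        System.out.println(perResourceDelayMs(12, 10));      // prints 120
        System.out.println(dispatcherBlockedMs(4, 12, 10));  // prints 480
        String[] threads = runPlacementDemo();
        System.out.println("caller-side check ran on: " + threads[0]);
        System.out.println("in-task check ran on:     " + threads[1]);
    }
}
```

With these assumed numbers, a single thread that performs the checks before each submission is busy for about 480 ms across 4 resources, during which it can hand at most one task at a time to the pool.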

