zhihai xu commented on YARN-3491:

Hi [~wilfreds], thanks for the review. A directory can go from bad to good at 
any time, asynchronously to both public and private resource localization. Even 
without my change, this can still happen right after the local and log dirs are 
initialized in the current code. Also, private resource localization 
initializes the local and log dirs per container, not per resource. Our goal is 
to reduce the chance of failure.
bq. Looking over the code there is also a lot of unneeded object creation which 
could be stripped out speeding things up and lowering memory usage.
I profiled PublicLocalizer#addResource. None of the other code took much time; 
the exception is checkLocalDir, which calls getPermission three times. 
getPermission runs the command "ls -ld" to get the permission, which is very 
slow.

But your comment gave me a good idea for a better solution that can save more 
time:
We can call LocalDirsHandlerService#getLastDisksCheckTime to get the timestamp 
of the previous disk check. Using this information, we only need to initialize 
the local and log dirs when the timestamp has changed. With the default 
disk-health-check interval, the timestamp changes only every two minutes, so we 
won't initialize the local and log dirs more than once every two minutes.

    diskHealthCheckInterval = conf.getLong(
    public static final long DEFAULT_NM_DISK_HEALTH_CHECK_INTERVAL_MS = 120000L;
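
A minimal sketch of the timestamp-gating idea above, assuming 
getLastDisksCheckTime returns a long timestamp that changes only when a new 
disk check has run. DiskCheckGate, initializeIfNeeded, and the Runnable standing 
in for the local/log dir initialization are hypothetical names for 
illustration, not NodeManager code.

```java
import java.util.function.LongSupplier;

/**
 * Sketch only: re-run an expensive initialization when, and only when,
 * the last-disk-check timestamp differs from the one seen previously.
 * All names here are hypothetical stand-ins.
 */
class DiskCheckGate {
    // Stand-in for LocalDirsHandlerService#getLastDisksCheckTime (assumed to return a long).
    private final LongSupplier lastDisksCheckTime;
    // Stand-in for the expensive local/log dir initialization.
    private final Runnable initDirs;

    private boolean initializedOnce = false;
    private long seenCheckTime = 0L;

    DiskCheckGate(LongSupplier lastDisksCheckTime, Runnable initDirs) {
        this.lastDisksCheckTime = lastDisksCheckTime;
        this.initDirs = initDirs;
    }

    /** Returns true if the initialization actually ran on this call. */
    synchronized boolean initializeIfNeeded() {
        long current = lastDisksCheckTime.getAsLong();
        if (initializedOnce && current == seenCheckTime) {
            return false; // no new disk check since the last initialization
        }
        initDirs.run();
        // Record the timestamp only after a successful run, so a failed
        // initialization is retried on the next call.
        seenCheckTime = current;
        initializedOnce = true;
        return true;
    }

    public static void main(String[] args) {
        long[] fakeCheckTime = {1_000L}; // simulated disk-check timestamp
        DiskCheckGate gate = new DiskCheckGate(() -> fakeCheckTime[0], () -> { });

        System.out.println(gate.initializeIfNeeded()); // true: first call
        System.out.println(gate.initializeIfNeeded()); // false: timestamp unchanged
        fakeCheckTime[0] += 120_000L;                  // next disk check, two minutes later
        System.out.println(gate.initializeIfNeeded()); // true: timestamp changed
    }
}
```

With this kind of gate, the per-resource cost on the fast path is one 
timestamp read and one comparison; the directory checks run at most once per 
disk-check interval.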

Hi [~jlowe], do you think my new idea is reasonable? I would greatly appreciate 
any feedback on it.

> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3491.000.patch, YARN-3491.001.patch
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir.
> checkLocalDir is very slow, taking 10+ ms per call.
> The total delay will be approximately (number of local dirs) * 10+ ms.
> This delay is added to each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of running in parallel (multithreaded), public resource 
> localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the 
> Dispatcher thread is blocked by PublicLocalizer#addResource for a long time.
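
The delay estimate in the description above can be worked through as a small 
calculation. This is only a sketch: the 12 local dirs and 100 resources in the 
example are hypothetical values, and 10 ms is used as the lower bound of the 
"10+ ms" figure quoted above. The class and method names are illustrative.

```java
/** Sketch only: back-of-the-envelope delay estimate; all names are hypothetical. */
class LocalizationDelayEstimate {

    /** Delay added to one public resource: one checkLocalDir call per local dir. */
    static long perResourceDelayMs(int numLocalDirs, long checkLocalDirMs) {
        return (long) numLocalDirs * checkLocalDirMs;
    }

    /** Time the calling thread is blocked if the resources are handled one after another. */
    static long dispatcherBlockedMs(int numResources, int numLocalDirs, long checkLocalDirMs) {
        return (long) numResources * perResourceDelayMs(numLocalDirs, checkLocalDirMs);
    }

    public static void main(String[] args) {
        int numLocalDirs = 12;      // hypothetical node configuration
        int numResources = 100;     // hypothetical number of public resources
        long checkLocalDirMs = 10L; // lower bound of the "10+ ms" figure

        System.out.println(perResourceDelayMs(numLocalDirs, checkLocalDirMs));                // 120
        System.out.println(dispatcherBlockedMs(numResources, numLocalDirs, checkLocalDirMs)); // 12000
    }
}
```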
