zhihai xu commented on YARN-3491:
Hi [~wilfreds], thanks for the review. A directory can go from bad to good at
any time, asynchronously to both public and private resource localization. Even
without my change, this can still happen right after the local and log dirs are
initialized in the current code. Also, private resource localization initializes
the local and log dirs per container, not per resource. Our purpose is to reduce
the chance of failure.
bq. Looking over the code there is also a lot of unneeded object creation which
could be stripped out speeding things up and lowering memory usage.
I profiled PublicLocalizer#addResource. None of the other code took much time;
the cost is in checkLocalDir, which calls getPermission three times.
getPermission runs the command "ls -ld" to get the permission, which is very
slow. But your comment gave me a good idea for a better solution that can save
more time:
We can call LocalDirsHandlerService#getLastDisksCheckTime to get the timestamp
of the previous disk check. Using this information, we only need to initialize
the local and log dirs when the timestamp has changed. By default the timestamp
changes only every two minutes, which means we won't initialize the local and
log dirs more than once every two minutes.
diskHealthCheckInterval = conf.getLong(
public static final long DEFAULT_NM_DISK_HEALTH_CHECK_INTERVAL_MS = 120000L;
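A minimal sketch of the idea, not the actual patch: the class name
TimestampGatedInitializer and its fields are hypothetical, the LongSupplier
stands in for LocalDirsHandlerService#getLastDisksCheckTime, and the Runnable
stands in for the expensive local and log dir initialization.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper: re-runs an expensive initialization only when the
 * disk-check timestamp differs from the one seen at the previous run.
 */
class TimestampGatedInitializer {
    // Stand-in for LocalDirsHandlerService#getLastDisksCheckTime (assumption).
    private final LongSupplier lastDisksCheckTime;
    // Stand-in for initializing the local and log dirs (assumption).
    private final Runnable initializeDirs;

    private boolean initializedOnce = false;
    private long timestampAtLastInit = 0L;

    TimestampGatedInitializer(LongSupplier lastDisksCheckTime,
                              Runnable initializeDirs) {
        this.lastDisksCheckTime = lastDisksCheckTime;
        this.initializeDirs = initializeDirs;
    }

    /**
     * Runs the initialization if it has never run, or if a new disk check
     * has happened since the last run. Returns true if it ran.
     */
    synchronized boolean initializeIfNeeded() {
        long current = lastDisksCheckTime.getAsLong();
        if (initializedOnce && current == timestampAtLastInit) {
            return false; // no disk check since last init; skip the slow path
        }
        initializeDirs.run();
        timestampAtLastInit = current;
        initializedOnce = true;
        return true;
    }
}
```

With the default disk health check interval of 120000 ms, a gate like this
would let the slow path run at most once per interval, however many resources
are added in between.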
Hi [~jlowe], do you think my new idea is reasonable? I would greatly appreciate
any feedback you can give me on it.
> PublicLocalizer#addResource is too slow.
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir.
> checkLocalDir is very slow, taking 10+ ms.
> The total delay will be approximately (number of local dirs) * 10+ ms.
> This delay is added to each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully
> utilized. Instead of running in parallel (multithreaded), public resource
> localization is serialized most of the time.
> In addition, PublicLocalizer#addResource runs in the Dispatcher thread, so
> the Dispatcher thread will be blocked by PublicLocalizer#addResource for a
> long time.