[
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499534#comment-14499534
]
zhihai xu commented on YARN-3491:
---------------------------------
Hi [~jlowe], you are right. I am really sorry that all my previous guesses were wrong.
I did some profiling and found that the bottleneck is in the following code:
{code}
getInitializedLocalDirs();
getInitializedLogDirs();
{code}
More precisely, the bottleneck is in checkLocalDir, which calls getFileStatus.
I did two rounds of profiling:
1. I measured the time spent in PublicLocalizer#addResource.
The following code, which includes the levelDB operation, takes 1 ms:
{code}
Path publicRootPath =
    dirsHandler.getLocalPathForWrite("." + Path.SEPARATOR
        + ContainerLocalizer.FILECACHE,
        ContainerLocalizer.getEstimatedSize(resource), true);
Path publicDirDestPath =
    publicRsrc.getPathForLocalization(key, publicRootPath);
if (!publicDirDestPath.getParent().equals(publicRootPath)) {
  DiskChecker.checkDir(new
      File(publicDirDestPath.toUri().getPath()));
}
{code}
getInitializedLocalDirs and getInitializedLogDirs take 12 ms together,
and the following queue.submit code takes less than 1 ms:
{code}
synchronized (pending) {
  pending.put(queue.submit(new FSDownload(lfs, null, conf,
      publicDirDestPath, resource,
      request.getContext().getStatCache())),
      request);
}
{code}
2. Then I measured the time spent in getInitializedLocalDirs and getInitializedLogDirs.
I found that checkLocalDir, which is called by getInitializedLocalDirs, is
really slow.
checkLocalDir takes 14 ms. There is only one local dir in my test environment.
{code}
synchronized private List<String> getInitializedLocalDirs() {
  List<String> dirs = dirsHandler.getLocalDirs();
  List<String> checkFailedDirs = new ArrayList<String>();
  for (String dir : dirs) {
    try {
      checkLocalDir(dir);
    } catch (YarnRuntimeException e) {
      checkFailedDirs.add(dir);
    }
  }
{code}
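For reference, per-section timings like the ones above could be collected with a small helper along these lines. This is only a sketch: SectionTimer and timeMs are hypothetical names, and the comment does not say which profiling method was actually used.

```java
// Hypothetical helper, not part of the NodeManager code base.
final class SectionTimer {

    /** Runs the given section once and returns the elapsed wall-clock time in ms. */
    static long timeMs(Runnable section) {
        long start = System.nanoTime();
        section.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Stand-in for a slow call such as getInitializedLocalDirs().
        long elapsed = timeMs(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("section took " + elapsed + " ms");
    }
}
```

Wrapping each of the three code sections in addResource this way would give the kind of per-section breakdown reported here.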
The log in my previous comment shows more than 10 local dirs, so checkLocalDir
is called more than 10 times for each public resource.
10 * 14 ms is more than 100 ms, which explains where the 100+ ms delay comes from.
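The arithmetic can be written out as a tiny sketch. LocalDirCheckCost and perResourceMs are hypothetical names used only for illustration, and the inputs (10 local dirs, 14 ms per checkLocalDir call) are the rounded figures from this comment.

```java
// Hypothetical helper for illustration; not part of the NodeManager code base.
final class LocalDirCheckCost {

    /** Estimated time one addResource call spends in checkLocalDir, in ms. */
    static long perResourceMs(int localDirs, long checkLocalDirMs) {
        return (long) localDirs * checkLocalDirMs;
    }

    public static void main(String[] args) {
        // 10 local dirs at 14 ms each, as measured above.
        System.out.println(perResourceMs(10, 14) + " ms per public resource"); // prints "140 ms per public resource"
    }
}
```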
I attached a patch, YARN-3491.000.patch, to fix the issue. The patch calls
getInitializedLocalDirs only once for each container.
The original code calls getInitializedLocalDirs for each public resource.
Each container can have hundreds of public resources, which is the situation in
my previous log.
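The effect of moving the check from per-resource to per-container can be illustrated with a small simulation. This is not the patch itself: InitOncePerContainer and its methods are hypothetical stand-ins that only count how often the directory check would run, and 200 resources is an assumed figure standing in for "hundreds".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; the real logic lives in the NodeManager's localization service.
final class InitOncePerContainer {
    static int checkCalls = 0;

    // Stand-in for checkLocalDir: only counts invocations.
    static void checkLocalDir(String dir) {
        checkCalls++;
    }

    // Stand-in for getInitializedLocalDirs: checks every local dir.
    static void initLocalDirs(List<String> dirs) {
        for (String dir : dirs) {
            checkLocalDir(dir);
        }
    }

    /** Original behavior: the dirs are re-checked for every public resource. */
    static int checksPerResource(List<String> dirs, int resources) {
        checkCalls = 0;
        for (int i = 0; i < resources; i++) {
            initLocalDirs(dirs);
            // the download for resource i would be submitted here
        }
        return checkCalls;
    }

    /** Patched behavior: the dirs are checked once per container. */
    static int checksPerContainer(List<String> dirs, int resources) {
        checkCalls = 0;
        initLocalDirs(dirs);
        for (int i = 0; i < resources; i++) {
            // the download for resource i would be submitted here
        }
        return checkCalls;
    }

    public static void main(String[] args) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            dirs.add("/local-dir-" + i); // made-up directory names
        }
        System.out.println(checksPerResource(dirs, 200));  // prints 2000
        System.out.println(checksPerContainer(dirs, 200)); // prints 10
    }
}
```

With 10 local dirs and 200 public resources, the number of checkLocalDir calls drops from 2000 to 10 in this model.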
[~jlowe], could you review it? Thanks.
> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
>
> Improve the public resource localization to do both FSDownload submission to
> the thread pool and completed localization handling in one thread
> (PublicLocalizer).
> Currently FSDownload submission to the thread pool is done in
> PublicLocalizer#addResource which is running in Dispatcher thread and
> completed localization handling is done in PublicLocalizer#run which is
> running in PublicLocalizer thread.
> Because PublicLocalizer#addResource is time consuming, the thread pool can't
> be fully utilized. Instead of doing public resource localization in
> parallel(multithreading), public resource localization is serialized most of
> the time.
> Also there are two more benefits with this change:
> 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource.
> The Dispatcher thread handles most of the time-critical events in the NodeManager.
> 2. No synchronization is needed on the HashMap (pending),
> because pending will only be accessed in the PublicLocalizer thread.