[
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501002#comment-14501002
]
zhihai xu commented on YARN-3491:
---------------------------------
I did more profiling of checkLocalDir, and the result really surprised me.
The most time-consuming call is status.getPermission(), not lfs.getFileStatus.
Each status.getPermission() call takes 4 or 5 ms, and checkLocalDir calls
status.getPermission() three times.
That is why checkLocalDir takes 10+ ms.
{code}
  private boolean checkLocalDir(String localDir) {
    Map<Path, FsPermission> pathPermissionMap =
        getLocalDirsPathPermissionsMap(localDir);
    for (Map.Entry<Path, FsPermission> entry : pathPermissionMap.entrySet()) {
      FileStatus status;
      try {
        status = lfs.getFileStatus(entry.getKey());
      } catch (Exception e) {
        String msg =
            "Could not carry out resource dir checks for " + localDir
                + ", which was marked as good";
        LOG.warn(msg, e);
        throw new YarnRuntimeException(msg, e);
      }
      if (!status.getPermission().equals(entry.getValue())) {
        String msg =
            "Permissions incorrectly set for dir " + entry.getKey()
                + ", should be " + entry.getValue() + ", actual value = "
                + status.getPermission();
        LOG.warn(msg);
        throw new YarnRuntimeException(msg);
      }
    }
    return true;
  }
{code}
Then I went deeper into the source code and found out why status.getPermission
takes most of the time:
lfs.getFileStatus returns a RawLocalFileSystem#DeprecatedRawLocalFileStatus,
whose getPermission is:
{code}
    public FsPermission getPermission() {
      if (!isPermissionLoaded()) {
        loadPermissionInfo();
      }
      return super.getPermission();
    }
{code}
So status.getPermission calls loadPermissionInfo when the permission has not
been loaded yet.
Based on the following code, loadPermissionInfo is the bottleneck: it runs
"ls -ld" to get the permission, which is really slow.
{code}
    /// loads permissions, owner, and group from `ls -ld`
    private void loadPermissionInfo() {
      IOException e = null;
      try {
        String output = FileUtil.execCommand(new File(getPath().toUri()),
            Shell.getGetPermissionCommand());
        StringTokenizer t =
            new StringTokenizer(output, Shell.TOKEN_SEPARATOR_REGEX);
        //expected format
        //-rw------- 1 username groupname ...
        String permission = t.nextToken();
        if (permission.length() > FsPermission.MAX_PERMISSION_LENGTH) {
          //files with ACLs might have a '+'
          permission = permission.substring(0,
              FsPermission.MAX_PERMISSION_LENGTH);
        }
        setPermission(FsPermission.valueOf(permission));
        t.nextToken();
        String owner = t.nextToken();
        // If on windows domain, token format is DOMAIN\\user and we want to
        // extract only the user name
        if (Shell.WINDOWS) {
          int i = owner.indexOf('\\');
          if (i != -1)
            owner = owner.substring(i + 1);
        }
        setOwner(owner);
        setGroup(t.nextToken());
      } catch (Shell.ExitCodeException ioe) {
        if (ioe.getExitCode() != 1) {
          e = ioe;
        } else {
          setPermission(null);
          setOwner(null);
          setGroup(null);
        }
      } catch (IOException ioe) {
        e = ioe;
      } finally {
        if (e != null) {
          throw new RuntimeException("Error while running command to get " +
                                     "file permissions : " +
                                     StringUtils.stringifyException(e));
        }
      }
    }
{code}
We should call getPermission as little as possible in the future :)
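To get a feel for the cost of forking a process for each permission lookup,
here is a minimal standalone sketch that reads a directory's permissions two
ways: by running "ls -ld" through ProcessBuilder, and in-process through
java.nio.file.Files.getPosixFilePermissions. The class PermissionLookupSketch
and its two helper methods are hypothetical names made up for this
illustration. It assumes a POSIX file system with "ls" on the PATH, and it is
not the code from the attached patches, only a way to try the comparison
locally.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical helper class, for illustration only (not part of Hadoop).
public class PermissionLookupSketch {

  /** Forks "ls -ld <path>" and returns its first token, e.g. "drwxr-xr-x". */
  public static String permissionViaLs(Path path)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder("ls", "-ld", path.toString())
        .redirectErrorStream(true)
        .start();
    String firstLine;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        process.getInputStream(), StandardCharsets.UTF_8))) {
      firstLine = reader.readLine();
    }
    int exitCode = process.waitFor();
    if (exitCode != 0 || firstLine == null) {
      throw new IOException(
          "ls -ld failed for " + path + ", exit code " + exitCode);
    }
    // The permission string is the first whitespace-separated token.
    return firstLine.trim().split("\\s+")[0];
  }

  /**
   * Reads the nine permission bits in-process, e.g. "rwxr-xr-x".
   * Assumes a POSIX file system; otherwise this call throws
   * UnsupportedOperationException.
   */
  public static String permissionViaNio(Path path) throws IOException {
    return PosixFilePermissions.toString(Files.getPosixFilePermissions(path));
  }

  public static void main(String[] args) throws Exception {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");

    long start = System.nanoTime();
    String viaLs = permissionViaLs(dir);
    long afterLs = System.nanoTime();
    String viaNio = permissionViaNio(dir);
    long afterNio = System.nanoTime();

    System.out.printf("ls -ld  : %s (%d us)%n",
        viaLs, (afterLs - start) / 1000);
    System.out.printf("java.nio: %s (%d us)%n",
        viaNio, (afterNio - afterLs) / 1000);
  }
}
```

The absolute numbers depend on the machine and on JVM warm-up, so a single run
is only a rough indication of how the two lookups compare.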
> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir.
> checkLocalDir is very slow and takes 10+ ms.
> The total delay will be approximately (number of local dirs) * 10+ ms.
> This delay is added to each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully
> utilized. Instead of running in parallel (multithreaded), public resource
> localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread,
> so the Dispatcher thread is blocked by PublicLocalizer#addResource for a
> long time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)