zhihai xu created YARN-3727:
-------------------------------
Summary: For better error recovery, check if the directory exists
before using it for localization.
Key: YARN-3727
URL: https://issues.apache.org/jira/browse/YARN-3727
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
For better error recovery, check if the directory exists before using it for
localization.
We saw the following localization failure, caused by a cache directory that
already existed:
{code}
2015-05-11 18:59:59,756 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null
}, Rename cannot overwrite non empty destination directory
/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
2015-05-11 18:59:59,756 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
Resource
hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar)
transitioned from DOWNLOADING to FAILED
{code}
The root cause of this failure may be a disk failure, a LevelDB operation
failure in {{startResourceLocalization}}/{{finishResourceLocalization}}, or
something else.
I wonder whether we can add error recovery code to avoid the localization
failure by not using the existing cache directories for localization.
The exception was thrown at {{files.rename(dst_work, destDirPath,
Rename.OVERWRITE)}} in {{FSDownload#call}}. As the following code shows, after
the exception the existing cache directory used by {{LocalizedResource}} is
deleted.
{code}
try {
  .........
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
    files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{code}
Since the conflicting local directory is deleted only after the localization
has already failed, I think it would be better to check whether the directory
exists before using it for localization, so that the failure is avoided in the
first place.
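As a rough illustration of the proposed check, here is a minimal sketch that
skips any candidate cache directory already present on disk. It is not the
NodeManager code: the class {{UniqueCacheDirPicker}}, its method
{{nextUnusedDir}}, the numeric ID counter, and the retry limit are all
hypothetical, and it uses {{java.nio.file}} instead of Hadoop's file system
API so that it runs standalone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of YARN): hands out numbered cache
 * directories under a root, skipping any that already exist on disk.
 */
final class UniqueCacheDirPicker {
    // Assumed upper bound on retries so a badly broken disk cannot loop forever.
    private static final int MAX_ATTEMPTS = 1000;

    private final Path cacheRoot;
    private final AtomicLong nextId;

    UniqueCacheDirPicker(Path cacheRoot, long firstId) {
        this.cacheRoot = cacheRoot;
        this.nextId = new AtomicLong(firstId);
    }

    /**
     * Returns the first candidate directory that is confirmed not to exist.
     * The directory itself is not created here; the caller is expected to
     * rename the downloaded work directory onto the returned path.
     */
    Path nextUnusedDir() throws IOException {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            Path candidate =
                cacheRoot.resolve(Long.toString(nextId.getAndIncrement()));
            // Files.notExists is true only when absence is confirmed, so a
            // stale directory (or an unreadable one) is skipped, not reused.
            if (Files.notExists(candidate)) {
                return candidate;
            }
        }
        throw new IOException("No unused cache directory found under "
            + cacheRoot + " after " + MAX_ATTEMPTS + " attempts");
    }
}
```

With stale directories {{21637}} and {{21638}} left under the cache root and
the counter starting at 21637, the first call would return the path ending in
{{21639}}, so the later rename would not hit a non-empty destination.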