[
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570652#comment-14570652
]
Lavkesh Lahngir commented on YARN-3591:
---------------------------------------
Thanks [~sunilg] and [~zxu] for the comments and review. I did it slightly
differently: I added newRepairedDirs and newErrorDirs to DirectoryCollection.
In this version checkLocalizedResources(dirsTocheck) takes the list of good
dirs (the local dirs plus the disk-full local dirs).
{code:title=DirectoryCollection.java|borderStyle=solid}
+ private List<String> newErrorDirs;
+ private List<String> newRepariedDirs;
private int numFailures;
@@ -159,6 +161,8 @@ public DirectoryCollection(String[] dirs,
localDirs = new CopyOnWriteArrayList<String>(dirs);
errorDirs = new CopyOnWriteArrayList<String>();
fullDirs = new CopyOnWriteArrayList<String>();
+ newErrorDirs = new CopyOnWriteArrayList<String>();
+ newRepariedDirs = new CopyOnWriteArrayList<String>();
@@ -213,6 +217,20 @@ synchronized int getNumFailures() {
}
/**
+ * @return Recently discovered error dirs
+ */
+ synchronized List<String> getNewErrorDirs() {
+ return newErrorDirs;
+ }
+
+ /**
+ * @return Recently discovered repaired dirs
+ */
+ synchronized List<String> getNewRepairedDirs() {
+ return newRepariedDirs;
+ }
+
@@ -259,6 +277,8 @@ synchronized boolean checkDirs() {
localDirs.clear();
errorDirs.clear();
fullDirs.clear();
+ newRepariedDirs.clear();
+ newErrorDirs.clear();
for (Map.Entry<String, DiskErrorInformation> entry : dirsFailedCheck
.entrySet()) {
@@ -292,6 +312,11 @@ synchronized boolean checkDirs() {
}
Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);
Set<String> postCheckOtherDirs = new HashSet<String>(errorDirs);
+ for (String dir : preCheckGoodDirs) {
+ if (postCheckOtherDirs.contains(dir)) {
+ newErrorDirs.add(dir);
+ }
+ }
for (String dir : preCheckFullDirs) {
if (postCheckOtherDirs.contains(dir)) {
LOG.warn("Directory " + dir + " error "
@@ -304,6 +329,9 @@ synchronized boolean checkDirs() {
LOG.warn("Directory " + dir + " error "
+ dirsFailedCheck.get(dir).message);
}
+ if (localDirs.contains(dir) || postCheckFullDirs.contains(dir)) {
+ newRepariedDirs.add(dir);
+ }
}
{code}
{code:title=LocalDirsHandlerService.java|borderStyle=solid}
+ * @return Recently added error dirs
+ */
+ public List<String> getDiskNewErrorDirs() {
+ return localDirs.getNewErrorDirs();
+ }
+
+ /**
+ * @return Recently added repaired dirs
+ */
+ public List<String> getDiskNewRepairedDirs() {
+ return localDirs.getNewRepairedDirs();
+ }
{code}
{code:title=ResourceLocalizationService.java|borderStyle=solid}
@Override
public void onDirsChanged() {
checkAndInitializeLocalDirs();
+ List<String> dirsTocheck =
+ new ArrayList<String>(dirsHandler.getLocalDirs());
+ dirsTocheck.addAll(dirsHandler.getDiskFullLocalDirs());
+ // checks if resources are present in the dirsTocheck
+ publicRsrc.checkLocalizedResources(dirsTocheck);
for (LocalResourcesTracker tracker : privateRsrc.values()) {
+ tracker.checkLocalizedResources(dirsTocheck);
+ }
+ List<String> newRepairedDirs = dirsHandler.getDiskNewRepairedDirs();
+ // Delete any resources found in the newly repaired Dirs.
+ for (String dir : newRepairedDirs) {
+ cleanUpLocalDir(lfs, delService, dir);
}
+ // Add code here to add errordirs to statestore.
}
};
{code}
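For illustration only, here is a rough sketch of the kind of check that checkLocalizedResources(dirsTocheck) is meant to perform: drop any tracked resource whose local path is not under one of the dirs passed in. The class name LocalizedResourceCheckSketch, the method dropResourcesOutside, and the plain Map of resource key to local path are all hypothetical stand-ins; the real LocalResourcesTracker keeps richer state and is not shown here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a resource tracker; not the YARN implementation.
final class LocalizedResourceCheckSketch {
  private LocalizedResourceCheckSketch() {}

  /**
   * Removes every entry whose local path is not located under one of
   * dirsToCheck and returns the keys of the removed entries.
   * Assumption: localizedPaths maps a resource key to its local path.
   */
  static List<String> dropResourcesOutside(Map<String, String> localizedPaths,
      List<String> dirsToCheck) {
    List<String> removed = new ArrayList<>();
    Iterator<Map.Entry<String, String>> it =
        localizedPaths.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> entry = it.next();
      String key = entry.getKey();
      Path localPath = Paths.get(entry.getValue()).normalize();
      boolean underCheckedDir = false;
      for (String dir : dirsToCheck) {
        // Path.startsWith compares whole path components, so /data1x
        // is not treated as being under /data1.
        if (localPath.startsWith(Paths.get(dir).normalize())) {
          underCheckedDir = true;
          break;
        }
      }
      if (!underCheckedDir) {
        it.remove();
        removed.add(key);
      }
    }
    return removed;
  }
}
```

Comparing path components rather than raw string prefixes avoids matching a sibling directory that merely shares a name prefix with a good dir.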
{code:title=DirectoryCollection.java|borderStyle=solid}
synchronized List<String> getErrorDirs() {
return Collections.unmodifiableList(errorDirs);
}
{code}
We can use getErrorDirs() and keep the result in the NM state store as
suggested; on startup we can then run cleanUpLocalDir on those error dirs.
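To make the bookkeeping easier to follow, the snippet below is a standalone sketch of how the "new error" and "new repaired" lists can be derived by comparing the directory lists from before and after a check. DirTransitionSketch and its method and parameter names are hypothetical and not part of the patch. It also assumes that the repaired-dir test runs over the dirs that were in the error list before the check, which the diff context above does not show in full.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; mirrors the idea in the patch, not its exact code.
final class DirTransitionSketch {
  private DirTransitionSketch() {}

  /** Dirs that were good before the check and are in the error list after. */
  static List<String> newErrorDirs(List<String> preCheckGoodDirs,
      List<String> postCheckErrorDirs) {
    Set<String> errors = new HashSet<>(postCheckErrorDirs);
    List<String> result = new ArrayList<>();
    for (String dir : preCheckGoodDirs) {
      if (errors.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }

  /**
   * Dirs that were in the error list before the check and are usable again
   * afterwards, either as good dirs or as full dirs.
   */
  static List<String> newRepairedDirs(List<String> preCheckErrorDirs,
      List<String> postCheckGoodDirs, List<String> postCheckFullDirs) {
    Set<String> usable = new HashSet<>(postCheckGoodDirs);
    usable.addAll(postCheckFullDirs);
    List<String> result = new ArrayList<>();
    for (String dir : preCheckErrorDirs) {
      if (usable.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }
}
```

As in the diff, a dir that goes from full to error is not counted as a new error dir here; only the good to error transition is.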
> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
> Key: YARN-3591
> URL: https://issues.apache.org/jira/browse/YARN-3591
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Lavkesh Lahngir
> Assignee: Lavkesh Lahngir
> Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> This happens when a resource is localised on a disk and the disk goes bad
> afterwards. The NM keeps the paths of localised resources in memory. When a
> resource is requested, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true, but the file cannot be opened for reading.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good, it should return an array of
> paths with length at least 1.
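The file.list() proposal in the description could look roughly like the sketch below. ResourcePresenceSketch and isParentListable are hypothetical names, and this is not the actual isResourcePresent implementation; it only shows the idea of listing the parent directory instead of relying on file.exists().

```java
import java.io.File;

// Hypothetical sketch of the proposed presence check.
final class ResourcePresenceSketch {
  private ResourcePresenceSketch() {}

  /**
   * Returns true only if the parent directory of the resource can be listed
   * and contains at least one entry.
   */
  static boolean isParentListable(File resource) {
    File parent = resource.getParentFile();
    if (parent == null) {
      return false;
    }
    // File.list() returns null if the path is not a directory or if an
    // I/O error occurs while reading it.
    String[] children = parent.list();
    return children != null && children.length >= 1;
  }
}
```

A stricter variant could also verify that the resource's own name appears in the listing; that goes beyond what the description proposes.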