[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175191#comment-14175191 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12675491/apache-yarn-90.10.patch
against trunk revision 3687431.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5434//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5434//console

This message is automatically generated.

> NodeManager should identify failed disks becoming good back again
> ------------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch,
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch,
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch,
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch,
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk
> goes down, it is marked as failed forever. To reuse that disk (after it
> becomes good), NodeManager needs a restart. This JIRA is to improve
> NodeManager to reuse good disks (which could have been bad some time back).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175175#comment-14175175 ] Ming Ma commented on YARN-90:

Thanks Varun. The latest patch LGTM.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172050#comment-14172050 ] Ming Ma commented on YARN-90:

Thanks Varun. You and Jason discussed the disk clean-up scenario. It would be useful to clarify whether the following scenario is resolved by this JIRA or whether a separate JIRA is necessary:

1. A disk becomes read-only, so DiskChecker marks it as DiskErrorCause.OTHER.
2. Later the disk is repaired and becomes good again, but there is still data left on it.
3. Given that this data is from old containers which have already finished, who will clean it up?

Nit: disksTurnedBad's parameter preCheckDirs would be better named preFailedDirs.

In getDisksHealthReport, people can't tell whether a disk failed because it was full or because of a disk error; it might be useful to distinguish the two cases.

Is verifyDirUsingMkdir necessary, given that DiskChecker.checkDir will check it?
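One way to make the distinction Ming Ma asks for is to build the health report from the two error sets separately. A minimal sketch, assuming the DirectoryCollection tracks error dirs and full dirs as separate lists; the method and parameter names here are illustrative, not the patch's actual code:
{code}
import java.util.List;

// Report error disks and full disks separately so operators can tell them apart.
private String getDisksHealthReport(List<String> errorDirs, List<String> fullDirs) {
  StringBuilder report = new StringBuilder();
  if (!errorDirs.isEmpty()) {
    report.append(errorDirs.size()).append(" dir(s) failed with errors: ")
        .append(errorDirs).append("; ");
  }
  if (!fullDirs.isEmpty()) {
    report.append(fullDirs.size()).append(" dir(s) are full: ").append(fullDirs);
  }
  return report.toString();
}
{code}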
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170711#comment-14170711 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12674722/apache-yarn-90.9.patch
against trunk revision 5faaba0.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5386//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5386//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167222#comment-14167222 ] zhihai xu commented on YARN-90:

In the function verifyDirUsingMkdir, the sequence of target.exists(), target.mkdir(), and FileUtils.deleteQuietly(target) is not atomic. What happens if another thread tries to create the same directory (target)?
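A minimal sketch of one race-free alternative, assuming the commons-lang RandomStringUtils already referenced elsewhere in this thread; everything else here is illustrative, not the patch's actual code. Instead of checking exists() first, let mkdir() itself arbitrate, since exactly one caller can successfully create a given path:
{code}
import java.io.File;
import java.io.IOException;
import org.apache.commons.lang.RandomStringUtils;

private static void verifyDirUsingMkdir(File dir) throws IOException {
  for (int attempt = 0; attempt < 10; attempt++) {
    File target = new File(dir, RandomStringUtils.randomAlphanumeric(5));
    if (target.mkdir()) {        // atomic: only one thread can create this path
      if (!target.delete()) {    // clean up the probe directory
        throw new IOException("Could not delete probe directory " + target);
      }
      return;                    // disk accepted a create and delete; looks healthy
    }
    if (!target.exists()) {      // mkdir failed, but not due to a name collision
      throw new IOException("Could not create probe directory " + target);
    }
    // name collision with a concurrent caller: retry with a new random name
  }
  throw new IOException("Could not find a free probe name under " + dir);
}
{code}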
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164214#comment-14164214 ] zhihai xu commented on YARN-90:

I looked at the patch; some nits I found:

1. We can change
{code}
if (!postCheckFullDirs.contains(dir) && postCheckOtherDirs.contains(dir)) {
{code}
to
{code}
if (postCheckOtherDirs.contains(dir)) {
{code}
because postCheckFullDirs and postCheckOtherDirs are mutually exclusive sets.

2. Same as item 1: change
{code}
if (!postCheckOtherDirs.contains(dir) && postCheckFullDirs.contains(dir)) {
{code}
to
{code}
if (postCheckFullDirs.contains(dir)) {
{code}

3. In verifyDirUsingMkdir: can we append an int counter to the file name to avoid looping forever (although the chance is very small), like the following?
{code}
long i = 0L;
while (target.exists()) {
  randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
  target = new File(dir, randomDirName);
}
{code}

4. In disksTurnedBad: can we add a break in the loop once disksFailed is true, so we exit the loop earlier?
{code}
if (!preCheckDirs.contains(dir)) {
  disksFailed = true;
  break;
}
{code}

5. In disksTurnedGood, same as item 4: can we add a break in the loop once disksTurnedGood is true?

Thanks
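Nits 4 and 5 amount to the same early-exit pattern. A sketch under the assumption that both methods compare a pre-check directory list against a post-check list (the signature is illustrative, not the patch's):
{code}
import java.util.List;

// Returns as soon as one newly failed dir is found instead of scanning the rest.
private boolean disksTurnedBad(List<String> preCheckFailedDirs, List<String> failedDirs) {
  for (String dir : failedDirs) {
    if (!preCheckFailedDirs.contains(dir)) {
      return true; // equivalent to setting disksFailed = true and breaking
    }
  }
  return false;
}
{code}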
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158643#comment-14158643 ] Ming Ma commented on YARN-90:

Thanks, Varun. The main question about the UNHEALTHY state is whether this patch might make it more likely for a node to become unhealthy, given that "full disk" has been added as one of the conditions. Given that [~jira.shegalov]'s YARN-1996 and [~sjlee0]'s MAPREDUCE-5817 have suggestions to mitigate the impact of UNHEALTHY nodes on existing containers and MR task scheduling, this might not be an issue.

Nit: for "Set postCheckFullDirs = new HashSet(fullDirs);", it doesn't have to create postCheckFullDirs; it can refer to fullDirs directly later.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155747#comment-14155747 ] Varun Vasudev commented on YARN-90:

The release audit warning is unrelated to the patch.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155745#comment-14155745 ] Hadoop QA commented on YARN-90:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672436/apache-yarn-90.8.patch
against trunk revision dd1b8f2.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release audit warning.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5213//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5213//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5213//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154211#comment-14154211 ] Ming Ma commented on YARN-90:

Thanks, Varun, Jason. A couple of comments:

1. What if a dir transitions from the DISK_FULL state to the OTHER state? DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs properly. We could use a state machine for each dir and make sure every transition is covered; see the sketch after this list.
2. The DISK_FULL state is counted toward the error-disk threshold by LocalDirsHandlerService.areDisksHealthy; later the RM could mark the NM NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we consider the disks healthy if they only stay in DISK_FULL for some short period of time?
3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()])": the (Path[]) cast doesn't seem necessary.
4. What is the intention of numFailures? The method getNumFailures isn't used.
5. Nit: it would be better to expand "import java.util.*;" in DirectoryCollection.java and LocalDirsHandlerService.java.
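A sketch of the per-dir state machine suggested in comment 1. The state names mirror the DISK_FULL/OTHER causes quoted in this thread; the method name and the boolean inputs are purely illustrative assumptions:
{code}
// Each dir carries an explicit state, so every transition -- including
// DISK_FULL -> OTHER -- is decided in one place.
enum DirState { GOOD, DISK_FULL, OTHER }

static DirState nextState(boolean checkPassed, boolean diskFull) {
  if (checkPassed) {
    return DirState.GOOD;               // a failed or full dir turned good again
  }
  return diskFull ? DirState.DISK_FULL  // full, possibly transient
                  : DirState.OTHER;     // read-only, I/O error, etc.
}
{code}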
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153750#comment-14153750 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672125/apache-yarn-90.7.patch
against trunk revision 9582a50.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5187//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5187//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152307#comment-14152307 ] Jason Lowe commented on YARN-90:

Thanks for updating the patch, Varun.

bq. I've changed it to "Disk(s) health report: ". My only concern with this is that there might be scripts looking for the "Disk(s) failed" log line for monitoring. What do you think?

If that's true then the code should bother to do a diff between the old disk list and the new one, logging which disks turned bad using the "Disk(s) failed" line and which disks became healthy with some other log message.

bq. Directories are only cleaned up during startup. The code tests for existence of the directories and the correct permissions. This does mean that container directories left behind for any reason won't get cleaned up until the NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we could try to delete an entire disk tree with active containers on it due to a hiccup when the NM restarts. I think a better approach is a periodic cleanup scan that looks for directories under yarn-local and yarn-logs that shouldn't be there. This could be part of the health check scan or done separately. That way we don't have to wait for a disk to turn good or bad to catch leaked entities on the disk due to some hiccup. Sorta like an fsck for the NM state on disk. That is best done as a separate JIRA, as I think this functionality is still an incremental improvement without it.

Other comments:

- checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.
- isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the free space is under the limit.
- getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments like the other methods.
- Nit: the union utility function doesn't technically perform a union but rather a concatenation, and it'd be a little clearer if the name reflected that. Also, the function knows how big the ArrayList will be after the operations, so it should give the appropriate hint to the constructor to avoid reallocations; see the sketch below.
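A minimal sketch of that last nit, with assumed names (the patch's actual helper may differ): a concatenation helper that says what it does and pre-sizes the result list:
{code}
import java.util.ArrayList;
import java.util.List;

// Concatenates two lists; the capacity hint avoids intermediate reallocations.
static <T> List<T> concat(List<T> first, List<T> second) {
  List<T> result = new ArrayList<T>(first.size() + second.size());
  result.addAll(first);
  result.addAll(second);
  return result;
}
{code}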
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147090#comment-14147090 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671081/apache-yarn-90.6.patch
against trunk revision 3cde37c.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5109//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5109//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146937#comment-14146937 ] Hadoop QA commented on YARN-90:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671047/apache-yarn-90.5.patch
against trunk revision 9fa5a89.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1270 javac compiler warnings (more than the trunk's current 1265 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

    org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
    org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.TestNonAggregatingLogHandler
    org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5107//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5107//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5107//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143974#comment-14143974 ] Jason Lowe commented on YARN-90:

Thanks, Varun! Comments on the latest patch:

It's a bit odd to have a hash map from disk error types to lists of directories and fill them all in, when in practice we only look at one type in the map, DISK_FULL. It'd be simpler (and faster, and less space, since there's no hashmap involved) to just track full disks as a separate collection like we already do for localDirs and failedDirs.

Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match the style of other enum types in the code.

In verifyDirUsingMkdir, if an error occurs during the finally clause then that exception will mask the original exception.

isDiskUsageUnderPercentageLimit is named backwards. Disk usage being under the configured limit shouldn't be a full-disk error, and the error message is inconsistent with the method name (the method talks about being under, but the error message says it's above):
{code}
if (isDiskUsageUnderPercentageLimit(testDir)) {
  msg = "used space above threshold of "
      + diskUtilizationPercentageCutoff
      + "%, removing from the list of valid directories.";
{code}

We should only call getDisksHealthReport() once in the following code:
{code}
+String report = getDisksHealthReport();
+if (!report.isEmpty()) {
+  LOG.info("Disk(s) failed. " + getDisksHealthReport());
{code}

Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't empty? Think of the case where two disks go bad and one is later restored: the health report will still have something, but that last update is a disk turning good, not failing. Before, this code was only called when a new disk failed, and now that's not always the case. Maybe it should just be something like "Disk health update: " instead?

Is it really necessary to stat a directory before we try to delete it? It seems like we can just try to delete it.

The idiom of getting the directories and adding the full directories seems pretty common. It might be good to have dir-handler methods that already do this, like getLocalDirsForCleanup or getLogDirsForCleanup.

I'm a bit worried that getInitializedLocalDirs could potentially try to delete an entire directory tree for a disk. If this fails in some sector-specific way while other containers are using their files from other sectors just fine on the same disk, removing those files from underneath active containers could be very problematic and difficult to debug.
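A sketch of the two logging fixes taken together: one call to getDisksHealthReport(), plus the neutral message Jason suggests, since the change may be a disk turning good. The surrounding names are as quoted above; this is illustrative, not the committed code:
{code}
String report = getDisksHealthReport();   // call once and reuse the result
if (!report.isEmpty()) {
  LOG.info("Disk health update: " + report);
}
{code}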
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140663#comment-14140663 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12669998/apache-yarn-90.4.patch
against trunk revision bf27b9c.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5045//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5045//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140596#comment-14140596 ] Hadoop QA commented on YARN-90:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12669994/apache-yarn-90.3.patch
against trunk revision 6fe5c6b.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warning.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5044//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5044//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5044//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119795#comment-14119795 ] Varun Vasudev commented on YARN-90:

[~yxls123123] the patch needs to be rebased. It's currently causing a merge conflict. Give me a couple of days and I should be able to sort it out.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119194#comment-14119194 ] Xu Yang commented on YARN-90:

Hi, [~vvasudev]. Thanks a lot for your patch. Is it finished? I think this feature is very useful. If it isn't committed, maybe I need to merge the patch manually.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118193#comment-14118193 ] Hadoop QA commented on YARN-90:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12634358/apache-yarn-90.2.patch
against trunk revision 258c7d0.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4799//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932976#comment-13932976 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12634358/apache-yarn-90.2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3344//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3344//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932539#comment-13932539 ] Hadoop QA commented on YARN-90:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12634255/apache-yarn-90.1.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3337//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926199#comment-13926199 ] Xuan Gong commented on YARN-90:

+1 LGTM
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925905#comment-13925905 ] Hadoop QA commented on YARN-90:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12633727/apache-yarn-90.0.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3310//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3310//console

This message is automatically generated.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918439#comment-13918439 ] Ravi Prakash commented on YARN-90:

I'm not working on it. Please feel free to take it over. Thanks Varun
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918405#comment-13918405 ] Varun Vasudev commented on YARN-90:

Ravi, are you still working on this ticket? Do you mind if I take over?
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816232#comment-13816232 ] Ravi Prakash commented on YARN-90:

Thanks for updating the patch, Song! With almost the same changes as Nigel, I was able to get the originally invalid directories to be used again, so the src/main code looks good to me. The one nit I had was that the following blocks can simply be removed:
{code}
} catch (IOException e2) {
  Assert.fail("should not throw an exception");
  Shell.execCommand(Shell.getSetPermissionCommand("755", false, testDir.getAbsolutePath()));
  throw e2;
}
{code}
{code}
catch (InterruptedException e1) {
}
{code}
{code}
} catch (IOException e2) {
  Assert.fail("should not throw an exception");
  throw e2;
}
{code}
{code}
} catch (IOException e) {
  Assert.fail("Service should have thrown an exception while closing");
  throw e;
}
{code}
Other than that, the patch looks good to me. +1. Thanks a lot Nigel and Song!
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814090#comment-13814090 ] Vinod Kumar Vavilapalli commented on YARN-90:

bq. However, I don't quite understand your saying "expose this end-to-end and not just metrics". We have been using the failed-disk metric in our production cluster for a year, and it's good enough for our rapid disk repair. Enlighten me if you have a better way.

I meant that it should be part of the client-side RPC report and JMX as well as the metrics. Doing only one of those is incomplete, so I was suggesting that we do all of that in a separate JIRA.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813811#comment-13813811 ] Hou Song commented on YARN-90:

Thanks for the suggestions. I'm trying to modify my patch and will upload it soon. However, I don't quite understand your saying "expose this end-to-end and not just metrics". We have been using the failed-disk metric in our production cluster for a year, and it's good enough for our rapid disk repair. Enlighten me if you have a better way.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183 ] Vinod Kumar Vavilapalli commented on YARN-90:

Thanks for the patch, Song! Some quick comments:

- Because you are changing the semantics of checkDirs(), there are more changes that are needed.
-- updateDirsAfterFailure() -> updateConfAfterDirListChange?
-- The log message in updateDirsAfterFailure, "Disk(s) failed. ", should be changed to something like "Disk-health report changed: ".
- Web UI and web services are fine for now, I think; nothing to do there.
- Drop the extraneous "System.out.println" lines throughout the patch.
- Let's drop the metrics changes. We need to expose this end-to-end and not just metrics: client-side reports, JMX, and metrics. Worth tracking that effort separately.
- Tests:
-- testAutoDir() -> testDisksGoingOnAndOff?
-- Can you also validate the health report both when disks go off and when they come back again?
-- Also, just throw unwanted exceptions instead of catching them and printing stack traces.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811724#comment-13811724 ] Ravi Prakash commented on YARN-90: --
Apart from the DirectoryCollection changes, I think we should also update LocalDirAllocator.AllocatorPerContext; maybe we should handle that in a separate JIRA. Anyway, I noticed that after this patch, although DirectoryCollection recovered the repaired directories, they were not actually used. I wonder whether it's something wrong with my test procedure or whether we need more changes.
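A simplified model of why the recovered directories might stay unused: the per-context allocator caches its directory list and only rebuilds it when the configured value changes. The names below are paraphrased from LocalDirAllocator.AllocatorPerContext rather than copied from it, so treat this as an assumption about the mechanism, not the actual code:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Simplified model of the directory-list caching in
// LocalDirAllocator.AllocatorPerContext; field and method names are
// paraphrased, not the real code.
class AllocatorContextSketch {
  private final String contextCfgItemName; // e.g. the local-dirs config key
  private String savedLocalDirs = "";
  private String[] localDirs = new String[0];

  AllocatorContextSketch(String contextCfgItemName) {
    this.contextCfgItemName = contextCfgItemName;
  }

  synchronized void confChanged(Configuration conf) {
    String newLocalDirs = conf.get(contextCfgItemName, "");
    // The cached list is rebuilt only when the configured string changes, so
    // a repaired disk stays invisible to the allocator unless something
    // rewrites the conf value after recovery.
    if (!newLocalDirs.equals(savedLocalDirs)) {
      localDirs = newLocalDirs.split(",");
      savedLocalDirs = newLocalDirs;
    }
  }
}
{code}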
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811612#comment-13811612 ] Ravi Prakash commented on YARN-90: --
Hi Song! Thanks a lot for your offer to contribute. It would be great if you could share your patch. Could you also clarify what "tt" you are referring to in "tt also adds a new metric of the number"? I will go ahead and test the pre-existing patch anyway.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810976#comment-13810976 ] Hou Song commented on YARN-90: --
Sorry for the last comment; I meant: for unit tests, I added a test to TestLocalDirsHandlerService, mimicking disk failure with "chmod 000 failed_dir" and disk repair with "chmod 755 failed_dir".
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810961#comment-13810961 ] Hou Song commented on YARN-90: --
Hi guys, I have been using my patch for this issue for a very long time. It enables the NM to reuse failed disks after they come back, and tt also adds a new metric of the number of failed directories so people have a clearer view from outside. For unit tests, I added a test to TestLocalDirsHandlerService, and mimic disk failure by "chmod 000 failed_dir", and mimic disk repair by "chmod 000 failed_dir". If anyone is interested, I can post this patch here.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781905#comment-13781905 ] Ravi Prakash commented on YARN-90: --
Hi nijel! For testing, I would like to configure a USB drive to be one of the local + log dirs. We can then simulate failure by unplugging the USB drive; when we plug it back in, the NM should start using the "recovered" disk. Did you observe this behaviour yourself? I'll also try to test this as soon as I get some cycles.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777530#comment-13777530 ] nijel commented on YARN-90: ---
Hi Ravi, thanks for the comments. The patch is updated accordingly. About the test part: I executed the test cases in the node manager project locally, and all but 2 pass. The failures are not related to the directory service.
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777082#comment-13777082 ] Ravi Prakash commented on YARN-90: --
Hi nijel! Welcome to the community and thanks for your contribution. A few comments:
1. Nit: some lines are over 80 characters long.
2. numFailures is never incremented any more when a directory fails, so getNumFailures() would return the wrong result (see the sketch after this comment).
Could you please also tell us how you tested the patch? There seem to be a lot of unit tests that use LocalDirsHandlerService. Did you run them all and ensure that they still pass? Thanks again.
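A minimal model of the counting bug flagged in point 2, under the assumption that recovered dirs can now leave the failed list: the cumulative counter has to be maintained separately rather than derived from failedDirs.size(). Names loosely mirror DirectoryCollection, and File#canWrite() stands in for the real health probe; this is not the actual patch:
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Names loosely mirror DirectoryCollection; this is not the actual patch.
class DirCollectionSketch {
  private final List<String> localDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();
  private int numFailures; // cumulative count, never decremented

  synchronized void checkDirs() {
    for (String dir : new ArrayList<>(localDirs)) {
      if (!new File(dir).canWrite()) { // simplistic stand-in health probe
        localDirs.remove(dir);
        failedDirs.add(dir);
        numFailures++; // must still be bumped on every failure
      }
    }
  }

  synchronized int getNumFailures() {
    // Not failedDirs.size(): that can shrink again once dirs recover.
    return numFailures;
  }
}
{code}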
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776113#comment-13776113 ] nijel commented on YARN-90: ---
To handle this, we can check the failed dirs first in DirectoryCollection.checkDirs() and add them back to localDirs if the directories have recovered from the error.
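A sketch of that two-pass shape, with File#canWrite() standing in for the real, more thorough disk check DirectoryCollection performs; the class and probe here are simplifying assumptions, not the submitted patch:
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposal above; names are simplified from DirectoryCollection.
class RecoveringDirCollection {
  private final List<String> localDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  synchronized boolean checkDirs() {
    boolean changed = false;
    // First pass: give previously failed dirs a chance to come back.
    for (String dir : new ArrayList<>(failedDirs)) {
      if (new File(dir).canWrite()) {
        failedDirs.remove(dir);
        localDirs.add(dir);
        changed = true;
      }
    }
    // Second pass: the pre-existing failure detection.
    for (String dir : new ArrayList<>(localDirs)) {
      if (!new File(dir).canWrite()) {
        localDirs.remove(dir);
        failedDirs.add(dir);
        changed = true;
      }
    }
    return changed; // callers can refresh conf/report when the lists change
  }
}
{code}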
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730850#comment-13730850 ] Ravi Prakash commented on YARN-90: --
Do we know what we need to do for this JIRA? I can see that in DirectoryCollection we need to be able to remove entries from failedDirs, and LocalDirsHandlerService needs to be able to recognize that fact. Would anything else need to be done?