[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119031#comment-15119031 ]

Hadoop QA commented on YARN-4301:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| 0 | mvndep | 0m 42s | Maven dependency ordering for branch |
| +1 | mvninstall | 7m 48s | trunk passed |
| +1 | compile | 1m 48s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 2m 7s | trunk passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 36s | trunk passed |
| +1 | mvnsite | 1m 27s | trunk passed |
| +1 | mvneclipse | 0m 39s | trunk passed |
| +1 | findbugs | 3m 28s | trunk passed |
| +1 | javadoc | 1m 21s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 39s | trunk passed with JDK v1.7.0_91 |
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 13s | the patch passed |
| +1 | compile | 1m 43s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 43s | the patch passed |
| +1 | compile | 2m 6s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 2m 6s | the patch passed |
| -1 | checkstyle | 0m 34s | hadoop-yarn-project/hadoop-yarn: patch generated 14 new + 231 unchanged - 0 fixed = 245 total (was 231) |
| +1 | mvnsite | 1m 23s | the patch passed |
| +1 | mvneclipse | 0m 32s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 7 line(s) with tabs. |
| +1 | xml | 0m 0s | The patch has no ill-formed XML file. |
| -1 | findbugs | 1m 8s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | javadoc | 1m 16s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 32s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 20s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 54s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 2m 57s | hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch pa
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048245#comment-15048245 ]

Tsuyoshi Ozawa commented on YARN-4301:

{quote}
It may change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION. Could we add a timeout to mkdir? If mkdir times out, the disk is treated as a failed disk.
{quote}

+1 for the suggestion by [~sandflee].

> NM disk health checker should have a timeout
>
> Key: YARN-4301
> URL: https://issues.apache.org/jira/browse/YARN-4301
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Akihiro Suda
> Assignee: Akihiro Suda
> Attachments: YARN-4301-1.patch, YARN-4301-2.patch, concept-async-diskchecker.txt
>
> The disk health checker [verifies a disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385] by executing {{mkdir}} and {{rmdir}} periodically.
> If these operations do not return within a moderate timeout, the disk should be marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}.
> I confirmed that current YARN does not have an implicit timeout (on JDK7, Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our fault injector for distributed systems.
> (I'll post the reproduction script shortly.)
> I think we can fix this issue by making [{{NodeHealthCheckerService.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73] return {{false}} if the value of {{this.getLastHealthReportTime()}} is too old.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
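The issue description proposes reporting the node as unhealthy when the last disk check is too old. A minimal sketch of that staleness test follows. It is not the code from either attached patch: the class name {{DiskCheckStaleness}}, the method {{isFresh}}, and the millisecond parameters are hypothetical. It also assumes two choices raised in review on this thread: the configuration is validated once up front rather than at check time, and a negative gap is tolerated instead of aborting.

```java
/**
 * Hypothetical sketch of a staleness check for the last disk health check.
 * Names and units are illustrative assumptions, not the YARN-4301 patch.
 */
final class DiskCheckStaleness {
  private final long allowedGapMs;

  /** Validates the configuration once, the analogue of doing it in serviceInit. */
  DiskCheckStaleness(long intervalMs, long timeoutMs) {
    long allowed = intervalMs + timeoutMs;
    // allowed <= 0 also catches overflow of the sum.
    if (intervalMs <= 0 || timeoutMs < 0 || allowed <= 0) {
      throw new IllegalArgumentException(
          "invalid disk check settings: interval=" + intervalMs
              + ", timeout=" + timeoutMs);
    }
    this.allowedGapMs = allowed;
  }

  /**
   * Returns true if the last check is recent enough to be trusted.
   * A negative gap (for example after a wall-clock adjustment) is treated
   * as fresh here; a real implementation would probably log a warning.
   */
  boolean isFresh(long lastCheckTimeMs, long nowMs) {
    long gap = nowMs - lastCheckTimeMs;
    if (gap < 0) {
      return true;
    }
    return gap <= allowedGapMs;
  }
}
```

With an interval of 1000 ms and a timeout of 500 ms, a check that finished 1500 ms ago still counts as fresh, while one that finished 1501 ms ago does not. Treating a negative gap as fresh is only one possible policy; treating it as stale would be the more conservative alternative.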
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048221#comment-15048221 ]

Tsuyoshi Ozawa commented on YARN-4301:

[~suda] thank you for pointing that out. I have some comments on the v2 patch. Could you address them?

1. About the synchronization of DirectoryCollection: I see the point you made. The change, however, introduces race conditions among the states in the class (localDirs, fullDirs, errorDirs, and numFailures). For example, {{DirectoryCollection.concat(errorDirs, fullDirs)}}, {{createNonExistentDirs}}, and other functions cannot work correctly without synchronization. I think the root cause of the problem is that {{DC.checkDirs}} calls {{DC.testDirs}} while holding the lock. How about releasing the lock before calling {{testDirs}} and reacquiring it afterwards?

{quote}
synchronized DC.getFailedDirs() can be blocked by synchronized DC.checkDirs(), when File.mkdir() (called from DC.checkDirs(), via DC.testDirs()) does not return within a moderate timeout. Hence NodeHealthCheckerService.isHealthy() also gets blocked. So I would like to make DC.getXXXs unsynchronized.
{quote}

2. If the thread is preempted by the OS and moves to another CPU in a multicore environment, the gap can be negative. Hence I prefer not to abort the NodeManager here.

{code:title=NodeHealthCheckerService.java}
+long diskCheckTime = dirsHandler.getLastDisksCheckTime();
+long now = System.currentTimeMillis();
+long gap = now - diskCheckTime;
+if (gap < 0) {
+  throw new AssertionError("implementation error - now=" + now
+      + ", diskCheckTime=" + diskCheckTime);
+}
{code}

3. Please move the validation of the configuration to serviceInit to avoid aborting at runtime.

{code:title=NodeHealthCheckerService.java}
+long allowedGap = this.diskHealthCheckInterval + this.diskHealthCheckTimeout;
+if (allowedGap <= 0) {
+  throw new AssertionError("implementation error - interval=" + this.diskHealthCheckInterval
+      + ", timeout=" + this.diskHealthCheckTimeout);
+}
{code}
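The first review point, releasing the lock around the slow directory test, can be sketched roughly as follows. This is an illustration only, not DirectoryCollection itself: the class {{DirTracker}}, its fields, and the {{Predicate}} standing in for the mkdir/rmdir test are all hypothetical, and the sketch assumes a single checker thread calls {{checkDirs}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for DirectoryCollection; all names are illustrative. */
final class DirTracker {
  private final Object lock = new Object();
  private List<String> goodDirs;
  private List<String> failedDirs = new ArrayList<>();

  DirTracker(List<String> dirs) {
    this.goodDirs = new ArrayList<>(dirs);
  }

  /** The tester stands in for the mkdir/rmdir test and may block for a long time. */
  void checkDirs(Predicate<String> tester) {
    List<String> all;
    synchronized (lock) {
      // 1. Take a snapshot of every known directory while holding the lock.
      all = new ArrayList<>(goodDirs);
      all.addAll(failedDirs);
    }
    // 2. Run the potentially slow test with no lock held.
    List<String> good = new ArrayList<>();
    List<String> failed = new ArrayList<>();
    for (String dir : all) {
      (tester.test(dir) ? good : failed).add(dir);
    }
    synchronized (lock) {
      // 3. Publish both lists together so readers never see a mixed state.
      goodDirs = good;
      failedDirs = failed;
    }
  }

  List<String> getGoodDirs() {
    synchronized (lock) {
      return new ArrayList<>(goodDirs);
    }
  }

  List<String> getFailedDirs() {
    synchronized (lock) {
      return new ArrayList<>(failedDirs);
    }
  }
}
```

The getters stay synchronized, so the lists remain consistent with each other, but they only wait for the short snapshot and publish steps rather than for the disk test. If several threads could call {{checkDirs}} concurrently, the last one to publish would win, so a real change would need to decide how to serialize the checks.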
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047931#comment-15047931 ]

Akihiro Suda commented on YARN-4301:

The warning is for {{concept-async-diskchecker.txt}}, which is just a concept document, not a patch. I didn't know that Yetus treats a {{*.txt}} file as a patch.
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047902#comment-15047902 ]

Tsuyoshi Ozawa commented on YARN-4301:

[~suda] thank you for updating. The warning by findbugs looks related to the change. Could you fix it?
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046489#comment-15046489 ]

Hadoop QA commented on YARN-4301:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| 0 | patch | 0m 10s | The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 8m 15s | trunk passed |
| +1 | compile | 9m 15s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 9m 40s | trunk passed with JDK v1.7.0_85 |
| +1 | checkstyle | 0m 17s | trunk passed |
| +1 | mvnsite | 1m 8s | trunk passed |
| +1 | mvneclipse | 0m 15s | trunk passed |
| +1 | findbugs | 1m 58s | trunk passed |
| +1 | javadoc | 1m 2s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 1m 7s | trunk passed with JDK v1.7.0_85 |
| +1 | mvninstall | 1m 43s | the patch passed |
| +1 | compile | 9m 16s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 9m 16s | the patch passed |
| +1 | compile | 9m 42s | the patch passed with JDK v1.7.0_85 |
| +1 | javac | 9m 42s | the patch passed |
| -1 | checkstyle | 0m 17s | Patch generated 8 new checkstyle issues in hadoop-common-project/hadoop-common (total was 7, now 14). |
| +1 | mvnsite | 1m 6s | the patch passed |
| +1 | mvneclipse | 0m 15s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| -1 | findbugs | 2m 12s | hadoop-common-project/hadoop-common introduced 1 new FindBugs issues. |
| +1 | javadoc | 0m 59s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 1m 11s | the patch passed with JDK v1.7.0_85 |
| -1 | unit | 22m 8s | hadoop-common in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 7m 58s | hadoop-common in the patch passed with JDK v1.7.0_85. |
| -1 | asflicense | 0m 19s | Patch generated 3 ASF License warnings. |
| | | 91m 12s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-common-project/hadoop-common |
| | Should org.apache.hadoop.util.DiskChecker$AsyncMkdirCallable be a _static_ inner class? At DiskCh
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046396#comment-15046396 ]

Akihiro Suda commented on YARN-4301:

Hi [~sandflee], thanks for the comment. I'll add a timeout to mkdir (and rmdir) as in {{concept-async-diskchecker.txt}} (in progress).
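One common way to put a timeout on a blocking filesystem call in Java is to run it on a worker thread and wait on the resulting {{Future}} with a deadline. The sketch below shows that technique only; it is not the content of {{concept-async-diskchecker.txt}}, which is not reproduced in this thread. The class {{TimedDiskProbe}}, the nested {{MkdirRmdirCallable}}, and the method {{probe}} are hypothetical names. The callable is a static nested class, which is also the shape FindBugs asks for when a nested class does not use its enclosing instance.

```java
import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that bounds how long a disk test may take. */
final class TimedDiskProbe {

  /** Static nested class: it keeps no reference to an enclosing instance. */
  static final class MkdirRmdirCallable implements Callable<Boolean> {
    private final File dir;

    MkdirRmdirCallable(File dir) {
      this.dir = dir;
    }

    @Override
    public Boolean call() {
      // Passes only if the probe directory can be created and then removed.
      return dir.mkdir() && dir.delete();
    }
  }

  private final ExecutorService executor;

  TimedDiskProbe(ExecutorService executor) {
    this.executor = executor;
  }

  /** Returns true only if the task reports success within timeoutMs. */
  boolean probe(Callable<Boolean> task, long timeoutMs) {
    Future<Boolean> future = executor.submit(task);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Cancellation interrupts the worker, but a thread stuck inside a
      // filesystem call may not react until the kernel returns.
      future.cancel(true);
      return false;
    } catch (ExecutionException e) {
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A caller would treat a {{false}} result as a failed disk, which matches the suggestion to count a timed-out mkdir as a disk failure. Because a hung worker thread may never finish, the executor should be one that can tolerate losing threads, for example a cached pool rather than a single shared thread.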
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046319#comment-15046319 ]

Hadoop QA commented on YARN-4301:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 8m 32s | trunk passed |
| +1 | compile | 2m 12s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 2m 20s | trunk passed with JDK v1.7.0_85 |
| +1 | checkstyle | 0m 29s | trunk passed |
| +1 | mvnsite | 1m 34s | trunk passed |
| +1 | mvneclipse | 0m 39s | trunk passed |
| +1 | findbugs | 3m 51s | trunk passed |
| +1 | javadoc | 1m 31s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 54s | trunk passed with JDK v1.7.0_85 |
| +1 | mvninstall | 1m 25s | the patch passed |
| +1 | compile | 2m 8s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 2m 8s | the patch passed |
| +1 | compile | 2m 19s | the patch passed with JDK v1.7.0_85 |
| +1 | javac | 2m 19s | the patch passed |
| -1 | checkstyle | 0m 29s | Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 225, now 229). |
| +1 | mvnsite | 1m 33s | the patch passed |
| +1 | mvneclipse | 0m 38s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 14s | the patch passed |
| +1 | javadoc | 1m 32s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 51s | the patch passed with JDK v1.7.0_85 |
| +1 | unit | 0m 25s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 2m 3s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 8m 55s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 0m 26s | hadoop-yarn-api in the patch passed with JDK v1.7.0_85. |
| +1 | unit | 2m 15s | hadoop-yarn-common in the patch passed with JDK v1.7.0_85. |
| +1 | unit | 9m 13s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. |
| {co
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046252#comment-15046252 ]

sandflee commented on YARN-4301:

It may change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION. Could we add a timeout to mkdir? If mkdir times out, the disk is treated as a failed disk.
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013058#comment-15013058 ]

Akihiro Suda commented on YARN-4301:

Thank you for reviewing.

The synchronized [DC.getFailedDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L202] can be blocked by the synchronized [DC.checkDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L255] when File.mkdir() (called from DC.checkDirs(), via [DC.testDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L325]) does not return within a moderate timeout. Hence NodeHealthCheckerService.isHealthy() also gets blocked.

So I would like to make DC.getXXXs unsynchronized.
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012874#comment-15012874 ]

Tsuyoshi Ozawa commented on YARN-4301:

{quote}
removing synchronized block
{quote}

I meant DirectoryCollection's accessors.
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011586#comment-15011586 ]

Tsuyoshi Ozawa commented on YARN-4301:

[~suda] thank you for reporting this issue. The approach of the patch looks good to me overall, except for the removal of the synchronized block. Do you have any reason for removing it? Could you also add the test cases in the following test case?
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003862#comment-15003862 ]

Hadoop QA commented on YARN-4301:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 7s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 3m 21s | trunk passed |
| +1 | compile | 0m 52s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 50s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 25s | trunk passed |
| +1 | mvnsite | 1m 24s | trunk passed |
| +1 | mvneclipse | 0m 40s | trunk passed |
| -1 | findbugs | 1m 19s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. |
| +1 | javadoc | 1m 29s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 4m 0s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 1m 20s | the patch passed |
| +1 | compile | 0m 53s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 53s | the patch passed |
| +1 | compile | 0m 51s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 51s | the patch passed |
| -1 | checkstyle | 0m 29s | Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 223, now 227). |
| +1 | mvnsite | 1m 20s | the patch passed |
| +1 | mvneclipse | 0m 39s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 0s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 12s | the patch passed |
| +1 | javadoc | 1m 41s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 4m 4s | the patch passed with JDK v1.7.0_79 |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 2m 5s | hadoop-yarn-common in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 8m 54s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch passed with JDK v1.7.0_79. |
| +1 | unit | 2m 8s | hadoop-yarn-common in the patch passed with JDK
[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout
[ https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979812#comment-14979812 ]

Akihiro Suda commented on YARN-4301:

Here is the reproduction script: https://github.com/osrg/earthquake/tree/1ceab663baec2b93ee7309b7369ba4f9dcf3a2c2/example/yarn/4301-reproduce

I'll submit a patch to fix the bug later.
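For readers who want to see what "a timeout on the disk check" can mean in code, here is a minimal sketch of an alternative to the staleness check suggested in the issue: run the {{mkdir}}/{{rmdir}} probe on a separate thread and give up waiting after a bound. It is not taken from the attached patch or from DirectoryCollection; the class {{TimedDiskProbe}} and its methods are hypothetical, and only standard JDK calls ({{java.nio.file.Files}}, {{java.util.concurrent}}) are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: a disk probe bounded by a timeout. */
final class TimedDiskProbe {
  private TimedDiskProbe() {}

  /** mkdir followed by rmdir of a uniquely named directory under dir. */
  static void probe(Path dir) throws IOException {
    Path p = dir.resolve("probe-" + UUID.randomUUID());
    Files.createDirectory(p);
    Files.delete(p);
  }

  /**
   * Runs the check on a daemon thread and waits at most the given timeout.
   * Returns true only if the check completed without an exception in time.
   */
  static boolean runWithTimeout(Callable<Void> check, long timeout, TimeUnit unit) {
    ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
      @Override
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r, "disk-probe");
        // Daemon, so a probe stuck in the kernel cannot keep the JVM alive.
        t.setDaemon(true);
        return t;
      }
    });
    Future<Void> future = executor.submit(check);
    try {
      future.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      // Interrupting may not unblock a thread stuck in file I/O;
      // the caller simply stops waiting and treats the disk as bad.
      future.cancel(true);
      return false;
    } catch (ExecutionException e) {
      return false; // the probe itself failed, e.g. with an IOException
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A caller would wrap {{probe(dir)}} in a {{Callable}} and mark the directory bad when {{runWithTimeout}} returns {{false}}. One trade-off of this design is that each hung probe can leave a blocked thread behind, which is a reason to prefer a single long-lived checker thread plus a staleness check if hangs are expected to persist.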