[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2016-01-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119031#comment-15119031
 ] 

Hadoop QA commented on YARN-4301:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 42s {color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 7s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 28s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 39s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s {color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 43s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 34s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 14 new + 231 unchanged - 0 fixed = 245 total (was 231) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 7 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 8s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 16s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 32s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 54s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 57s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-09 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048245#comment-15048245
 ] 

Tsuyoshi Ozawa commented on YARN-4301:
--

{quote}
It may change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION. Could we add a 
timeout to mkdir? If mkdir times out, the disk is treated as a failed disk.
{quote}

+1 for the suggestion by [~sandflee]. 

> NM disk health checker should have a timeout
> 
>
> Key: YARN-4301
> URL: https://issues.apache.org/jira/browse/YARN-4301
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Akihiro Suda
> Assignee: Akihiro Suda
> Attachments: YARN-4301-1.patch, YARN-4301-2.patch, 
> concept-async-diskchecker.txt
>
>
> The disk health checker [verifies a 
> disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385]
>  by executing {{mkdir}} and {{rmdir}} periodically.
> If these operations do not return within a moderate timeout, the disk should be 
> marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}.
> I confirmed that current YARN does not have an implicit timeout (on JDK7, 
> Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our 
> fault injector for distributed systems.
> (I'll post the reproduction script shortly.)
> I think we can fix this issue by making 
> [{{NodeHealthCheckerServer.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73]
>  return {{false}} if the value of {{this.getLastHealthReportTime()}} is too 
> old.
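
The staleness check proposed in the description could look roughly like the sketch below. It is a minimal, self-contained illustration and not code from any of the attached patches: the class {{StaleCheckGuard}} and its members are hypothetical names, and validating the settings in the constructor and treating a negative gap as "just checked" are choices made for this example only.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Hadoop class): unhealthy when the last disk check is too old. */
final class StaleCheckGuard {
    private final long allowedGapMillis;
    private final AtomicLong lastCheckMillis;

    StaleCheckGuard(long intervalMillis, long timeoutMillis, long nowMillis) {
        // Reject bad settings once, up front, rather than failing during a later health query.
        if (intervalMillis <= 0 || timeoutMillis <= 0) {
            throw new IllegalArgumentException(
                "interval=" + intervalMillis + ", timeout=" + timeoutMillis);
        }
        this.allowedGapMillis = intervalMillis + timeoutMillis;
        this.lastCheckMillis = new AtomicLong(nowMillis);
    }

    /** Called by the checker thread whenever a disk check finishes. */
    void recordCheck(long nowMillis) {
        lastCheckMillis.set(nowMillis);
    }

    /** True while the most recent completed check is no older than the allowed gap. */
    boolean isHealthy(long nowMillis) {
        long gap = nowMillis - lastCheckMillis.get();
        // A negative gap (e.g. the wall clock stepped backwards) counts as "just checked".
        return Math.max(gap, 0L) <= allowedGapMillis;
    }

    public static void main(String[] args) {
        StaleCheckGuard guard = new StaleCheckGuard(1000, 500, 0);
        System.out.println(guard.isHealthy(1200)); // true: 1200 ms <= 1500 ms allowed
        System.out.println(guard.isHealthy(2000)); // false: no check recorded for 2000 ms
        guard.recordCheck(2000);
        System.out.println(guard.isHealthy(2100)); // true again after a completed check
    }
}
```

Passing the current time in as a parameter keeps the sketch testable without sleeping; a real service would more likely read a clock internally.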



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-08 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048221#comment-15048221
 ] 

Tsuyoshi Ozawa commented on YARN-4301:
--

[~suda] thank you for pointing that out. I have some comments on the v2 patch. 
Could you address them?

1. About the synchronization of DirectoryCollection: I see your point. The 
change, however, causes a race condition between the fields of the class 
(localDirs, fullDirs, errorDirs, and numFailures). For example, 
{{DirectoryCollection.concat(errorDirs, fullDirs)}}, {{createNonExistentDirs}}, 
and other functions cannot work correctly without synchronization. 

I think the root cause of the problem is that {{DC.checkDirs}} calls 
{{DC.testDirs}} while holding the lock. How about releasing the lock before 
calling {{testDirs}} and reacquiring it afterwards?

{quote}
synchronized DC.getFailedDirs() can be blocked by synchronized DC.checkDirs(), 
when File.mkdir() (called from DC.checkDirs(), via DC.testDirs()) does not 
return in a moderate timeout.
Hence NodeHealthCheckerServer.isHealthy() gets also blocked.
So I would like to make DC.getXXXs unsynchronized.
{quote}

2. If the thread is preempted by the OS and moves to another CPU in a multicore 
environment, the gap can be negative. Hence I would prefer not to abort the 
NodeManager here.
{code:title=NodeHealthCheckerService.java}
+long diskCheckTime = dirsHandler.getLastDisksCheckTime();
+long now = System.currentTimeMillis();
+long gap = now - diskCheckTime;
+if (gap < 0) {
+  throw new AssertionError("implementation error - now=" + now
+  + ", diskCheckTime=" + diskCheckTime);
+}
{code}

3. Please move the validation of the configuration to serviceInit to avoid 
aborting at runtime.
{code:title=NodeHealthCheckerService.java}
+long allowedGap = this.diskHealthCheckInterval + this.diskHealthCheckTimeout;
+if (allowedGap <= 0) {
+  throw new AssertionError("implementation error - interval=" + this.diskHealthCheckInterval
+  + ", timeout=" + this.diskHealthCheckTimeout);
+}
{code}
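
The lock-narrowing idea from point 1 above might look like the following. This is a sketch under assumptions, not {{DirectoryCollection}} itself: {{DirSet}}, its fields, and the {{probe}} parameter are hypothetical, and the slow disk test is passed in as a {{Predicate}} so that the example stays self-contained. The lock is held only to snapshot the directory list and to publish the result, so the accessors can remain synchronized without waiting on disk I/O.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a directory collection; not the Hadoop class. */
final class DirSet {
    private final Object lock = new Object();
    private List<String> goodDirs;                        // guarded by lock
    private List<String> failedDirs = new ArrayList<>();  // guarded by lock

    DirSet(List<String> dirs) {
        this.goodDirs = new ArrayList<>(dirs);
    }

    /** Runs the (possibly slow) probe on every directory without holding the lock. */
    void checkDirs(Predicate<String> probe) {
        List<String> snapshot;
        synchronized (lock) {            // 1. short critical section: copy the state
            snapshot = new ArrayList<>(goodDirs);
            snapshot.addAll(failedDirs);
        }
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : snapshot) {    // 2. slow I/O happens outside the lock
            (probe.test(dir) ? good : failed).add(dir);
        }
        synchronized (lock) {            // 3. publish both lists together
            goodDirs = good;
            failedDirs = failed;
        }
    }

    List<String> getGoodDirs() {
        synchronized (lock) { return new ArrayList<>(goodDirs); }
    }

    List<String> getFailedDirs() {
        synchronized (lock) { return new ArrayList<>(failedDirs); }
    }

    public static void main(String[] args) {
        DirSet dirs = new DirSet(Arrays.asList("/data1", "/data2"));
        dirs.checkDirs(dir -> !dir.equals("/data2"));  // pretend /data2 is broken
        System.out.println(dirs.getGoodDirs() + " " + dirs.getFailedDirs());
    }
}
```

The sketch assumes a single checker thread. If {{checkDirs}} could run concurrently with itself, two runs might publish results out of order, and some additional coordination would be needed.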




[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-08 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047902#comment-15047902
 ] 

Tsuyoshi Ozawa commented on YARN-4301:
--

[~suda] thank you for the update. The findbugs warning looks related to the 
change. Could you fix it?



[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-08 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047931#comment-15047931
 ] 

Akihiro Suda commented on YARN-4301:


The warning is for {{concept-async-diskchecker.txt}}, which is just a concept 
document, not a patch.

I didn't know that Yetus treats {{*.txt}} files as patches.





[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046489#comment-15046489
 ] 

Hadoop QA commented on YARN-4301:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 10s {color} | {color:blue} The patch file was not named according to hadoop's naming conventions. Please see https://wiki.apache.org/hadoop/HowToContribute for instructions. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 15s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 40s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 58s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 2s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 16s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 42s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 42s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 17s {color} | {color:red} Patch generated 8 new checkstyle issues in hadoop-common-project/hadoop-common (total was 7, now 14). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 12s {color} | {color:red} hadoop-common-project/hadoop-common introduced 1 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 8s {color} | {color:red} hadoop-common in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 58s {color} | {color:green} hadoop-common in the patch passed with JDK v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 19s {color} | {color:red} Patch generated 3 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 91m 12s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-common-project/hadoop-common |
|  |  Should org.apache.hadoop.util.DiskChecker$AsyncMkdirCallable be a _static_ inner class?  At DiskChecker.java:inner

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046252#comment-15046252
 ] 

sandflee commented on YARN-4301:


It may change the behaviour of NM_MIN_HEALTHY_DISKS_FRACTION. Could we add a 
timeout to mkdir? If mkdir times out, the disk is treated as a failed disk.
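
One way to bound mkdir with a timeout, as suggested above, is to run it on a separate thread and wait on a {{Future}}. The sketch below is an assumption about how this could be done and is not taken from any attached patch: {{TimedDiskProbe}}, {{runWithTimeout}} and {{probe}} are hypothetical names. It relies only on standard {{java.util.concurrent}} calls.

```java
import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Hadoop): bounds a blocking disk operation with a timeout. */
final class TimedDiskProbe {
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "disk-probe");
        t.setDaemon(true);  // a probe stuck in the kernel must not keep the JVM alive
        return t;
    });

    /** Returns true only if op completes with true within timeoutMillis. */
    boolean runWithTimeout(Callable<Boolean> op, long timeoutMillis) {
        Future<Boolean> future = executor.submit(op);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);  // best effort; a thread blocked in I/O may not react
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false;  // the operation itself threw
        }
    }

    /** Creates and removes a probe directory under dir; false means "treat the disk as failed". */
    boolean probe(File dir, long timeoutMillis) {
        File probeDir = new File(dir, "probe-" + System.nanoTime());
        return runWithTimeout(() -> probeDir.mkdir() && probeDir.delete(), timeoutMillis);
    }

    public static void main(String[] args) {
        TimedDiskProbe probe = new TimedDiskProbe();
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(probe.probe(tmp, 2000));
    }
}
```

A timed-out call does not free the worker thread: if mkdir never returns, that thread stays blocked, so repeated timeouts on a hung disk would accumulate threads unless the caller stops probing that disk.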



[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046319#comment-15046319
 ] 

Hadoop QA commented on YARN-4301:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 12s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 20s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 51s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 31s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 54s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 8s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 19s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 225, now 229). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 51s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 25s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 3s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 55s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 26s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 15s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 13s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} |
| 

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-12-07 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046396#comment-15046396
 ] 

Akihiro Suda commented on YARN-4301:


Hi [~sandflee], thanks for the comment.
I'll add a timeout to mkdir (and rmdir), as in 
{{concept-async-diskchecker.txt}} (in progress).



[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-11-18 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011586#comment-15011586
 ] 

Tsuyoshi Ozawa commented on YARN-4301:
--

[~suda] thank you for reporting this issue. The approach of the patch looks good 
to me overall, except for the removal of the synchronized block. Is there a 
reason for that?

Could you also add the test cases in the following test case?



[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-11-18 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012874#comment-15012874
 ] 

Tsuyoshi Ozawa commented on YARN-4301:
--

{quote}
 removing synchronized block
{quote}

I meant DirectoryCollection's accessors.

> NM disk health checker should have a timeout
> 
>
> Key: YARN-4301
> URL: https://issues.apache.org/jira/browse/YARN-4301
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akihiro Suda
> Attachments: YARN-4301-1.patch
>
>
> The disk health checker [verifies a 
> disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385]
>  by executing {{mkdir}} and {{rmdir}} periodically.
> If these operations does not return in a moderate timeout, the disk should be 
> marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}.
> I confirmed that current YARN does not have an implicit timeout (on JDK7, 
> Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our 
> fault injector for distributed systems.
> (I'll introduce the reproduction script in a while)
> I think we can fix this issue by making 
> [{{NodeHealthCheckerService.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73]
>  return {{false}} if the value of {{this.getLastHealthReportTime()}} is too 
> old.
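
The proposal quoted above (treat the node as unhealthy once the last health report is too old) could look roughly like the sketch below. This is only an illustration under assumptions: HealthService, recordCheck() and maxAgeMillis are hypothetical names, not the real NodeHealthCheckerService API, and the clock is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StaleHealthSketch {
    /** Hypothetical stand-in for the NodeManager health service (not the real YARN class). */
    static final class HealthService {
        private final long maxAgeMillis;            // assumed, configurable staleness limit
        private final AtomicLong lastReportMillis;  // time of the last completed disk check
        private volatile boolean lastResult = true; // outcome of the last completed disk check

        HealthService(long maxAgeMillis, long nowMillis) {
            this.maxAgeMillis = maxAgeMillis;
            this.lastReportMillis = new AtomicLong(nowMillis);
        }

        /** Called by the checker thread each time a disk check actually completes. */
        void recordCheck(boolean healthy, long nowMillis) {
            lastResult = healthy;
            lastReportMillis.set(nowMillis);
        }

        long getLastHealthReportTime() {
            return lastReportMillis.get();
        }

        /** Unhealthy if the last check failed OR no check has completed recently enough. */
        boolean isHealthy(long nowMillis) {
            boolean stale = nowMillis - getLastHealthReportTime() > maxAgeMillis;
            return lastResult && !stale;
        }
    }

    public static void main(String[] args) {
        HealthService service = new HealthService(1000L, 0L);
        service.recordCheck(true, 0L);
        System.out.println("fresh report: " + service.isHealthy(500L));
        System.out.println("stale report: " + service.isHealthy(5000L));
    }
}
```

With this shape a checker thread that hangs inside mkdir never calls recordCheck() again, so the report ages out and isHealthy() turns false without anyone having to interrupt the hung call. Note that isHealthy() must not wait on any lock the checker holds for this to work.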





[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-11-18 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15013058#comment-15013058
 ] 

Akihiro Suda commented on YARN-4301:


Thank you for reviewing.

synchronized 
[DC.getFailedDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L202]
 can be blocked by synchronized 
[DC.checkDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L255],
 when File.mkdir() (called from DC.checkDirs(), via 
[DC.testDirs()|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L325])
 does not return within a reasonable time.

Hence NodeHealthCheckerService.isHealthy() also gets blocked.

So I would like to make the DC.getXXX() accessors unsynchronized.
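
To illustrate the blocking described above, here is a small self-contained sketch (Java 8+). It is not the DirectoryCollection code: Dirs, the two latches and the volatile-snapshot accessor are hypothetical, and publishing an immutable list through a volatile field is just one assumed way to make an accessor safe without the monitor. The patch may take a different approach.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UnsyncAccessorSketch {
    /** Hypothetical, heavily simplified stand-in for DirectoryCollection. */
    static final class Dirs {
        // Immutable snapshot published by the checker; readers do not need the monitor.
        private volatile List<String> failedDirs = Collections.emptyList();
        final CountDownLatch checkStarted = new CountDownLatch(1);
        final CountDownLatch diskResponds = new CountDownLatch(1);

        /** Holds the monitor for as long as the simulated mkdir hangs. */
        synchronized void checkDirs() throws InterruptedException {
            checkStarted.countDown();
            diskResponds.await(); // stands in for a File.mkdir() that does not return
            failedDirs = Collections.singletonList("/data1");
        }

        /** Synchronized accessor: has to wait for the monitor held by checkDirs(). */
        synchronized List<String> getFailedDirsSynchronized() {
            return failedDirs;
        }

        /** Unsynchronized accessor: reads the volatile snapshot directly. */
        List<String> getFailedDirs() {
            return failedDirs;
        }
    }

    /** Returns true if the call completed within the given number of milliseconds. */
    static boolean finishesWithin(ExecutorService pool, Runnable call, long millis)
            throws Exception {
        Future<?> future = pool.submit(call);
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    /** Returns { synchronized accessor finished, unsynchronized accessor finished }. */
    static boolean[] demo() throws Exception {
        Dirs dirs = new Dirs();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            pool.submit(() -> { dirs.checkDirs(); return null; });
            dirs.checkStarted.await(); // the checker now holds the monitor
            boolean syncDone = finishesWithin(pool, dirs::getFailedDirsSynchronized, 200L);
            boolean unsyncDone = finishesWithin(pool, dirs::getFailedDirs, 2000L);
            return new boolean[] { syncDone, unsyncDone };
        } finally {
            dirs.diskResponds.countDown(); // let the simulated disk respond so threads exit
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = demo();
        System.out.println("synchronized accessor returned in time:   " + result[0]);
        System.out.println("unsynchronized accessor returned in time: " + result[1]);
    }
}
```

While the checker is stuck, the synchronized accessor times out and the unsynchronized one returns immediately, which is the behaviour the health service needs in order to notice a stale report.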







[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-11-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003862#comment-15003862
 ] 

Hadoop QA commented on YARN-4301:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 7s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 21s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 52s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 24s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 40s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 19s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 0s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 29s {color} | {color:red} Patch generated 5 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 223, now 227). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s {color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 41s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 4s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 5s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 54s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 8s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_79. 

[jira] [Commented] (YARN-4301) NM disk health checker should have a timeout

2015-10-28 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979812#comment-14979812
 ] 

Akihiro Suda commented on YARN-4301:


Here is the reproduction script: 
https://github.com/osrg/earthquake/tree/1ceab663baec2b93ee7309b7369ba4f9dcf3a2c2/example/yarn/4301-reproduce


I'll submit a patch to fix the bug later.




