[
https://issues.apache.org/jira/browse/YARN-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akihiro Suda updated YARN-4301:
-------------------------------
Attachment: YARN-4301-3-fail.patch
Sorry for the long break. I refactored the patch; the new version is {{YARN-4301-3-fail.patch}}.
The patch _fails_ due to an NPE.
This is probably a very basic Java mistake on my part, but unfortunately I can't tell what is wrong.
Could you please take a look?
{panel}
Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.801 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection
testTimeout(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.018 sec <<< ERROR!
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: strange NPE
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:206)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:404)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:282)
    at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testTimeout(TestDirectoryCollection.java:355)
Caused by: java.lang.RuntimeException: strange NPE
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection$AsyncTestDirsCallable.call(DirectoryCollection.java:381)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection$AsyncTestDirsCallable.call(DirectoryCollection.java:348)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection$AsyncTestDirsCallable.call(DirectoryCollection.java:379)
    at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection$AsyncTestDirsCallable.call(DirectoryCollection.java:348)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Results :

Tests in error:
  TestDirectoryCollection.testTimeout:355 ? YarnRuntime java.util.concurrent.Exe...

Tests run: 8, Failures: 0, Errors: 1, Skipped: 0
{panel}
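For readers following the trace: an exception thrown inside {{Callable.call()}} on the executor thread is rethrown from {{Future.get()}} wrapped in an {{ExecutionException}}, which is why the NPE appears as the innermost cause. The sketch below illustrates that general pattern together with a bounded wait. It is not the code in {{YARN-4301-3-fail.patch}}; the class {{DiskCheckTimeoutSketch}}, its methods, and the {{Result}} enum are hypothetical names chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names and structure are assumptions, not the patch.
final class DiskCheckTimeoutSketch {

  enum Result { OK, FAILED, TIMED_OUT }

  /** Runs the check on a worker thread and waits at most timeoutMs for it. */
  static Result runWithTimeout(Callable<Void> check, long timeoutMs)
      throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<Void> future = executor.submit(check);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return Result.OK;
    } catch (TimeoutException e) {
      // Interrupt the worker; a thread blocked in disk I/O may not respond.
      future.cancel(true);
      return Result.TIMED_OUT;
    } catch (ExecutionException e) {
      // getCause() is whatever call() threw on the worker thread.
      System.err.println("check failed: " + e.getCause());
      return Result.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }

  /** A check that creates and then deletes a probe directory under parent. */
  static Callable<Void> mkdirRmdirCheck(final Path parent) {
    return () -> {
      Path probe = Files.createTempDirectory(parent, "probe");
      Files.delete(probe);
      return null;
    };
  }
}
```

Logging {{e.getCause()}} with its full stack trace (rather than rewrapping it with a new message) usually makes it easier to see which reference was null at the failing line.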
> NM disk health checker should have a timeout
> --------------------------------------------
>
> Key: YARN-4301
> URL: https://issues.apache.org/jira/browse/YARN-4301
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Akihiro Suda
> Assignee: Akihiro Suda
> Attachments: YARN-4301-1.patch, YARN-4301-2.patch,
> YARN-4301-3-fail.patch, concept-async-diskchecker.txt
>
>
> The disk health checker [verifies a
> disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385]
> by executing {{mkdir}} and {{rmdir}} periodically.
> If these operations do not return within a reasonable timeout, the disk should be
> marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}.
> I confirmed that current YARN does not have an implicit timeout (on JDK7,
> Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our
> fault injector for distributed systems.
> (I'll post the reproduction script shortly.)
> I think we can fix this issue by making
> [{{NodeHealthCheckerService.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73]
> return {{false}} if the value of {{this.getLastHealthReportTime()}} is too
> old.
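
The staleness idea in the description can be sketched as a pure function. This is only an illustration under assumptions: {{StaleReportSketch}}, the parameter names, and the {{maxAgeMs}} threshold are hypothetical and do not come from the YARN code base; only the notion of comparing the last health report time against the current time is taken from the description.

```java
// Hypothetical sketch of a staleness check; not the actual YARN implementation.
final class StaleReportSketch {

  /**
   * Reports healthy only if the last check passed and its report is recent.
   *
   * @param lastCheckPassed  outcome of the most recent completed check
   * @param lastReportTimeMs timestamp of the most recent report, in milliseconds
   * @param nowMs            current time, in milliseconds
   * @param maxAgeMs         assumed threshold beyond which a report counts as stale
   */
  static boolean isHealthy(boolean lastCheckPassed, long lastReportTimeMs,
                           long nowMs, long maxAgeMs) {
    boolean fresh = nowMs - lastReportTimeMs <= maxAgeMs;
    return lastCheckPassed && fresh;
  }
}
```

With this shape, a checker thread that hangs in {{mkdir}} or {{rmdir}} stops updating the report time, so the node would eventually be reported unhealthy without the hung call ever returning.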
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)