Tsuyoshi Ozawa commented on YARN-4301:

[~suda] thank you for reporting this issue. The overall approach of the patch 
looks good to me, except for removing the synchronized block. Do you have any reason to do 

Could you also add test cases to the following test?

> NM disk health checker should have a timeout
> --------------------------------------------
>                 Key: YARN-4301
>                 URL: https://issues.apache.org/jira/browse/YARN-4301
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Akihiro Suda
>         Attachments: YARN-4301-1.patch
> The disk health checker [verifies a 
> disk|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java#L371-L385]
>  by executing {{mkdir}} and {{rmdir}} periodically.
> If these operations do not return within a reasonable timeout, the disk should be 
> marked bad, and thus {{nodeInfo.nodeHealthy}} should flip to {{false}}.
> I confirmed that current YARN does not have an implicit timeout (on JDK7, 
> Linux 4.2, ext4) using [Earthquake|https://github.com/osrg/earthquake], our 
> fault injector for distributed systems.
> (I'll post the reproduction script shortly)
> I think we can fix this issue by making 
> [{{NodeHealthCheckerService.isHealthy()}}|https://github.com/apache/hadoop/blob/96677bef00b03057038157efeb3c2ad4702914da/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java#L69-L73]
>  return {{false}} if the value of {{this.getLastHealthReportTime()}} is too 
> old.
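
For illustration only, the staleness check proposed in the description could be sketched roughly as below. This is not the attached patch and not the real NodeHealthCheckerService: the class name, the {{maxReportAgeMs}} threshold, and the explicit clock parameter are all hypothetical, chosen so the idea can be read and tested in isolation.

```java
/**
 * Hypothetical stand-in for a health checker (NOT the real
 * NodeHealthCheckerService). It reports unhealthy when the last
 * recorded disk check failed, or when no check has completed
 * within maxReportAgeMs (for example because mkdir/rmdir is hung).
 */
class StaleHealthReportSketch {
    // Assumed threshold; the issue does not specify a value or a config key.
    private final long maxReportAgeMs;
    private volatile long lastHealthReportTime;
    private volatile boolean lastCheckPassed = true;

    StaleHealthReportSketch(long maxReportAgeMs, long nowMs) {
        this.maxReportAgeMs = maxReportAgeMs;
        this.lastHealthReportTime = nowMs;
    }

    /** Called by the periodic checker each time a disk check completes. */
    void recordCheck(boolean passed, long nowMs) {
        this.lastCheckPassed = passed;
        this.lastHealthReportTime = nowMs;
    }

    long getLastHealthReportTime() {
        return lastHealthReportTime;
    }

    /** Healthy only if the last check passed and the report is recent enough. */
    boolean isHealthy(long nowMs) {
        boolean stale = nowMs - getLastHealthReportTime() > maxReportAgeMs;
        return lastCheckPassed && !stale;
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would more likely read a monotonic clock internally.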

This message was sent by Atlassian JIRA
