[ 
https://issues.apache.org/jira/browse/YARN-8775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648045#comment-16648045
 ] 

Haibo Chen edited comment on YARN-8775 at 10/12/18 3:44 PM:
------------------------------------------------------------

Thanks for the update, [~bsteinbach]. A few comments

1) Let's add a @VisibleForTesting annotation for checkDirs() to indicate it is 
used only by tests, not public API

2) In prepareDirToFail(String dir), I think we shall fail fast by throwing an 
IOException if the file deletion is unsuccessful. Otherwise, the following 
file.createNewFile() would be unsuccessful and the directory would remain, 
rather than being replaced by a file with the same name.   That is, 
prepareDirToFail() won't do what it's supposed to do. It is harder to figure 
out why from the logs when unit tests fail.

3) "Make check interval high enough to never run it during the test." => "Set 
disk check interval high enough so that it never runs during the test."

4) The comment where the interval is set shall be updated. "// set disk health 
check interval to a small value (say 1 sec)." => "// set disk health check 
interval to a large value to effectively disable disk health check done 
internally in LocalDirsHandlerService"


was (Author: haibochen):
Thanks for the update, [~bsteinbach]. A few comments

1) Let's add a @VisibleForTesting annotation for checkDirs() to indicate it is 
used by tests, not public API

2) In prepareDirToFail(String dir), I think we shall fail fast by throwing an 
IOException if the file deletion is unsuccessful. Otherwise, the following 
file.createNewFile() would be unsuccessful and the directory would remain, 
rather than being replaced by a file with the same name.   That is, 
prepareDirToFail() won't do what it's supposed to do. It is harder to figure 
out why from the logs.

> TestDiskFailures.testLocalDirsFailures sometimes can fail on concurrent File 
> modifications
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-8775
>                 URL: https://issues.apache.org/jira/browse/YARN-8775
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test, yarn
>    Affects Versions: 3.0.0
>            Reporter: Antal Bálint Steinbach
>            Assignee: Antal Bálint Steinbach
>            Priority: Major
>         Attachments: YARN-8775.001.patch, YARN-8775.002.patch, 
> YARN-8775.003.patch
>
>
> The test can fail sometimes when file operations were done during the check 
> done by the thread in _LocalDirsHandlerService._
> {code:java}
> java.lang.AssertionError: NodeManager could not identify disk failure.
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.assertTrue(Assert.java:41)
>       at 
> org.apache.hadoop.yarn.server.TestDiskFailures.verifyDisksHealth(TestDiskFailures.java:239)
>       at 
> org.apache.hadoop.yarn.server.TestDiskFailures.testDirsFailures(TestDiskFailures.java:202)
>       at 
> org.apache.hadoop.yarn.server.TestDiskFailures.testLocalDirsFailures(TestDiskFailures.java:99)
> Stderr
> 2018-09-13 08:21:49,822 INFO [main] server.TestDiskFailures 
> (TestDiskFailures.java:prepareDirToFail(277)) - Prepared 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_1
>  to fail.
> 2018-09-13 08:21:49,823 INFO [main] server.TestDiskFailures 
> (TestDiskFailures.java:prepareDirToFail(277)) - Prepared 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_3
>  to fail.
> 2018-09-13 08:21:49,823 WARN [DiskHealthMonitor-Timer] 
> nodemanager.DirectoryCollection (DirectoryCollection.java:checkDirs(283)) - 
> Directory 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_1
>  error, Not a directory: 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_1,
>  removing from list of valid directories
> 2018-09-13 08:21:49,824 WARN [DiskHealthMonitor-Timer] 
> localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:initializeLogDir(1329)) - Could not 
> initialize log dir 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_3
> java.io.FileNotFoundException: Destination exists and is not a directory: 
> /tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_3
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:515)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:496)
> at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1081)
> at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:178)
> at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:205)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:747)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:743)
> at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:743)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.initializeLogDir(ResourceLocalizationService.java:1324)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.initializeLogDirs(ResourceLocalizationService.java:1318)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$000(ResourceLocalizationService.java:141)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$2.onDirsChanged(ResourceLocalizationService.java:269)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:317)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:452)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:166)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> 2018-09-13 08:21:59,824 INFO [main] server.TestDiskFailures 
> (TestDiskFailures.java:verifyDisksHealth(237)) - ExpectedDirs=
> 2018-09-13 08:21:59,825 INFO [main] server.TestDiskFailures 
> (TestDiskFailures.java:verifyDisksHealth(238)) - 
> SeenDirs=/tmp/dist-test-taskjUrf0_/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/target/org.apache.hadoop.yarn.server.TestDiskFailures/org.apache.hadoop.yarn.server.TestDiskFailures-logDir-nm-0_3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to