[
https://issues.apache.org/jira/browse/YARN-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408153#comment-16408153
]
Jason Lowe commented on YARN-8054:
----------------------------------
Thanks for the patch!
Do we really want to suppress the stack trace in the log message? Since the
code is using "+ t" rather than ", t" in the log line, it will only print the
exception message and not show where the exception came from. That is very
frustrating if the message is something like "NullPointerException".
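To illustrate the difference, here is a minimal sketch assuming an slf4j Logger (the class and method names are placeholders, not the attached patch):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: shows why "+ t" loses the stack trace while ", t" keeps it.
public class DiskCheckLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DiskCheckLoggingSketch.class);

  static void runCheck(Runnable check) {
    try {
      check.run();
    } catch (Throwable t) {
      // String concatenation only logs t.toString(), e.g. "java.lang.NullPointerException",
      // with no indication of where it was thrown.
      LOG.error("Disk check failed: " + t);

      // Passing the throwable as the last argument logs the full stack trace as well.
      LOG.error("Disk check failed", t);
    }
  }

  public static void main(String[] args) {
    runCheck(() -> { throw new NullPointerException(); });
  }
}
{code}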
> Improve robustness of the LocalDirsHandlerService MonitoringTimerTask thread
> ----------------------------------------------------------------------------
>
> Key: YARN-8054
> URL: https://issues.apache.org/jira/browse/YARN-8054
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: YARN-8054.001.patch
>
>
> The DeprecatedRawLocalFileStatus#loadPermissionInfo can throw a
> RuntimeException which can kill the MonitoringTimerTask thread. This can
> leave the node in a bad state where all NM local directories are marked "bad"
> and there is no automatic recovery. In the case below the error was "too many
> open files", but it could be any number of other recoverable states.
> {noformat}
> 2018-03-18 02:37:42,960 [DiskHealthMonitor-Timer] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[DiskHealthMonitor-Timer,5,main] threw an Exception.
> java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "ls": error=24, Too many open files
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:942)
> at org.apache.hadoop.util.Shell.run(Shell.java:898)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1078)
> at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:697)
> at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271)
> at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:166)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> Caused by: java.io.IOException: error=24, Too many open files
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 17 more
> at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:737)
> at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:672)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1556)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1521)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$1.onDirsChanged(ResourceLocalizationService.java:271)
> at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:381)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:449)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$500(LocalDirsHandlerService.java:52)
> at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:166)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}
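For context, a minimal sketch of the kind of hardening the summary describes, assuming a plain java.util.TimerTask whose run() catches unexpected exceptions; the class and method names are illustrative placeholders, not the attached YARN-8054.001.patch:
{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Illustrative sketch only: a monitoring task that survives a RuntimeException
// (e.g. "Too many open files" from loadPermissionInfo) instead of letting it
// kill the timer thread and leave all local dirs marked "bad" permanently.
public class RobustDiskMonitorSketch extends TimerTask {
  private final Runnable diskCheck; // hypothetical stand-in for LocalDirsHandlerService.checkDirs()

  RobustDiskMonitorSketch(Runnable diskCheck) {
    this.diskCheck = diskCheck;
  }

  @Override
  public void run() {
    try {
      diskCheck.run();
    } catch (Throwable t) {
      // Log with the throwable (not "+ t") so the stack trace is kept, then
      // return normally so the next scheduled run can recover.
      System.err.println("Disk health check failed; will retry on the next interval");
      t.printStackTrace();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Timer timer = new Timer("DiskHealthMonitor-Timer");
    // Simulate the reported failure mode: the check itself throws.
    timer.schedule(new RobustDiskMonitorSketch(() -> {
      throw new RuntimeException("Error while running command to get file permissions");
    }), 0, 1000);
    Thread.sleep(3000); // the timer thread survives repeated failures
    timer.cancel();
  }
}
{code}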
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]