[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284243#comment-14284243 ]
Jason Lowe commented on YARN-3074:
----------------------------------
Sample stacktrace:
{noformat}
2015-01-16 12:06:56,399 [LocalizerRunner for container_1416815736267_3849544_01_000817] FATAL yarn.YarnUncaughtExceptionHandler: Thread Thread[LocalizerRunner for container_1416815736267_3849544_01_000817,5,main] threw an Error. Shutting down now...
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:226)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:157)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.close(ChecksumFs.java:366)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1125)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
Caused by: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:318)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:224)
    ... 10 more
{noformat}
Looks like the Hadoop filesystem layer "helpfully" wrapped what was originally
an IOException in an FSError. FSError extends Error, _not_ Exception, so the
try...catch(Exception) block in LocalizerRunner.run doesn't catch it. It then
propagates to the top of the thread, and the uncaught exception handler kills
the whole process.

We should consider catching Throwable rather than Exception in
LocalizerRunner.run, or at least also catching FSError, since it will be a
common and recoverable error in this case.
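
To illustrate the catch behavior described above, here is a minimal, self-contained sketch. It does not use Hadoop or the real LocalizerRunner code: FsErrorStandIn is a hypothetical stand-in for org.apache.hadoop.fs.FSError, and the class and method names are made up for the example. It only shows that catch(Exception) lets an Error subclass escape, and that an extra catch clause contains it.

```java
import java.io.IOException;

public class LocalizerCatchSketch {

    // Hypothetical stand-in for org.apache.hadoop.fs.FSError (an Error subclass).
    static class FsErrorStandIn extends Error {
        FsErrorStandIn(Throwable cause) {
            super(cause);
        }
    }

    // Simulates a write to a full disk: the IOException is wrapped in an Error,
    // as in the stack trace above.
    static void writeCredentials() {
        throw new FsErrorStandIn(new IOException("No space left on device"));
    }

    // Current pattern: catch (Exception) does not match an Error,
    // so the Error escapes this method.
    static String runCatchingException() {
        try {
            writeCredentials();
            return "ok";
        } catch (Exception e) {
            return "container failed: " + e.getMessage();
        }
    }

    // Suggested pattern: also catch the Error subclass so that only the
    // container fails, not the whole thread or process.
    static String runCatchingFsError() {
        try {
            writeCredentials();
            return "ok";
        } catch (Exception e) {
            return "container failed: " + e.getMessage();
        } catch (FsErrorStandIn e) {
            return "container failed: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        try {
            runCatchingException();
        } catch (Error e) {
            // In the nodemanager this would reach the uncaught exception handler.
            System.out.println("escaped catch(Exception): " + e.getClass().getSimpleName());
        }
        System.out.println(runCatchingFsError());
    }
}
```

Catching only the specific Error type is the narrower of the two options; catching Throwable would also swallow errors such as OutOfMemoryError, which may or may not be desirable here.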
> Nodemanager dies when localizer runner tries to write to a full disk
> --------------------------------------------------------------------
>
> Key: YARN-3074
> URL: https://issues.apache.org/jira/browse/YARN-3074
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: Jason Lowe
>
> When a LocalizerRunner tries to write to a full disk it can bring down the
> nodemanager process. Instead of failing the whole process we should fail
> only the container and make a best attempt to keep going.