[ https://issues.apache.org/jira/browse/YARN-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284243#comment-14284243 ]

Jason Lowe commented on YARN-3074:
----------------------------------

Sample stacktrace:
{noformat}
2015-01-16 12:06:56,399 [LocalizerRunner for container_1416815736267_3849544_01_000817] FATAL yarn.YarnUncaughtExceptionHandler: Thread Thread[LocalizerRunner for container_1416815736267_3849544_01_000817,5,main] threw an Error.  Shutting down now...
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:226)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:157)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.close(ChecksumFs.java:366)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1125)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1068)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:318)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:224)
        ... 10 more
{noformat}

It looks like the Hadoop filesystem layer "helpfully" converted what was 
originally an IOException into an FSError.  FSError is an Error, _not_ an 
Exception, so the try...catch(Exception) block in LocalizerRunner.run doesn't 
catch it.  It then propagates to the top of the thread, where the uncaught 
exception handler kills the whole process.
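
To illustrate the mechanism, here is a minimal standalone sketch.  It does not 
use Hadoop at all: StandInFsError is a hypothetical stand-in for 
org.apache.hadoop.fs.FSError (which extends java.lang.Error), and the outer 
catch(Throwable) only plays the role of the thread's uncaught exception 
handler.  All class and method names here are made up for the example.

```java
public class ErrorVsExceptionDemo {
    // Hypothetical stand-in for org.apache.hadoop.fs.FSError: an Error, not an Exception.
    public static class StandInFsError extends Error {
        public StandInFsError(Throwable cause) {
            super(cause);
        }
    }

    // Mimics the filesystem layer wrapping an IOException in an Error.
    static void writeToFullDisk() {
        throw new StandInFsError(new java.io.IOException("No space left on device"));
    }

    // Reports which handler ended up seeing the failure.
    public static String runWithCatchException() {
        try {
            try {
                writeToFullDisk();
                return "no failure";
            } catch (Exception e) {
                // Never reached for StandInFsError: Error is not a subclass of Exception.
                return "caught by catch(Exception)";
            }
        } catch (Throwable t) {
            // Stands in for the thread's uncaught exception handler.
            return "escaped to uncaught handler: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithCatchException());
    }
}
```

The inner catch(Exception) is skipped entirely, which matches what the 
stacktrace shows for LocalizerRunner.run.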

We should consider catching Throwable rather than Exception in 
LocalizerRunner.run, or at least also catch FSError since it will be a common 
and recoverable error in this case.
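
A rough sketch of the narrower option (catching the FSError-like type alongside 
Exception) might look like the following.  This is not the actual 
LocalizerRunner code and not a proposed patch: StandInFsError, Step, and 
localize are hypothetical names, and the real fix would catch 
org.apache.hadoop.fs.FSError and fail the container through the nodemanager's 
own event handling.

```java
import java.io.IOException;

public class LocalizerRunnerSketch {
    // Hypothetical stand-in for org.apache.hadoop.fs.FSError (an Error subclass).
    public static class StandInFsError extends Error {
        public StandInFsError(Throwable cause) {
            super(cause);
        }
    }

    // Hypothetical stand-in for a localization step such as writeCredentials.
    public interface Step {
        void run() throws IOException;
    }

    // Returns true if the step succeeded, false if only this container should be failed.
    public static boolean localize(String containerId, Step step) {
        try {
            step.run();
            return true;
        } catch (StandInFsError | Exception e) {
            // Treat a full disk as a per-container failure instead of letting
            // the Error reach the uncaught exception handler.
            System.err.println("Localization failed for " + containerId + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = localize("container_example", () -> {
            throw new StandInFsError(new IOException("No space left on device"));
        });
        System.out.println("localized=" + ok + ", process still running");
    }
}
```

Catching Throwable instead would also cover this case, at the cost of 
swallowing Errors such as OutOfMemoryError that may not be safe to recover from.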

> Nodemanager dies when localizer runner tries to write to a full disk
> --------------------------------------------------------------------
>
>                 Key: YARN-3074
>                 URL: https://issues.apache.org/jira/browse/YARN-3074
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>
> When a LocalizerRunner tries to write to a full disk it can bring down the 
> nodemanager process.  Instead of failing the whole process we should fail 
> only the container and make a best attempt to keep going.


