[ https://issues.apache.org/jira/browse/MAPREDUCE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved MAPREDUCE-4852.
-----------------------------------

    Resolution: Duplicate

This was fixed by MAPREDUCE-5251.

> Reducer should not signal fetch failures for disk errors on the reducer's side
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4852
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4852
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
>
> Ran across a case where a reducer ran on a node where the disks were full, 
> leading to an exception like this during the shuffle fetch:
> {noformat}
> 2012-12-05 09:07:28,749 INFO [fetcher#25] 
> org.apache.hadoop.mapreduce.task.reduce.MergeManager: 
> attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is 
> greater than maxSingleShuffleLimit (155104064)
> 2012-12-05 09:07:28,755 INFO [fetcher#25] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read 
> map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188, 101587629
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for 
> output/attempt_1352354913026_138167_r_000189_0/map_654.out
>       at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
>       at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
>       at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
>       at 
> org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
>       at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
>       at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
> 2012-12-05 09:07:28,755 WARN [fetcher#25] 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for 
> tasks [attempt_1352354913026_138167_m_000654_0]
> 2012-12-05 09:07:28,756 INFO [fetcher#25] 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch 
> failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
> {noformat}
> Even though the error was local to the reducer, it reported the error as a 
> fetch failure to the AM rather than failing the reducer itself.  It then 
> proceeded to hit the same error for many other maps, causing those maps to 
> be relaunched because of the reported fetch failures.  In this case it 
> would have been better to fail the reducer and try another node rather 
> than blame the mappers for what is an error on the reducer's side.
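>
> To make the distinction concrete, here is a minimal, self-contained Java 
> sketch of the triage logic argued for above.  It is not the actual 
> MAPREDUCE-5251 patch; the class, method, and enum names are hypothetical, 
> and only DiskErrorException is a real Hadoop type (it is what the 
> reducer's own LocalDirAllocator throws in the stack trace above):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.util.DiskChecker.DiskErrorException;
>
> // Hypothetical sketch, not the MAPREDUCE-5251 fix: classify a failed
> // copy so local disk trouble fails the reducer instead of the mapper.
> public class ShuffleErrorTriage {
>
>   /** What the fetcher would do after a copy attempt fails. */
>   enum Action { REPORT_FETCH_FAILURE, FAIL_REDUCE_ATTEMPT }
>
>   /**
>    * DiskErrorException is raised by the reducer's own LocalDirAllocator
>    * while reserving space for the map output, so the fault is local to
>    * the reducer; any other IOException is a genuine remote read problem.
>    */
>   static Action triage(IOException e) {
>     if (e instanceof DiskErrorException) {
>       // Local disks are full or bad: fail this reduce attempt so it can
>       // be rescheduled on a healthier node, rather than relaunching maps.
>       return Action.FAIL_REDUCE_ATTEMPT;
>     }
>     // Connection resets, bad headers, truncated streams, etc. really are
>     // failures to fetch the remote map output: report a fetch failure.
>     return Action.REPORT_FETCH_FAILURE;
>   }
>
>   public static void main(String[] args) {
>     IOException local = new DiskErrorException(
>         "Could not find any valid local directory for map output");
>     IOException remote = new IOException("Premature EOF from map output");
>     System.out.println("disk error -> " + triage(local));  // FAIL_REDUCE_ATTEMPT
>     System.out.println("read error -> " + triage(remote)); // REPORT_FETCH_FAILURE
>   }
> }
> {code}
> This is the distinction between local and remote failures that 
> MAPREDUCE-5251 went on to address.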



--
This message was sent by Atlassian JIRA
(v6.2#6252)
