[ https://issues.apache.org/jira/browse/MAPREDUCE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe resolved MAPREDUCE-4852.
-----------------------------------
    Resolution: Duplicate

This was fixed by MAPREDUCE-5251.

> Reducer should not signal fetch failures for disk errors on the reducer's side
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4852
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4852
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
>
> Ran across a case where a reducer ran on a node where the disks were full,
> leading to an exception like this during the shuffle fetch:
> {noformat}
> 2012-12-05 09:07:28,749 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.MergeManager: attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is greater than maxSingleShuffleLimit (155104064)
> 2012-12-05 09:07:28,755 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188, 101587629
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1352354913026_138167_r_000189_0/map_654.out
> 	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
> 	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
> 	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
> 	at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
> 	at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
> 	at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
> 2012-12-05 09:07:28,755 WARN [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for tasks [attempt_1352354913026_138167_m_000654_0]
> 2012-12-05 09:07:28,756 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
> {noformat}
> Even though the error was local to the reducer, it reported the error as a
> fetch failure to the AM rather than failing the reducer itself. It then
> proceeded to hit the same error for many other maps, causing them to be
> relaunched due to the reported fetch failures. In this case it would have
> been better to fail the reducer and try another node rather than blame the
> mapper for what is an error on the reducer's side.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
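The crux of the report is that the fetcher should distinguish errors that originate on the reducer's own node (such as the {{DiskErrorException}} above, thrown while reserving local disk for the shuffled output) from genuine remote fetch errors, and only report the latter to the AM. A minimal sketch of that classification, using a hypothetical stand-in class rather than Hadoop's actual Fetcher code, might look like:

```java
import java.io.IOException;

// Hypothetical sketch, not Hadoop's actual implementation: classify a
// shuffle-copy error as local (the reducer's own disks) vs. remote (the
// map output host), so local errors fail the reducer instead of blaming
// the mapper with a fetch-failure report.
public class ShuffleErrorSketch {

    // Stand-in for org.apache.hadoop.util.DiskChecker.DiskErrorException.
    static class DiskErrorException extends IOException {
        DiskErrorException(String msg) { super(msg); }
    }

    /**
     * Returns true when the failure is local to the reducer (e.g. no valid
     * local directory because its disks are full). Such errors should fail
     * the reduce attempt so it can be rescheduled on a healthier node,
     * rather than being reported as a fetch failure against the mapper.
     */
    static boolean isLocalError(IOException e) {
        return e instanceof DiskErrorException;
    }

    public static void main(String[] args) {
        IOException local  = new DiskErrorException(
                "Could not find any valid local directory");
        IOException remote = new IOException("Connection reset by peer");

        System.out.println(isLocalError(local));   // true  -> fail the reducer
        System.out.println(isLocalError(remote));  // false -> report fetch failure
    }
}
```

MAPREDUCE-5251, which superseded this issue, took essentially this direction: local disk errors during the shuffle no longer generate fetch-failure reports against the map attempts.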