[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241788#comment-14241788 ]

Gera Shegalov commented on MAPREDUCE-6166:
------------------------------------------

[~eepayne], thanks for updating the patch. It might have gone stale. Please check and rebase it.
{code}
The patch does not appear to apply with p0 to p2
{code}

> Reducers do not catch bad map output transfers during shuffle if data
> shuffled directly to disk
> ---
>
>                 Key: MAPREDUCE-6166
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>         Attachments: MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt, MAPREDUCE-6166.v3.txt, MAPREDUCE-6166.v4.txt
>
> In very large map/reduce jobs (5 maps, 2500 reducers), the intermediate
> map partition output gets corrupted on disk on the map side. If this
> corrupted map output is too large to shuffle in memory, the reducer streams
> it to disk without validating the checksum. In jobs this large, it could take
> hours before the reducer finally tries to read the corrupted file and fails.
> Since retries of the failed reduce attempt will also take hours, this delay
> in discovering the failure is multiplied greatly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241681#comment-14241681 ]

Hadoop QA commented on MAPREDUCE-6166:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org
against trunk revision 2e98ad3.

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5071//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237297#comment-14237297 ]

Hadoop QA commented on MAPREDUCE-6166:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12685532/MAPREDUCE-6166-gera-missing-cs-test.patch
against trunk revision 1b3bb9e.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 72 new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5062//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5062//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5062//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237016#comment-14237016 ]

Gera Shegalov commented on MAPREDUCE-6166:
------------------------------------------

Thanks, it makes sense, [~eepayne]. I just wanted to confirm that there are no existing tests catching a missing checksum in on-disk shuffle ([confirmed!|https://issues.apache.org/jira/browse/MAPREDUCE-6166?focusedCommentId=14236700&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14236700]).

For the production code I have only one comment: I am now convinced {{FileSystem.getLocal(conf)}} should be {{FileSystem.getLocal(conf).getRaw()}} in the corresponding {{OnDiskMapOutput}} constructor.

{{TestFetcher#testCorruptedFiles}} comments:

Nit: lower-case {{FETCHER}} because it's a final variable.

{{FileSystem fs}} needs to be changed to {{FileSystem.getLocal(conf).getRaw()}}, just like in the production code, and it could be made an instance variable because we can use it for cleanup in {{tearDown}}.

{{Path p}} should be more mnemonic, and it's better to use a root directory that matches the test method so we can use it for cleanup in {{tearDown}}:
{code}
Path outputPath = new Path(name.getMethodName() + "/foo");
{code}

Instead of reverse-engineering the path {{shuffledToDisk}}, we can use
{code}
Path shuffledToDisk = OnDiskMapOutput.getTempPath(outputPath, fetcher);
{code}

{quote}
{code}
457       ios.write(mapData.getBytes());
458       ios.close();
{code}
{quote}
{{ios.close()}} should be in a finally block.

{quote}
{code}
476       bin = new ByteArrayInputStream(corrupted);
477       // Read past the shuffle header.
478       bin.read(new byte[headerSize], 0, headerSize);
{code}
{quote}
Move lines 477-478 inside the following try on 480.

Drop {{fs.deleteOnExit}}. It comes too late in case there was an exception before; we should rather do the cleanup inside the {{tearDown}} method.
{quote}
{code}
491       IFileInputStream iFin = new IFileInputStream(
492           new FileInputStream(shuffledToDisk.toString()), dataSize, job);
{code}
{quote}
It's probably better not to mix in the java.io API if we already use the Hadoop FileSystem API. Why not do:
{code}
491       IFileInputStream iFin = new IFileInputStream(fs.open(shuffledToDisk), dataSize, job);
{code}

{quote}
{code}
493       iFin.read(new byte[dataSize], 0, dataSize);
494       iFin.close();
495       fs.close();
{code}
{quote}
Make sure to put 493 and 494 in try/finally, accordingly. Since we are getting rid of {{fs.deleteOnExit}}, we don't need {{fs.close}}.
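The close-in-finally and tearDown-cleanup suggestions above amount to a standard resource-safety pattern. A minimal sketch with plain java.io (class, file, and method names here are illustrative, not taken from the patch):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CloseInFinally {
    // Write a payload and guarantee the stream is closed even if write() throws,
    // mirroring the review comment that ios.close() belongs in a finally block.
    static long writeAndClose(File out, byte[] payload) throws IOException {
        OutputStream ios = new FileOutputStream(out);
        try {
            ios.write(payload);
        } finally {
            ios.close(); // always runs; no leaked descriptor on failure
        }
        return out.length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("shuffle-test", ".tmp");
        try {
            // "map data" is 8 bytes, so this prints 8.
            System.out.println(writeAndClose(tmp, "map data".getBytes("UTF-8")));
        } finally {
            tmp.delete(); // the cleanup a JUnit tearDown() would perform
        }
    }
}
```

With JUnit, the `finally` around `tmp.delete()` would instead live in an `@After`-annotated `tearDown()` method, which is exactly why the review asks for a per-test root directory: it gives tearDown one known path to delete.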
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236822#comment-14236822 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

Thank you very much, [~jira.shegalov].
{quote}
I'm uploading a modified patch based on my previous review, only with the intention to see what, if any, tests would catch a missing checksum in on-disk shuffle.
{quote}
The last segment of the test I added ({{TestFetcher#testCorruptedIFile}}) will catch that the checksum is missing or incorrect when it tries to read the IFile that was shuffled to disk by {{OnDiskMapOutput#shuffle}}.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236700#comment-14236700 ]

Hadoop QA commented on MAPREDUCE-6166:
--------------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12685532/MAPREDUCE-6166-gera-missing-cs-test.patch
against trunk revision e227fb8.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5061//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5061//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235543#comment-14235543 ]

Hadoop QA commented on MAPREDUCE-6166:
--------------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12685168/MAPREDUCE-6166.v3.txt
against trunk revision 0653918.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5060//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5060//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234498#comment-14234498 ]

Gera Shegalov commented on MAPREDUCE-6166:
------------------------------------------

[~eepayne], thanks for your reproducer. Would you mind uploading your patch to run it through Jenkins? If there is no test catching it, we should add one. It is now clearer why we would need the checksum via IFile. I also feel more strongly now that we need to get rid of the checksumming LocalFileSystem, just like in MapTask.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234461#comment-14234461 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

Thanks, [~jira.shegalov], for taking the time to investigate this issue. The unit tests are not catching this. I am testing this in a 10-node secure cluster. I am running wordcount on a file that 1) has no repeated words and 2) is large enough to ensure that at least some of the map outputs are shuffled to disk:
{code}
$ $HADOOP_PREFIX/bin/hadoop fs -cat Input/NoRecurringWords/NoRecurringWords-part0.txt | wc -l -w
  4008920  4008920
$ $HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-$HADOOP_VERSION.jar wordcount Input/NoRecurringWords/NoRecurringWords-part0.txt Output/01
{code}
If I implement the fix by adjusting {{bytesLeft}} and leaving {{input.read()}} alone, {{OnDiskMapOutput#shuffle}} does succeed and writes the map output to disk in a temporary location.
However, when the {{Merger}} goes to read that temporary file (via {{RawKVIteratorReader}} in {{MergeManager}}), it fails with the following exception:
{code}
2014-12-04 17:47:12,040 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:228)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:152)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:127)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
	at org.apache.hadoop.io.IOUtils.wrappedReadForCompressedData(IOUtils.java:170)
	at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:378)
	at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:426)
	at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:337)
	at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:519)
	at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:547)
	at org.apache.hadoop.mapred.ReduceTask$4.next(ReduceTask.java:601)
	...
{code}
This is because the checksum is not at the end of the temporary file.

On the other hand, if I leave {{bytesLeft}} alone and instead call {{((IFileInputStream)input).readWithChecksum}}, the reducers all succeed. This is because {{readWithChecksum}} not only compares the input against the checksum, it also includes the checksum at the end of the byte buffer.

Please let me know if that makes sense.
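The failure mode described above can be illustrated with a toy model of a file that carries its own trailing checksum. This is not Hadoop's actual IFile framing; CRC32 merely stands in for the IFile checksum, and all names are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Toy model: a "file" whose last 4 bytes are a checksum over the payload.
public class TrailerDemo {
    // Append a 4-byte CRC32 of the payload, like an IFile's trailing checksum.
    static byte[] withTrailer(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // What a later reader (the merge pass, in the real code) does: recompute
    // the checksum over everything but the trailer and compare.
    static boolean verify(byte[] fileBytes) {
        if (fileBytes.length < 4) {
            return false;
        }
        byte[] payload = Arrays.copyOf(fileBytes, fileBytes.length - 4);
        int stored = ByteBuffer.wrap(fileBytes, fileBytes.length - 4, 4).getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return stored == (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] onDisk = withTrailer("intermediate map output".getBytes());

        // A copy that preserves payload + trailer (what readWithChecksum keeps):
        System.out.println(verify(onDisk)); // true

        // A single corrupted bit is caught at verification time:
        byte[] corrupted = onDisk.clone();
        corrupted[0] ^= 1;
        System.out.println(verify(corrupted)); // false

        // Arrays.copyOf(onDisk, onDisk.length - 4) would model a copy that
        // drops the trailer: the later reader then treats the last 4 payload
        // bytes as the checksum, so the merge-pass verification misfires.
    }
}
```

The last comment is the situation in the stack trace: the shuffle wrote the payload without its trailer, so the merge-time reader's checksum comparison is against the wrong bytes.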
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233666#comment-14233666 ]

Gera Shegalov commented on MAPREDUCE-6166:
------------------------------------------

Hi [~eepayne], sorry for the delay. I knew what modifications you were talking about, but I did not have the time to verify and convince myself whether this double checksumming was really needed in the Merger. I did, though, run a version of the patch that implements my suggestion above through a couple of unit tests and did not see any issues:
{code}
mapreduce.task.reduce.TestFetcher
mapreduce.task.reduce.TestMergeManager
mapreduce.task.reduce.TestMerger
{code}
That's why I hoped you'd point me to where a failure occurs. That's my current status; I hope to get back to it again soon.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233613#comment-14233613 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

[~jira.shegalov], did that make sense?
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230426#comment-14230426 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

I'm sorry. The second code snippet should have been this:
{code}
    input = new IFileInputStream(input, compressedLength, conf);
    // Copy data to local-disk
    long bytesLeft = compressedLength;
    try {
      final int BYTES_TO_READ = 64 * 1024;
      byte[] buf = new byte[BYTES_TO_READ];
      while (bytesLeft > 0) {
        int n = ((IFileInputStream)input).readWithChecksum(buf, 0,
            (int) Math.min(bytesLeft, BYTES_TO_READ));
        ...
{code}
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230407#comment-14230407 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

[~jira.shegalov], I'm sorry for not being clear.
{quote}
Can you clarify where in the code it's required to keep the original checksum? Then these contents are written out using {{LocalFileSystem}}, which will again create an on-disk checksum because it's based on {{ChecksumFileSystem}}.
{quote}
I don't think the {{IFile}} format is related to {{ChecksumFileSystem}}. The {{IFile}} checksum is expected to be the last 4 bytes of the {{IFile}}, and if we use {{input.read}} as below, those 4 bytes of checksum are not copied into {{buf}}:
{code}
    input = new IFileInputStream(input, compressedLength, conf);
    // Copy data to local-disk
    long bytesLeft = compressedLength - ((IFileInputStream)input).getSize();
    try {
      final int BYTES_TO_READ = 64 * 1024;
      byte[] buf = new byte[BYTES_TO_READ];
      while (bytesLeft > 0) {
        int n = input.read(buf, 0, (int) Math.min(bytesLeft, BYTES_TO_READ));
        ...
{code}
However, if we use {{readWithChecksum}} as below, the checksum is copied into {{buf}}:
{code}
    input = new IFileInputStream(input, compressedLength, conf);
    // Copy data to local-disk
    long bytesLeft = compressedLength;
    try {
      final int BYTES_TO_READ = 64 * 1024;
      byte[] buf = new byte[BYTES_TO_READ];
      while (bytesLeft > 0) {
        int n = ((IFileInputStream)input).read(buf, 0,
            (int) Math.min(bytesLeft, BYTES_TO_READ));
        ...
{code}
Without those last 4 bytes of checksum at the end of the {{IFile}}, the final read will fail during the last merge pass with a checksum error.
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229432#comment-14229432 ]

Gera Shegalov commented on MAPREDUCE-6166:
------------------------------------------

Thanks for commenting, [~eepayne]!
bq. Since OnDiskMapOutput is shuffling the whole IFile to disk, the checksum is needed later during the last merge pass when the IFile contents are read again and decompressed.
Can you clarify where in the code it's required to keep the original checksum? What I see is that after your modifications, {{OnDiskMapOutput}} is guaranteed to validate the contents of the destination buffer against the remote checksum. Then these contents are written out using {{LocalFileSystem}}, which will again create an on-disk checksum because it's based on {{ChecksumFileSystem}}. Are you proposing an optimization that avoids computing the checksum twice when shuffling straight to disk, by using {{RawLocalFileSystem}}? Can we defer it to another JIRA?
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226753#comment-14226753 ] Eric Payne commented on MAPREDUCE-6166: --- Just to clarify, neither {{read}} nor {{readWithChecksum}} writes anything to disk. They both read data into the byte buffer, which then is written to disk by {{shuffle}}.
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226745#comment-14226745 ] Eric Payne commented on MAPREDUCE-6166: --- [~jira.shegalov], I have one question about re-using the {{input.read}} code in {{OnDiskMapOutput}}. {quote} We can set {code} long bytesLeft = compressedLength - ((IFileInputStream)input).getSize() {code} Then we don't need to touch the line {{input.read}} to do {{readWithChecksum}} {quote} In this case, {{input.read}} does not write the checksum to the disk while {{readWithChecksum}} will write it. Since {{OnDiskMapOutput}} is shuffling the whole IFile to disk, the checksum is needed later during the last merge pass when the IFile contents are read again and decompressed. If we were to implement {{input.read}} as above, it looks like we would still need to add something like the following in order to put the checksum on the disk: {code} disk.write(((IFileInputStream)input).getChecksum(), 0, (int) ((IFileInputStream)input).getSize()); {code}
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226322#comment-14226322 ] Eric Payne commented on MAPREDUCE-6166: --- [~jira.shegalov], thank you very much for your detailed analysis of this patch. I have opened MAPREDUCE-6174 to cover the parent class for {{InMemoryMapOutput}} and {{OnDiskMapOutput}}, and I will continue to work on the above-mentioned code points.
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225864#comment-14225864 ] Gera Shegalov commented on MAPREDUCE-6166: -- Sounds good, [~eepayne]. I have a few comments then. Some are in light of a follow-up JIRA. bq. update this patch with the final keyword on JobConf jobConf Let us make the instance variable type a more general {{Configuration}} as we are not doing anything specific to {{JobConf}}. Instead of introducing a new local variable iFin in {{OnDiskMapOutput#shuffle}}, we can overwrite {{input}} as in {{InMemoryMapOutput#shuffle}}. We can either capture the shuffle size in an instance variable as {{InMemoryMapOutput}} does implicitly via {{memory.length}}. Or we can set {code} long bytesLeft = compressedLength - ((IFileInputStream)input).getSize() {code} Then we don't need to touch the line {{input.read}} to do {{readWithChecksum}} Good call adding {{finally}} with {{close}}. I also have some comments for the test: {{ios.finish()}} should be removed because it's redundant: {{IFileOutputStream#close()}} will call it as well. We don't need the PrintStream wrapping, and we need to be careful not to leak file descriptors in case I/O fails. {code} new PrintStream(fout).print(bout.toString()); fout.close(); {code} Should be something like: {code} try { fout.write(bout.toByteArray()); } finally { fout.close(); } {code} Similarly we need to make sure that {{fin.close()}} is in a try-finally block enclosing the header and shuffle read. Let us not do {code} catch(Exception e) { fail("OnDiskMapOutput.shuffle did not process the map partition file"); {code} It's redundant because the exception is failing the test already. Same PrintStream and fout.close remarks for the code creating the corrupted file. {{dataSize/2}}: I believe the Sun Java Coding Style requires spaces around arithmetic operations. In the fragment where we expect the checksum to fail, {{fin.close()}} should be in some finally. 
{{catch(Exception e)}} is too broad. Let us be more specific and maybe even log it: {code} } catch(ChecksumException e) { LOG.info("Expected checksum exception thrown.", e); } {code} Thinking a bit more about the file.out, it does not seem to be cleaned up after the test has finished. But we probably don't even need to create files, we can simply use {{new ByteArrayInputStream(bout.toByteArray())}} and {{new ByteArrayInputStream(corrupted)}} as input.
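The stream-hygiene suggestions above (try/finally around the write, and ByteArrayInputStream instead of temporary files) can be sketched with plain java.io classes. The names below are illustrative, not the actual test code from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of the review suggestions: write bytes inside try/finally so
// the descriptor is released even if I/O fails, and feed test input
// from a ByteArrayInputStream instead of a temporary file on disk.
public class StreamHygieneSketch {
  static void writeAll(OutputStream fout, byte[] data) throws IOException {
    try {
      fout.write(data);
    } finally {
      fout.close(); // runs whether or not write() threw
    }
  }

  static byte[] readAll(InputStream fin) throws IOException {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = fin.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      return buf.toByteArray();
    } finally {
      fin.close();
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] bout = "ifile payload".getBytes("UTF-8");
    ByteArrayOutputStream fout = new ByteArrayOutputStream();
    writeAll(fout, bout);
    // No file.out left behind: the "file" is just an in-memory buffer.
    byte[] back = readAll(new ByteArrayInputStream(fout.toByteArray()));
    System.out.println(new String(back, "UTF-8")); // ifile payload
  }
}
```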
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224834#comment-14224834 ] Hadoop QA commented on MAPREDUCE-6166: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683592/MAPREDUCE-6166.v2.201411251627.txt against trunk revision 61a2510. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5050//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5050//console This message is automatically generated. 
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223448#comment-14223448 ] Gera Shegalov commented on MAPREDUCE-6166: -- This patch adds one more common instance field, configuration: {{JobConf jobConf}} :). It should be final by the way.
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223433#comment-14223433 ] Jason Lowe commented on MAPREDUCE-6166: --- I'm not sure we need all the boilerplate of an extra class to save one line of code (two if we count the MergeManager member), and I'm not sure that extra class alone will make it clear that MapOutput can be used externally. IMHO if we want to do this kind of refactoring then that can be done as another JIRA.
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223368#comment-14223368 ] Gera Shegalov commented on MAPREDUCE-6166: -- Hi [~jlowe], thanks for pointing out the 3rd-party use cases, I completely forgot about it. So how about we make it explicit that InMemoryMapOutput and OnDiskMapOutput are different from 3rd-party (so I don't forget it next time) by having them subclass a common class. We can put the common IFileInputStream wrapping logic there, and maybe even move {{private final MergeManagerImpl merger;}}.
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223009#comment-14223009 ] Jason Lowe commented on MAPREDUCE-6166: --- I'd be a little wary of doing this. I believe the MergeManager and MapOutput classes are being used by third-party software like SyncSort, see MAPREDUCE-4808, MAPREDUCE-4039, and related JIRAs. Changing the input stream passed to mapOutput.shuffle to an IFileInputStream and then calling read() on the data subtly changes the behavior. Before, when it was an IFileInputStream, calling read() would read all the data and the checksum. After it's wrapped at a higher level it won't. If the third-party software is itself wrapping the stream with IFileInputStream to handle the trailing checksum then after this change the stream would be double-wrapped and checksum verification would fail.
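Jason's double-wrapping hazard can be modeled with a toy trailing-checksum frame: verifying once strips the trailer cleanly, while verifying the already-verified payload a second time misreads real data as a checksum and fails. CRC32 stands in for the IFile checksum here; nothing below is the actual Hadoop code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Toy model of the double-wrapping hazard. A "framed" buffer is
// payload + trailing CRC32, like an IFile stream's trailing checksum.
// Verifying the same data twice (the analogue of wrapping an already
// wrapped stream in a second IFileInputStream) misreads the tail of
// the real payload as a checksum and fails.
public class DoubleWrapSketch {
  static byte[] frame(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    ByteBuffer buf = ByteBuffer.allocate(payload.length + 8);
    buf.put(payload).putLong(crc.getValue());
    return buf.array();
  }

  /** Strip and verify the trailing checksum, returning the payload. */
  static byte[] unframe(byte[] framed) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(framed);
    byte[] payload = new byte[framed.length - 8];
    buf.get(payload);
    long expected = buf.getLong();
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    if (crc.getValue() != expected) {
      throw new IOException("checksum mismatch");
    }
    return payload;
  }

  public static void main(String[] args) throws IOException {
    byte[] framed = frame("map output partition data".getBytes("UTF-8"));
    byte[] once = unframe(framed); // single wrap: verifies cleanly
    System.out.println(once.length); // 25
    try {
      unframe(once); // "double wrap": payload tail treated as a checksum
    } catch (IOException e) {
      System.out.println("double verification failed as expected");
    }
  }
}
```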
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222194#comment-14222194 ] Gera Shegalov commented on MAPREDUCE-6166: -- Hi [~eepayne], thanks for reporting the issue. If you look at {{InMemoryMapOutput#shuffle}}, the first thing it does is overwrite the passed InputStream with the IFileInputStream-wrapped version of it. So if we simply move this logic from there to the caller of {{mapOutput.shuffle}}, i.e., {{Fetcher#setupShuffleConnection}}, this common behavior is automatically consumed by both InMemory and OnDisk and we don't have to modify the latter.
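A rough sketch of the hoisting Gera proposes: the caller wraps the socket stream in the checksum-verifying reader once, and every MapOutput variant consumes the already-wrapped stream (compatibility caveats discussed in this thread aside). All class and method names are hypothetical stand-ins, not the real Hadoop interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Abstract sketch of hoisting the stream wrapping into the caller so
// both in-memory and on-disk map outputs see the same verified stream.
// All names are hypothetical stand-ins for the Hadoop classes.
public class HoistWrapSketch {
  /** Stand-in for IFileInputStream: tracks bytes read where the real
   *  class would verify a trailing checksum. */
  static class VerifyingStream extends java.io.FilterInputStream {
    long bytesRead = 0;
    VerifyingStream(InputStream in) { super(in); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
      int n = super.read(b, off, len);
      if (n > 0) bytesRead += n;
      return n;
    }
    @Override public int read() throws IOException {
      int c = super.read();
      if (c >= 0) bytesRead++;
      return c;
    }
  }

  interface MapOutput {
    long shuffle(InputStream input) throws IOException; // returns bytes consumed
  }

  // Both variants get verified input "for free" because the caller
  // (the Fetcher analogue) did the wrapping exactly once.
  static long fetch(InputStream rawSocketStream, MapOutput target) throws IOException {
    VerifyingStream wrapped = new VerifyingStream(rawSocketStream);
    target.shuffle(wrapped);
    return wrapped.bytesRead;
  }

  public static void main(String[] args) throws IOException {
    byte[] partition = new byte[1024];
    MapOutput onDisk = input -> {
      byte[] buf = new byte[256];
      long total = 0;
      int n;
      while ((n = input.read(buf)) != -1) total += n;
      return total;
    };
    System.out.println(fetch(new ByteArrayInputStream(partition), onDisk)); // 1024
  }
}
```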
[jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
[ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222161#comment-14222161 ] Hadoop QA commented on MAPREDUCE-6166: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12683171/MAPREDUCE-6166.v1.201411221941.txt against trunk revision a4df9ee. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5044//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5044//console This message is automatically generated. 