[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531362#comment-14531362 ]

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

I understand you want the error message to be 'clean'; normally I would do that too. However, this message can only appear if you are using (or have been using for a long time) a usually custom FileInputFormat that has been corrupting your results, perhaps for years (note that I created this bug report about 4.5 years ago). I think it is important to clarify the impact of the problem that the author of the custom code introduced, so they immediately understand what went wrong. And yes, this message is a bit over the top. At least let's make the message explain clearly what went wrong and make people aware of the historical implications. How about:

{{"A split was attempted for a file that is being decompressed by " + codec.getClass().getSimpleName() + ", which does not support splitting. Note that this would have corrupted the data in older Hadoop versions."}}


org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
-------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-2094
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
            Reporter: Niels Basjes
            Assignee: Niels Basjes
              Labels: BB2015-05-TBR
         Attachments: MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-2015-05-05-2328.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch

When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety, producing garbage results and taking up far more CPU time than needed. It took a while to figure out; what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat|http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup] is simply {{return true;}}. This is a very unsafe default and contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively does: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec."

For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does {{return false;}}. Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
# Implement something that looks at the compression codec used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
# Force developers to think about it by making this method abstract.
# Use a safe default (i.e. {{return false;}}).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
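The core of the problem described above is that gzip is a stream format with no sync points, so a reader cannot start at an arbitrary split offset. A minimal, self-contained sketch of this using only the JDK (not the Hadoop codec API; the class name here is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipException;

public class GzipSplitDemo {
    public static void main(String[] args) throws Exception {
        // Compress a batch of records, as a gzipped input file would be.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (int i = 0; i < 1000; i++) {
                gz.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        byte[] compressed = buf.toByteArray();

        // Simulate a split that starts halfway into the compressed bytes,
        // which is what a splitting FileInputFormat would hand to a task.
        int mid = compressed.length / 2;
        boolean midStreamFailed = false;
        try (ByteArrayInputStream half = new ByteArrayInputStream(
                compressed, mid, compressed.length - mid)) {
            new GZIPInputStream(half).read();
        } catch (ZipException e) {
            // Gzip has no split/sync markers, so a reader cannot start
            // anywhere except byte 0 of the stream.
            midStreamFailed = true;
        }
        if (!midStreamFailed) {
            throw new AssertionError("mid-stream gzip read unexpectedly succeeded");
        }
        System.out.println("mid-stream gzip read failed, as expected");
    }
}
```

Because a task cannot decode from the middle of the stream, each split ends up decompressing the file from byte 0, which is consistent with the whole file being processed once per split as reported above.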
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531367#comment-14531367 ]

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

P.S. I'm unable to provide an updated patch for the next week or so, so please settle on a good message so that people can start catching this problem.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531171#comment-14531171 ]

Chris Douglas commented on MAPREDUCE-2094:
------------------------------------------

[Given|http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201405.mbox/%3CCADoiZqoBKme-HYoM%3DhRxPEs1w2qdevo0%3DaoihqiWT4vS8D42Yg%40mail.gmail.com%3E] [the discussion|https://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201406.mbox/%3ccadoizqoqkpn_7b9w75dcrvjxz1sqbkryqbrwlw1rwo26a4e...@mail.gmail.com%3E] on the dev list, the following error message:
{noformat}
+      throw new IOException(
+          "Implementation bug in the used FileInputFormat: "
+          + "The isSplitable method returned 'true' on a file that "
+          + "was compressed with a non splittable compression codec. "
+          + "If you get this right after upgrading Hadoop then know "
+          + "that you have been looking at reports based on "
+          + "corrupt data for a long time !!! (see: MAPREDUCE-2094)");
{noformat}
is a little over the top. Please just report the error detected, e.g. {{"Cannot seek in " + codec.getClass().getSimpleName() + " compressed stream"}}.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530935#comment-14530935 ]

Allen Wittenauer commented on MAPREDUCE-2094:
---------------------------------------------

We're talking about the whitespace plugin being broken over in HADOOP-11923.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529382#comment-14529382 ]

Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 57s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 40s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 47s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 53s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 16s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | mapreduce tests | 1m 38s | Tests passed in hadoop-mapreduce-client-core. |
| | | | 38m 46s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730617/MAPREDUCE-2094-2015-05-05-2328.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0100b15 |
| whitespace | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/artifact/patchprocess/whitespace.txt |
| hadoop-mapreduce-client-core test log | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5648/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529260#comment-14529260 ]

Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730601/MAPREDUCE-2094-20140727-svn-fixed-spaces.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0100b15 |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5647/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525562#comment-14525562 ]

Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 31s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. |
| {color:green}+1{color} | javac | 7m 27s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 3s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 1s | The patch has 9 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | mapreduce tests | 1m 34s | Tests passed in hadoop-mapreduce-client-core. |
| | | | 38m 0s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12658039/MAPREDUCE-2094-20140727-svn.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6ae2a0d |
| whitespace | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/artifact/patchprocess/whitespace.txt |
| hadoop-mapreduce-client-core test log | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5609/console |

This message was automatically generated.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075606#comment-14075606 ]

Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

Question for [~gian], [~ggoodson] and others who have faced this problem: did you use the 'out of the box' LineRecordReader (or a subclass of it), as shown in https://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat, or did you write something 'completely new'? When I ran into this I followed the tutorial. I think most problem scenarios (i.e. everyone who implements it the way the tutorial shows) can be caught at the 'next best' spot: let the LineRecordReader fail the entire job when it is initialized with a file in a non-splittable codec and the provided split is not the entire file. What do you think?
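The safety net proposed in the comment above can be sketched as a small, self-contained check. The class and method names here (SplitGuard, checkSplit) are illustrative only, not the actual Hadoop LineRecordReader API:

```java
import java.io.IOException;

// Hypothetical sketch of the proposed guard: fail fast when a record
// reader is handed a partial split of a file whose codec does not
// support splitting.
public class SplitGuard {
    static void checkSplit(boolean codecIsSplittable,
                           long splitStart, long splitLength,
                           long fileLength) throws IOException {
        if (!codecIsSplittable
                && (splitStart != 0 || splitLength != fileLength)) {
            throw new IOException(
                "File is compressed with a non-splittable codec but the "
                + "split does not cover the whole file (see MAPREDUCE-2094)");
        }
    }

    public static void main(String[] args) throws IOException {
        // A split covering the whole file is always acceptable.
        checkSplit(false, 0, 1024, 1024);
        // A partial split of an unsplittable file must fail the job.
        boolean fired = false;
        try {
            checkSplit(false, 512, 512, 1024);
        } catch (IOException expected) {
            fired = true;
        }
        if (!fired) {
            throw new AssertionError("guard did not fire on a partial split");
        }
        System.out.println("guard fired on the partial split, as expected");
    }
}
```

Failing at reader initialization would catch the corruption even for custom FileInputFormat subclasses that inherit the unsafe {{return true;}} default, because the bad split reaches the reader regardless of which input format produced it.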
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075633#comment-14075633 ]

Gian Merlino commented on MAPREDUCE-2094:
-----------------------------------------

I did use the LineRecordReader. I think a safety net like the one you described would have been useful.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075706#comment-14075706 ] Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
against trunk revision.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console

This message is automatically generated.


org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-2094
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
            Reporter: Niels Basjes
            Assignee: Niels Basjes
         Attachments: MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch

When implementing a custom derivative of FileInputFormat we ran into the effect that a large gzipped input file would be processed several times: a file of nearly 1 GiB was processed around 36 times in its entirety, producing garbage results and taking far more CPU time than needed.

It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat|http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup] is simply {{return true;}}. This is a very unsafe default and contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively behaves as: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec."

For our situation (where we always have gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does {{return false;}}.

Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
# Implement something that looks at the used compression of the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
# Force developers to think about it and make this method abstract.
# Use a safe default (i.e. {{return false;}}).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
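The "easy way out" the report describes can be sketched in a few lines. This is a minimal, self-contained illustration: the `Path` and `FileInputFormatSketch` classes below are hand-rolled stand-ins for the real Hadoop types, not the actual API, and `GzipOnlyInputFormat` is a hypothetical name for the kind of custom format the reporter had.

```java
// Stand-in for org.apache.hadoop.fs.Path (not the real Hadoop class).
class Path {
    private final String name;
    Path(String name) { this.name = name; }
    String getName() { return name; }
}

// Stand-in mirroring the unsafe default the report describes:
// the base class claims every file is splittable.
abstract class FileInputFormatSketch {
    protected boolean isSplitable(Path file) { return true; }
}

// The reporter's workaround: a format that only ever reads gzipped
// input, so it unconditionally refuses to be split.
class GzipOnlyInputFormat extends FileInputFormatSketch {
    @Override
    protected boolean isSplitable(Path file) {
        return false; // gzip is a stream codec; a split cannot start mid-stream
    }
}
```

This override forces one mapper per file, which trades parallelism for correctness — acceptable when all input is known to be gzipped, but not a general fix.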
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075752#comment-14075752 ] Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12658039/MAPREDUCE-2094-20140727-svn.patch
against trunk revision.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4772//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4772//console
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13648652#comment-13648652 ] Garth Goodson commented on MAPREDUCE-2094:
------------------------------------------

We just hit this bug, and I'm very surprised that it hasn't been fixed. Having a default isSplitable that just returns true is wrong behaviour, and it ended up causing data corruption for our customers. The default behaviour should check whether the codec is splittable, as the other input formats do.
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036388#comment-13036388 ] Hadoop QA commented on MAPREDUCE-2094:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12479785/MAPREDUCE-2094-2011-05-19.patch
against trunk revision 1124553.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/273//console
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036399#comment-13036399 ] Todd Lipcon commented on MAPREDUCE-2094:
----------------------------------------

I think this is still incompatible. I think the majority of file-based input formats *should* be splittable - it's the whole reason that MR scales to big files. It's the minority of cases where files shouldn't be splittable. IMO the correct fix for this issue is to do a better job of documenting FileInputFormat.isSplitable() and the class-level javadoc on FileInputFormat. This way, those who implement it themselves will take note that they need to override it.
[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918851#action_12918851 ] Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

I just noticed that the Yahoo Hadoop tutorial [Module 5: Advanced MapReduce Features|http://developer.yahoo.com/hadoop/tutorial/module5.html] shows a code example for defining your own [FileInputFormat|http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat]. The example code implements a derivative using FileInputFormat and LineRecordReader without overriding isSplitable ... I expect this tutorial code to lead people into this bug.

Since this bug only becomes apparent when using large non-splittable (gzipped) input files, it is also important to note that almost no one will have a (unit) test that will trip on it.
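To see why large files are needed to trip the bug, it helps to model what happens when a gzip file is wrongly split. Each split's reader must decompress from byte 0 and, with no valid split boundary inside the stream, ends up consuming the whole file. The class below is a toy model of that effect (hypothetical names, not Hadoop code); with small test files there is only one split, so the duplication never shows up.

```java
// Toy model of handing a non-seekable (gzip-like) file to multiple splits.
// Each split's record reader restarts decompression at byte 0 and reads to
// EOF, so every split emits every record in the file again.
class SplitCorruptionDemo {
    static int recordsProcessed(int recordsInFile, int numSplits, boolean splittable) {
        if (splittable) {
            return recordsInFile;           // each record is read exactly once
        }
        return recordsInFile * numSplits;   // every split re-reads the entire file
    }

    public static void main(String[] args) {
        // A near-1 GiB gzip file cut into ~36 splits reproduces the
        // "processed around 36 times" effect described in this issue.
        System.out.println(recordsProcessed(1000, 36, false)); // prints 36000
        // With a single split (a small file) the duplication is invisible,
        // which is why unit tests rarely catch this.
        System.out.println(recordsProcessed(1000, 1, false));  // prints 1000
    }
}
```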
[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916076#action_12916076 ] Niels Basjes commented on MAPREDUCE-2094:
-----------------------------------------

bq. Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses.

If by "breaking" you mean "some subclasses will see an unexpected performance degradation", then yes, that will most likely occur (the first one I think of is SequenceFileInputFormat). I do not, however, expect any functional breaking of the output of these subclasses.

bq. Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect.

Yes. The downside of this option is that if subclasses want detection depending on the compression, I expect a lot of code duplication to occur. This duplication is already present within the main code base in KeyValueTextInputFormat, TextInputFormat, CombineFileInputFormat and their old-API counterparts (I found a total of 5 duplicates of the same isSplitable implementation).

bq. Perhaps the javadoc should just be clarified to better document this default?

Definitely an option. However, this would not fix the effect in the existing subclasses. I just did a quick manual check of the current trunk and found that the following classes are derived from FileInputFormat yet do not implement the isSplitable method (and thus use {{return true}}):
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
* ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java

I expect that NLineInputFormat and AutoInputFormat will be affected by this large-gzip bug, so simply fixing the isSplitable documentation would lead to the need to fix *at least* these two classes. As far as I understand, a SequenceFile can only be compressed using a splittable compression, so the {{return true;}} from FileInputFormat will work fine there.

Overall I still prefer the clean option of returning the correct value depending on the compression. That would effectively leave the behaviour in most use cases unchanged, yet in those cases where splitting is known to cause problems it would avoid them - avoiding major issues like the ones we had and described in HADOOP-6901. For SequenceFileInputFormat it may be needed to implement isSplitable as {{return true;}}.

Effectively, the set of changes I propose (in both the old and new API versions of these classes):
# FileInputFormat.isSplitable gets the implementation currently seen in TextInputFormat.
# The isSplitable implementation is removed from KeyValueTextInputFormat, TextInputFormat and CombineFileInputFormat (useless code duplication).
# The isSplitable implementation {{return true}} is added to SequenceFileInputFormat. Given that you cannot gzip a SequenceFile, I expect this to be an optional fix.
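The codec-based decision proposed above (option 1: let FileInputFormat decide from the file's compression, as TextInputFormat does) can be sketched as follows. To keep the example self-contained, the codec interfaces and the suffix-based `CodecFactory` below are simplified stand-ins for Hadoop's `CompressionCodec`, `SplittableCompressionCodec` and `CompressionCodecFactory`, not the real API.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for Hadoop's codec machinery (not the real API).
interface CompressionCodec {}
interface SplittableCompressionCodec extends CompressionCodec {}

class GzipCodec implements CompressionCodec {}            // stream codec: not splittable
class Bzip2Codec implements SplittableCompressionCodec {} // block codec: splittable

// Maps file suffixes to codecs, loosely mimicking CompressionCodecFactory.
class CodecFactory {
    private static final Map<String, CompressionCodec> BY_SUFFIX = new HashMap<>();
    static {
        BY_SUFFIX.put(".gz", new GzipCodec());
        BY_SUFFIX.put(".bz2", new Bzip2Codec());
    }
    // Returns null for uncompressed files.
    static CompressionCodec getCodec(String fileName) {
        for (Map.Entry<String, CompressionCodec> e : BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(e.getKey())) return e.getValue();
        }
        return null;
    }
}

class SafeFileInputFormat {
    // The proposed default: splittable unless the file uses a codec
    // that cannot be split. Uncompressed files always split.
    protected boolean isSplitable(String fileName) {
        CompressionCodec codec = CodecFactory.getCodec(fileName);
        if (codec == null) return true;
        return codec instanceof SplittableCompressionCodec;
    }
}
```

With this default, plain and bzip2 files keep the current splitting behaviour, while gzipped input silently falls back to one mapper per file instead of producing duplicated records.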