[jira] [Updated] (MAPREDUCE-2094) LineRecordReader should not seek into non-splittable, compressed streams.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated MAPREDUCE-2094: - Summary: LineRecordReader should not seek into non-splittable, compressed streams. (was: org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.) LineRecordReader should not seek into non-splittable, compressed streams. - Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Niels Basjes Assignee: Niels Basjes Attachments: M2094.patch, MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-2015-05-05-2328.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. . The actual implementation effectively does Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplittable in our class that does return false; Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method abstract. # Use a safe default (i.e. return false) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-2094) LineRecordReader should not seek into non-splittable, compressed streams.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated MAPREDUCE-2094: - Attachment: M2094-1.patch Ran test-patch locally, all OK except a spurious whitespace and a release audit warning (fixed) LineRecordReader should not seek into non-splittable, compressed streams. - Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Niels Basjes Assignee: Niels Basjes Attachments: M2094-1.patch, M2094.patch, MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-2015-05-05-2328.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. . The actual implementation effectively does Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplittable in our class that does return false; Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method abstract. # Use a safe default (i.e. return false) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-2094) LineRecordReader should not seek into non-splittable, compressed streams.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated MAPREDUCE-2094: - Resolution: Fixed Fix Version/s: 2.8.0 Release Note: (was: Throw an Exception in the most common error scenario present in many FileInputFormat derivatives that do not override isSplitable. ) Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) +1 I committed this to trunk and branch-2. Thanks Niels LineRecordReader should not seek into non-splittable, compressed streams. - Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Niels Basjes Assignee: Niels Basjes Fix For: 2.8.0 Attachments: M2094-1.patch, M2094.patch, MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-2015-05-05-2328.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. . The actual implementation effectively does Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplittable in our class that does return false; Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the used compression of the file (i.e. do migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method abstract. # Use a safe default (i.e. return false) -- This message was sent by Atlassian JIRA (v6.3.4#6332)