[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531362#comment-14531362 ] Niels Basjes commented on MAPREDUCE-2094:
I understand you want the error message to be 'clean'. Normally I would do that too. However, this message can only appear if you are using (or have been using for a long time) a (usually custom) FileInputFormat that has been corrupting your results, perhaps even for years; note that I created this bug report about 4.5 years ago. I think it is important to clarify the impact of the problem the authors of the custom code introduced themselves, so they immediately understand what went wrong. And yes, this message is a bit over the top. At least let's make the message explain clearly what went wrong and make people aware of the historical implications. How about: {{A split was attempted for a file that is being decompressed by + codec.getClass().getSimpleName() + which does not support splitting. Note that this would have corrupted the data in older Hadoop versions.}}

org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
---
Key: MAPREDUCE-2094
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: task
Reporter: Niels Basjes
Assignee: Niels Basjes
Labels: BB2015-05-TBR
Attachments: MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-2015-05-05-2328.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch

When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety, producing garbage results and taking up far more CPU time than needed. It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply {{return true;}}. This is a very unsafe default and contradicts the JavaDoc of the method, which states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." The actual implementation effectively answers: "Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec." For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does {{return false;}}.

There are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
# Implement something that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
# Force developers to think about it and make this method abstract.
# Use a safe default (i.e. {{return false;}}).

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
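The first (preferred) fix option can be sketched in plain Java. This is only an illustration: real Hadoop code would ask CompressionCodecFactory for the file's codec and test whether it implements SplittableCompressionCodec; the extension table and class name below are assumptions made for the sketch, not Hadoop's API.

```java
import java.util.Map;

// Simplified stand-in for Hadoop's codec lookup. Real code would use
// CompressionCodecFactory.getCodec(path) and check whether the result
// implements SplittableCompressionCodec; this map is an assumption.
public class SafeSplitCheck {
    private static final Map<String, Boolean> SPLITTABLE_BY_EXTENSION = Map.of(
        ".txt", true,     // uncompressed text: safe to split
        ".bz2", true,     // bzip2 supports splitting in Hadoop
        ".gz", false,     // gzip is stream compressed: cannot be split
        ".snappy", false  // raw snappy streams cannot be split either
    );

    // Mirrors fix option 1 from the issue: decide splittability from the
    // codec instead of blindly returning true.
    public static boolean isSplitable(String filename) {
        for (Map.Entry<String, Boolean> e : SPLITTABLE_BY_EXTENSION.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return true; // no codec recognized: plain file, splitting is safe
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("access.log.gz")); // false
        System.out.println(isSplitable("access.log"));    // true
    }
}
```

With a default like this, the tutorial-style subclass that forgets to override isSplitable would at least stop splitting gzipped input.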
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531367#comment-14531367 ] Niels Basjes commented on MAPREDUCE-2094:
P.S. I'm unable to provide an updated patch for the next week or so, so please craft a good message so that people can start catching this problem.
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Attachment: MAPREDUCE-2094-20140727-svn-fixed-spaces.patch
I removed the trailing spaces from the lines I touched. However, one of the reported lines is a line I didn't touch in my patch. Question: what should I do about that?
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Status: Open (was: Patch Available)
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Status: Patch Available (was: Open)
Removed the trailing spaces from the lines I touched. @[~aw]: Apparently the trailing-spaces check is also triggered by the trailing space in one of the 'surrounding' lines in the patch file. How should this be handled?
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Attachment: MAPREDUCE-2094-2015-05-05-2328.patch
This should work.
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Status: Patch Available (was: Open)
[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075606#comment-14075606 ] Niels Basjes commented on MAPREDUCE-2094:
Question for [~gian], [~ggoodson] and others who have faced this problem: did you use the 'out of the box' LineRecordReader (or a subclass of that) as shown in https://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat, or did you write something 'completely new'? When I ran into this I followed the tutorial. I think the 'next best' spot to stop most problem scenarios (i.e. everyone who implements it the way the tutorial shows) is to let the LineRecordReader fail the entire job when it is initialized with a file compressed by a non-splittable codec and the provided split is not the entire file. What do you think?
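The fail-fast check proposed in the comment above might look roughly like this. A minimal sketch, assuming plain parameters in place of the FileSplit and CompressionCodec objects the real LineRecordReader.initialize() receives; the class and parameter names are hypothetical.

```java
public class SplitGuard {
    /**
     * Sketch of the proposed fail-hard check: if the file is compressed
     * with a non-splittable codec, the only valid split is the whole file.
     * Anything else means the job would silently duplicate data, so we
     * abort instead. Parameter names are illustrative only.
     */
    public static void checkSplit(boolean codecIsSplittable,
                                  long splitStart, long splitLength,
                                  long fileLength, String codecName) {
        if (codecIsSplittable) {
            return; // splitting is safe for this codec
        }
        if (splitStart != 0 || splitLength != fileLength) {
            throw new IllegalArgumentException(
                "A split was attempted for a file that is being decompressed by "
                + codecName + " which does not support splitting. "
                + "Note that this would have corrupted the data in older Hadoop versions.");
        }
    }

    public static void main(String[] args) {
        // Whole file as one split: accepted even for gzip.
        checkSplit(false, 0, 1024, 1024, "GzipCodec");
        try {
            // Second half of a gzipped file: rejected, job fails hard.
            checkSplit(false, 512, 512, 1024, "GzipCodec");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Because every FileInputFormat built on LineRecordReader passes through initialize(), a guard like this catches the tutorial-derived subclasses without requiring their authors to change anything.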
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Attachment: MAPREDUCE-2094-20140727.patch
This is the patch I created to throw an exception in the most common case where the described problem occurs, so this is a 'fail hard when things have already gone wrong' patch.
NOTE 1: This patch includes all the JavaDoc improvements created by [~gian].
NOTE 2: To create a unit test I needed to include a small gzipped file in the patch. I hope I did that the right way.
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Release Note: Throw an Exception in the most common error scenario present in many FileInputFormat derivatives that do not override isSplitable. (was: Fixed splitting errors present in many FileInputFormat derivatives that do not override isSplitable.)
Status: Patch Available (was: Open)
This fix does not solve the problem in all scenarios, as that rabbit hole proved to be too deep. It does, however, fix all scenarios where a new FileInputFormat is created that also uses the LineRecordReader (or a subclass) and forgets to implement the correct isSplitable. This is exactly what happens if someone follows the example from the Yahoo Hadoop tutorial ( https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat ) to create their own input format. This patch makes the job fail when it detects the corruption scenario. I haven't tried it yet, but with this patch, if you feed a large Gzipped file into the software from the mentioned tutorial, it should now fail the entire job hard the moment it creates a second split.
NOTE 1: This patch includes all the JavaDoc improvements created by [~gian].
NOTE 2: To create a unit test I needed to include a small gzipped file in the patch. I hope I did that the right way.
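The corruption mechanism itself can be reproduced with only the JDK: gzip can only be decompressed from byte 0, so a record reader handed one split of a gzipped file ends up decompressing the entire file, and with N splits every record is emitted N times. Below is a self-contained simulation; the per-split handling is deliberately naive, mirroring what a custom FileInputFormat with the default isSplitable effectively does, and is not Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Demonstrates why splitting a gzipped file duplicates data: each "split"
// can only decompress from the start of the stream, so it reproduces the
// whole file. With 36 splits, a 3-record file yields 108 records.
public class GzipSplitDemo {
    static byte[] gzip(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    static String gunzip(byte[] data) throws Exception {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Each "split" ignores its byte range and decompresses the whole
    // stream, which is what happens when isSplitable wrongly returns true.
    public static int recordsProduced(byte[] gzipped, int numSplits) throws Exception {
        int total = 0;
        for (int split = 0; split < numSplits; split++) {
            total += gunzip(gzipped).split("\n").length;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = gzip("a\nb\nc"); // 3 records
        System.out.println(recordsProduced(data, 36)); // prints 108, not 3
    }
}
```

This matches the symptom in the issue description: a near 1GiB gzipped file processed in its entirety once per split, around 36 times.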
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094:
Status: Open (was: Patch Available)
Problem with the binary file in the patch file.
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Attachment: MAPREDUCE-2094-20140727-svn.patch Created new patch file via the svn route. In this the binary file is now an ASCII file. This is possible because the content of the file is irrelevant, only the filename and the size ( 3 bytes) are what matters. org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. --- Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Niels Basjes Assignee: Niels Basjes Attachments: MAPREDUCE-2094-2011-05-19.patch, MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety. Thus producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out and what we found is that the default implementation of the isSplittable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. . The actual implementation effectively does Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. 
For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does {{return false;}}. Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable):
# Implement something that looks at the used compression of the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes.
# Force developers to think about it and make this method abstract.
# Use a safe default (i.e. return false).
-- This message was sent by Atlassian JIRA (v6.2#6252)
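Option 1 can be sketched as follows. This is a simplified, self-contained model of the decision only: the real fix would consult CompressionCodecFactory and the SplittableCompressionCodec interface from org.apache.hadoop.io.compress, whereas the small extension table here is purely illustrative.

```java
import java.util.Map;

// Sketch of option 1: decide splittability from the file's compression codec,
// the way TextInputFormat does, instead of blindly returning true.
public class SplitCheck {
    // Hypothetical codec table: extension -> splittable? A missing entry means
    // the file is uncompressed. In Hadoop this lookup is CompressionCodecFactory.
    private static final Map<String, Boolean> CODECS = Map.of(
        ".gz",     false,  // gzip: stream-compressed, cannot be split mid-stream
        ".bz2",    true,   // bzip2: block-oriented, splittable
        ".snappy", false   // raw snappy: not splittable
    );

    public static boolean isSplitable(String filename) {
        for (Map.Entry<String, Boolean> e : CODECS.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                return e.getValue();  // split only if the codec supports it
            }
        }
        return true;  // no codec matched: uncompressed, safe to split
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("logs/access.log.gz")); // false
        System.out.println(isSplitable("logs/access.log"));    // true
    }
}
```

With this default, a custom FileInputFormat derivative that forgets to override isSplitable degrades safely on gzipped input instead of silently corrupting results.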
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Status: Patch Available (was: Open) This patch no longer contains the binary file.
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035604#comment-14035604 ] Niels Basjes commented on MAPREDUCE-5928: - I changed some of the memory settings and now the job completes successfully. This was with mapreduce.job.reduce.slowstart.completedmaps at its default value of 0.05 (5%). Apparently without the blacklisted node it works fine and the fractional memory shenanigans don't impact the job. Deadlock allocating containers for mappers and reducers --- Key: MAPREDUCE-5928 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5928 Project: Hadoop Map/Reduce Issue Type: Bug Environment: Hadoop 2.4.0 (as packaged by HortonWorks in HDP 2.1.2) Reporter: Niels Basjes Attachments: AM-MR-syslog - Cleaned.txt.gz, Cluster fully loaded.png.jpg, MR job stuck in deadlock.png.jpg I have a small cluster consisting of 8 desktop-class systems (1 master + 7 workers). Due to the small memory of these systems I configured yarn as follows:
{quote}
yarn.nodemanager.resource.memory-mb = 2200
yarn.scheduler.minimum-allocation-mb = 250
{quote}
On my client I did
{quote}
mapreduce.map.memory.mb = 512
mapreduce.reduce.memory.mb = 512
{quote}
Now I run a job with 27 mappers and 32 reducers. After a while I saw this deadlock occur:
- All nodes had been filled to their maximum capacity with reducers.
- 1 Mapper was waiting for a container slot to start in.
I tried killing reducer attempts but that didn't help (new reducer attempts simply took the existing container). *Workaround*: I set this value from my job. The default value is 0.05 (= 5%):
{quote}
mapreduce.job.reduce.slowstart.completedmaps = 0.99f
{quote}
-- This message was sent by Atlassian JIRA (v6.2#6252)
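For reference, the effective container size follows from the scheduler's rounding rule: YARN normalizes every container request up to a multiple of yarn.scheduler.minimum-allocation-mb. A minimal self-contained sketch of that arithmetic, assuming the DefaultResourceCalculator's round-up behaviour (this is not the actual scheduler code):

```java
// Model of YARN's container-size normalization and per-node capacity,
// using the numbers from this issue's configuration.
public class ContainerMath {
    // Requests are rounded up to the next multiple of the minimum allocation.
    static long normalize(long requestedMb, long minAllocMb) {
        return ((requestedMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    // How many such containers fit into one NodeManager's memory budget.
    static long containersPerNode(long nodeMb, long containerMb) {
        return nodeMb / containerMb;
    }

    public static void main(String[] args) {
        long containerMb = normalize(512, 250);               // 750 MB
        long perNode = containersPerNode(2200, containerMb);  // 2 containers
        System.out.println(containerMb + " MB per container, "
                + perNode + " per node");
    }
}
```

So the 512 MB requests effectively cost 750 MB each, leaving room for only 2 task containers per 2200 MB node; with one worker blacklisted, that tight budget is what the reducers saturated.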
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035819#comment-14035819 ] Niels Basjes commented on MAPREDUCE-5928: - Reading through the description of YARN-1680, that sure seems like the root cause of my problem. So yes, go ahead and mark this one as a duplicate.
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033614#comment-14033614 ] Niels Basjes commented on MAPREDUCE-5928: - I have trouble finding the spot where this 500MB per container is defined. Can you give me a pointer to where this is specified?
[jira] [Created] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
Niels Basjes created MAPREDUCE-5928: --- Summary: Deadlock allocating containers for mappers and reducers Key: MAPREDUCE-5928 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5928 Project: Hadoop Map/Reduce Issue Type: Bug Environment: Hadoop 2.4.0 (as packaged by HortonWorks in HDP 2.1.2) Reporter: Niels Basjes
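The workaround works because the slowstart fraction gates when reducer containers may be requested at all. A minimal model of that gate (the threshold arithmetic is an assumption for illustration; the real logic lives in the MapReduce ApplicationMaster's allocator):

```java
// Model of mapreduce.job.reduce.slowstart.completedmaps: reducers may only be
// scheduled once this fraction of the job's maps has completed.
public class Slowstart {
    static boolean canStartReducers(int completedMaps, int totalMaps,
                                    double slowstart) {
        return completedMaps >= Math.ceil(slowstart * totalMaps);
    }

    public static void main(String[] args) {
        // Default 0.05: with 27 maps, reducers start after only 2 maps finish,
        // so they can grab every container before the remaining maps run.
        System.out.println(canStartReducers(2, 27, 0.05));   // true
        // Workaround 0.99: reducers wait until all 27 maps are done.
        System.out.println(canStartReducers(26, 27, 0.99));  // false
    }
}
```

Raising the fraction to 0.99 therefore sidesteps the deadlock by never letting reducers compete with still-pending mappers for containers.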
[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-5928: Attachment: MR job stuck in deadlock.png.jpg
[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-5928: Attachment: Cluster fully loaded.png.jpg NOTE: Node2 had issues so the system took it offline (0 containers). Perhaps this is what confused the MapReduce application?
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032476#comment-14032476 ] Niels Basjes commented on MAPREDUCE-5928: - I'm not the only one who ran into this: http://hortonworks.com/community/forums/topic/mapreduce-race-condition-big-job/
[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-5928: Attachment: AM-MR-syslog - Cleaned.txt.gz I downloaded the Application Master log and attached it to this issue. (I changed the domain name of the nodes.)
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032516#comment-14032516 ] Niels Basjes commented on MAPREDUCE-5928: - I have not actively configured any scheduling, so I guess it is running with the default setting?
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032564#comment-14032564 ] Niels Basjes commented on MAPREDUCE-5928: - I took the 'dead' node (node2) offline (completely stopped all hadoop/yarn related daemons) and, after it had disappeared from all overviews, ran the same job again. Now it does complete all mappers.
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032766#comment-14032766 ] Niels Basjes commented on MAPREDUCE-5928: - Where/how can I determine for sure if the capacity scheduler is used?
[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers
[ https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032831#comment-14032831 ] Niels Basjes commented on MAPREDUCE-5928: - Confirmed: it is using the CapacityScheduler:
{code}
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <source>yarn-default.xml</source>
</property>
{code}
I'm going to fiddle with the memory settings tomorrow.
[jira] [Created] (MAPREDUCE-5925) NLineInputFormat silently produces garbage on gzipped input
Niels Basjes created MAPREDUCE-5925: --- Summary: NLineInputFormat silently produces garbage on gzipped input Key: MAPREDUCE-5925 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5925 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Niels Basjes Priority: Critical [ Found while investigating the impact of MAPREDUCE-2094 ] The org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (probably the mapred version too) only makes sense for splittable files. This input format uses the isSplitable implementation from its superclass FileInputFormat (which always returns true) in combination with the LineRecordReader. When you provide it a gzipped file (non-splittable compression) it will create multiple splits (isSplitable == true), yet the LineRecordReader cannot handle a gzipped file in multiple splits because the GzipCodec does not support this. The overall effect is that you get incorrect results. Proposed solution: add detection for this kind of scenario and let the NLineInputFormat fail hard when someone tries this. I'm not sure if this should go into the LineRecordReader or only in the NLineInputFormat. -- This message was sent by Atlassian JIRA (v6.2#6252)
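The proposed fail-hard detection might look roughly like this. A self-contained sketch under stated assumptions: the extension test stands in for Hadoop's CompressionCodecFactory lookup, the method name and message are hypothetical, and in real code the check would live in LineRecordReader or NLineInputFormat rather than a standalone class.

```java
// Sketch of the proposed guard: reject any split that starts at a non-zero
// offset inside a file whose codec cannot be decompressed mid-stream
// (gzip stands in for all such codecs here).
public class SplitGuard {
    static void checkSplit(String filename, long splitStart) {
        boolean unsplittableCodec = filename.endsWith(".gz");
        if (unsplittableCodec && splitStart != 0) {
            throw new IllegalStateException(
                "File " + filename + " is compressed with a codec that does not"
                + " support splitting, but a split starting at offset "
                + splitStart + " was requested; results would be garbage.");
        }
    }

    public static void main(String[] args) {
        checkSplit("input.txt", 67108864); // fine: uncompressed file
        checkSplit("input.gz", 0);         // fine: the only valid gzip split
        try {
            checkSplit("input.gz", 67108864);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing hard at split time turns a silent-corruption bug into an immediate, explainable job failure.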
[jira] [Commented] (MAPREDUCE-346) Report Map-Reduce Framework Counters in pipeline order
[ https://issues.apache.org/jira/browse/MAPREDUCE-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540994#comment-13540994 ] Niels Basjes commented on MAPREDUCE-346: I just ran the wordcount example from the current trunk and the output looks like this:
{code}
File System Counters
    FILE: Number of bytes read=1355162
    FILE: Number of bytes written=4112351
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=1082
    Map output records=3884
    Map output bytes=53040
    Map output materialized bytes=42292
    Input split bytes=2785
    Combine input records=3884
    Combine output records=2263
    Reduce input groups=990
    Reduce shuffle bytes=0
    Reduce input records=2263
    Reduce output records=990
    Spilled Records=5299
    Shuffled Maps =0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=677
    CPU time spent (ms)=0
    Physical memory (bytes) snapshot=0
    Virtual memory (bytes) snapshot=0
    Total committed heap usage (bytes)=8783134720
File Input Format Counters
    Bytes Read=38869
File Output Format Counters
    Bytes Written=21444
{code}
To me the order of a few items in this list appears out of place: I would expect Shuffled Maps, Failed Shuffles and Merged Map outputs somewhere between the map and the reduce entries instead of after the reduce. Other than that (and the trailing space at the end of Shuffled Maps ) I propose we close this improvement as Fixed. Report Map-Reduce Framework Counters in pipeline order -- Key: MAPREDUCE-346 URL: https://issues.apache.org/jira/browse/MAPREDUCE-346 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Sharad Agarwal Assignee: Sharad Agarwal Attachments: 5216_v1.patch Currently there is no order in which counters are printed. It would be more user friendly if Map-Reduce Framework counters were reported in the pipeline order. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Attachment: MAPREDUCE-2094-2011-05-19.patch I've created a patch that in my mind fixes this issue the correct way. Whether this is really the case is very open to discussion. There are basically 4 current situations:
# People use the existing FileInputFormat derivatives. This patch ensures that all of those that are present in the existing code base (including the examples) still retain the same behavior as before.
# People have created a new derivative and HAVE overridden isSplitable with something that fits their needs. This patch does not change those situations.
# People have created a new derivative and have *NOT* overridden isSplitable.
## If their input is in a splittable form (like LZO or uncompressed), then this patch will not affect them.
## In the situation where they have big non-splittable input files (like gzip files) they will have run into unexpected errors. Possibly they will have spent a lot of time looking for performance issues and wrong results in production that did not occur during unit testing (we did!). This patch will fix this problem without any code changes in their code base.
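The scale of the damage in the last situation follows directly from the arithmetic: each of the N splits of a gzipped file forces its record reader to decompress from byte 0, so the entire file is read once per split. A back-of-the-envelope sketch (the split size is an assumed value, chosen only to roughly match the "around 36 times" from the issue description):

```java
// Model of the cost of splitting an unsplittable gzip file: with
// isSplitable == true the file is cut into ceil(fileSize / splitSize) splits,
// and every split ends up reading the whole file from the start.
public class GzipSplitCost {
    static long totalBytesRead(long fileSize, long splitSize) {
        long splits = (fileSize + splitSize - 1) / splitSize;  // ceil division
        return splits * fileSize;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 30;    // ~1 GiB, as in the report
        long splitSize = 28L << 20;  // assumed ~28 MB split size -> ~37 splits
        System.out.println(totalBytesRead(fileSize, splitSize) / fileSize
                + "x the file is read");
    }
}
```

That one multiplication explains both the garbage output (each split re-emits the whole file's records) and the wasted CPU time.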
[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Assignee: Niels Basjes Affects Version/s: (was: 0.20.2) (was: 0.20.1) (was: 0.21.0) Release Note: Fixed splitting errors present in many FileInputFormat derivatives that do not override isSplitable. Status: Patch Available (was: Open) I have not marked this as an Incompatible change because the behaviour only changes in situations where there was an error condition. org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. --- Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Reporter: Niels Basjes Assignee: Niels Basjes Attachments: MAPREDUCE-2094-2011-05-19.patch When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety, producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method, which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. The actual implementation effectively does: Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec.
For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does return false;. Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method abstract. # Use a safe default (i.e. return false) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
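The "easy way out" workaround described above (overriding isSplitable to always return false) can be sketched as follows. This is a minimal, self-contained illustration: FileInputFormatStub is a hypothetical stand-in for org.apache.hadoop.mapreduce.lib.input.FileInputFormat so the pattern runs without a Hadoop dependency, and the class names are illustrative.

```java
// FileInputFormatStub is a hypothetical stand-in for
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat, so this sketch
// runs without a Hadoop dependency. Only the isSplitable contract matters.
class FileInputFormatStub {
    // The unsafe default in the affected versions: always claim the file
    // can be split, regardless of its compression.
    protected boolean isSplitable(String file) {
        return true;
    }
}

// The workaround from the report: when all input is gzip-compressed,
// refuse splitting unconditionally.
class GzipOnlyInputFormat extends FileInputFormatStub {
    @Override
    protected boolean isSplitable(String file) {
        return false;
    }
}

public class WorkaroundDemo {
    public static void main(String[] args) {
        FileInputFormatStub unsafeDefault = new FileInputFormatStub();
        FileInputFormatStub workaround = new GzipOnlyInputFormat();
        // The default happily offers to split a gzipped file (the bug);
        // the override refuses, so the whole file goes to one mapper.
        System.out.println(unsafeDefault.isSplitable("input.gz")); // true
        System.out.println(workaround.isSplitable("input.gz"));    // false
    }
}
```

The cost of this workaround is that genuinely splittable files handled by the same class also lose parallelism, which is why the comments below argue for a codec-aware check instead.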
[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918851#action_12918851 ] Niels Basjes commented on MAPREDUCE-2094: - I just noticed that the Yahoo Hadoop tutorial [Module 5: Advanced MapReduce Features|http://developer.yahoo.com/hadoop/tutorial/module5.html] shows a code example for defining your own [FileInputFormat|http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat]. The example code implements a derivative using FileInputFormat and LineRecordReader without overriding isSplitable, so I expect this tutorial code to lead people into this bug. Since the bug only becomes apparent with large non-splittable (e.g. gzipped) input files, it is also important to note that almost no one will have a (unit) test that trips on it.
[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916076#action_12916076 ] Niels Basjes commented on MAPREDUCE-2094: - bq. Changing the default implementation to return false would be an incompatible change, potentially breaking existing subclasses. If by "breaking" you mean "some subclasses will see an unexpected performance degradation", then yes, that will most likely occur (the first one I think of is SequenceFileInputFormat). I do not, however, expect any functional breaking of the output of these subclasses. bq. Making the method abstract would also be incompatible and break subclasses, but in a way that they'd easily detect. Yes. The downside of this option is that subclasses that want detection depending on the compression would duplicate a lot of code. This duplication is already present in the main code base in KeyValueTextInputFormat, TextInputFormat, CombineFileInputFormat and their old-API counterparts (I found a total of 5 duplicates of the same isSplitable implementation). bq. Perhaps the javadoc should just be clarified to better document this default? Definitely an option. However, this would not fix the effect in the existing subclasses. I just did a quick manual check of the current trunk and found the following classes that are derived from FileInputFormat yet do not implement isSplitable (and thus use return true;): * ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java * ./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java * ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java I expect that NLineInputFormat and AutoInputFormat will be affected by this large-gzip bug, so simply fixing the isSplitable documentation would lead to the need to fix *at least* those two classes.
As far as I understand, a SequenceFileInputFormat can only be compressed using a splittable compression, so the return true; from FileInputFormat will work fine there. Overall I still prefer the clean option of returning the correct value depending on the compression. That would effectively leave the behaviour in most use cases unchanged, yet in the cases where splitting is known to cause problems it would avoid them, thus avoiding major issues like the ones we had and described in HADOOP-6901. For SequenceFileInputFormat it may be necessary to implement isSplitable as return true;. Effectively the set of changes I propose (in both the old- and new-API versions of these classes): 1) FileInputFormat.isSplitable gets the implementation currently seen in TextInputFormat. 2) The isSplitable implementation is removed from KeyValueTextInputFormat, TextInputFormat and CombineFileInputFormat (useless code duplication). 3) The isSplitable implementation return true; is added to SequenceFileInputFormat. Given that you cannot gzip a SequenceFile, I expect this to be an optional fix.
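The preferred fix (moving TextInputFormat's codec check up into FileInputFormat, as proposed in the comment above) boils down to: look up the compression codec for the file name and report the file as splittable only when there is no codec, or the codec supports splitting. Real Hadoop resolves the codec through CompressionCodecFactory; the sketch below simulates that lookup with a suffix table so it runs without a Hadoop dependency, and the table entries are illustrative, not an exhaustive codec list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Self-contained sketch of the codec-aware isSplitable check that
// TextInputFormat performs and that this issue proposes moving up into
// FileInputFormat. Real Hadoop would call CompressionCodecFactory.getCodec(path)
// and inspect the codec; here a suffix table simulates that lookup.
public class SplitCheck {
    // Suffix -> is the codec splittable? Gzip and raw deflate are
    // stream-compressed and cannot be split; bzip2 supports splitting.
    private static final Map<String, Boolean> CODEC_SPLITTABLE = new LinkedHashMap<>();
    static {
        CODEC_SPLITTABLE.put(".gz", false);
        CODEC_SPLITTABLE.put(".deflate", false);
        CODEC_SPLITTABLE.put(".bz2", true);
    }

    public static boolean isSplitable(String filename) {
        for (Map.Entry<String, Boolean> e : CODEC_SPLITTABLE.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                // Compressed file: splittable only if the codec allows it.
                return e.getValue();
            }
        }
        // No codec matched: an uncompressed file can safely be split.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("access.log"));     // true
        System.out.println(isSplitable("access.log.gz"));  // false
        System.out.println(isSplitable("access.log.bz2")); // true
    }
}
```

With this default in FileInputFormat, a subclass such as SequenceFileInputFormat that knows its container is always splittable would simply override the method to return true, matching change 3) in the proposal above.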
[jira] Created: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour. --- Key: MAPREDUCE-2094 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Affects Versions: 0.21.0, 0.20.2, 0.20.1 Reporter: Niels Basjes When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety, producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method, which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does return false;. Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method (and therefore the entire FileInputFormat class) abstract. # Use a safe default (i.e. return false)
[jira] Updated: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Issue Type: Bug (was: Improvement)
[jira] Updated: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.
[ https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes updated MAPREDUCE-2094: Description: When implementing a custom derivative of FileInputFormat we ran into the effect that a large Gzipped input file would be processed several times. A near 1GiB file would be processed around 36 times in its entirety, producing garbage results and taking up a lot more CPU time than needed. It took a while to figure out, and what we found is that the default implementation of the isSplitable method in [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup ] is simply return true;. This is a very unsafe default and is in contradiction with the JavaDoc of the method, which states: Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be. The actual implementation effectively does: Is the given filename splitable? Always true, even if the file is stream compressed using an unsplittable compression codec. For our situation (where we always have Gzipped input) we took the easy way out and simply implemented an isSplitable in our class that does return false;. Now there are essentially 3 ways I can think of for fixing this (in order of what I would find preferable): # Implement something that looks at the compression used for the file (i.e. migrate the implementation from TextInputFormat to FileInputFormat). This would make the method do what the JavaDoc describes. # Force developers to think about it and make this method abstract. # Use a safe default (i.e. return false)