[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-06 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531362#comment-14531362
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-

I understand you want the error message to be 'clean'. Normally I would do that 
too.
However, this message can only appear if you are using (or have been using for a 
long time) a (usually custom) FileInputFormat that has been corrupting your 
results (perhaps even for years ... note that I created this bug report about 
4.5 years ago).
I think it is important to clarify the impact of the problem the author of the 
custom code introduced themselves, so they immediately understand what went 
wrong.
And yes... this message is a bit over the top ...

At least let's make the message clearer about what went wrong and make people 
aware of the historical implications.
How about {{"A split was attempted for a file that is being decompressed by " + 
codec.getClass().getSimpleName() + " which does not support splitting. Note 
that this would have corrupted the data in older Hadoop versions."}}

 org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
 unsafe default behaviour that is different from the documented behaviour.
 ---

 Key: MAPREDUCE-2094
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: Niels Basjes
Assignee: Niels Basjes
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-2094-2011-05-19.patch, 
 MAPREDUCE-2094-20140727-svn-fixed-spaces.patch, 
 MAPREDUCE-2094-20140727-svn.patch, MAPREDUCE-2094-20140727.patch, 
 MAPREDUCE-2094-2015-05-05-2328.patch, 
 MAPREDUCE-2094-FileInputFormat-docs-v2.patch


 When implementing a custom derivative of FileInputFormat we ran into the 
 effect that a large Gzipped input file would be processed several times: 
 a nearly 1 GiB file was processed around 36 times in its entirety, 
 producing garbage results and taking up far more CPU time than needed.
 It took a while to figure out, and what we found is that the default 
 implementation of the isSplittable method in 
 [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
 http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
  ] is simply {{return true;}}. 
 This is a very unsafe default and contradicts the JavaDoc of the method, 
 which states: "Is the given filename splitable? Usually, true, but if 
 the file is stream compressed, it will not be." The actual implementation 
 effectively does: "Is the given filename splitable? Always true, even if 
 the file is stream compressed using an unsplittable compression codec." 
 For our situation (where we always have Gzipped input) we took the easy 
 way out and simply implemented an isSplittable in our class that does 
 {{return false;}}. 
 Now there are essentially 3 ways I can think of to fix this (in order of 
 what I would find preferable):
 # Implement something that looks at the compression codec used for the 
 file (i.e. migrate the implementation from TextInputFormat to 
 FileInputFormat). This would make the method do what the JavaDoc describes.
 # Force developers to think about it by making this method abstract.
 # Use a safe default (i.e. {{return false;}})
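For illustration, option 1 amounts to deciding splittability from the file's compression codec, the way TextInputFormat already does. A minimal standalone sketch of that decision follows; the suffix table is a simplification I introduce here, whereas real Hadoop resolves the codec via CompressionCodecFactory and checks whether it implements SplittableCompressionCodec:

```java
import java.util.Map;

// Simplified model of a codec-aware isSplitable() default.
// The suffix->splittable table below stands in for Hadoop's
// CompressionCodecFactory lookup; it is illustrative, not the real API.
public class SplittableCheck {
    // true = the codec supports splitting; false = stream-compressed, unsplittable
    private static final Map<String, Boolean> CODEC_SPLITTABLE = Map.of(
        ".gz", false,   // gzip: whole-stream compression, cannot be split
        ".bz2", true    // bzip2: block-oriented, splittable
    );

    public static boolean isSplitable(String filename) {
        for (Map.Entry<String, Boolean> e : CODEC_SPLITTABLE.entrySet()) {
            if (filename.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        // No known compression suffix: treat as plain text, which is splittable.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("logs/huge.gz"));   // false
        System.out.println(isSplitable("logs/huge.bz2"));  // true
        System.out.println(isSplitable("logs/huge.txt"));  // true
    }
}
```

With a default like this, the gzipped-input case from the bug report would yield a single split instead of 36 overlapping ones.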



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-06 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531367#comment-14531367
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-

P.S. I'm unable to provide an updated patch for the next week or so, so please 
come up with a good message so that people can start catching this problem.



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-05 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:

Attachment: MAPREDUCE-2094-20140727-svn-fixed-spaces.patch

I removed the trailing spaces from the lines I touched.
However, one of the reported lines is a line I didn't touch in my patch.

Question: What should I do about that?



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-05 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:

Status: Open  (was: Patch Available)



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-05 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:

Status: Patch Available  (was: Open)

Removed the trailing spaces from the lines I touched.

@[~aw]: Apparently the trailing spaces check is also triggered by the trailing 
space in one of the 'surrounding' lines in the patch file. How should this be 
handled?



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-05 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:

Attachment: MAPREDUCE-2094-2015-05-05-2328.patch

This should work.



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2015-05-05 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:

Status: Patch Available  (was: Open)



[jira] [Commented] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075606#comment-14075606
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-

Question for [~gian], [~ggoodson] and others who have faced this problem.
Did you use the 'out of the box' LineRecordReader (or a subclass of that) as 
shown in https://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat 
or did you write something 'completely new'?

When I ran into this I followed the tutorial.

I think the 'next best' spot to catch most problem scenarios (i.e. everyone who 
implements it the way the tutorial shows) is to let the LineRecordReader fail 
the entire job when it is initialized with a file compressed by a non-splittable 
codec and the provided split is not the entire file.

What do you think?
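The proposed guard reduces to plain arithmetic on the split. A standalone sketch under my own illustrative names (the real check would live in LineRecordReader's initialize() and compare the split's start and length against the file length when the codec is not a SplittableCompressionCodec):

```java
// Models the proposed LineRecordReader guard: when the file's codec cannot
// be split, the only legal split is the one covering the whole file.
// Any other split means the file was split anyway, and processing it would
// corrupt the results, so we fail hard. Names are illustrative, not Hadoop API.
public class SplitGuard {
    public static void checkSplit(boolean codecSplittable,
                                  long splitStart, long splitLength,
                                  long fileLength) {
        if (!codecSplittable && (splitStart != 0 || splitLength != fileLength)) {
            throw new IllegalStateException(
                "File was split although its codec does not support splitting; "
                + "processing this split would corrupt the results.");
        }
    }

    public static void main(String[] args) {
        checkSplit(true, 128, 64, 1024);   // splittable codec: any split is fine
        checkSplit(false, 0, 1024, 1024);  // unsplittable, but the whole file: fine
        try {
            checkSplit(false, 512, 512, 1024); // unsplittable + partial split
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The key property is that the first (and only) whole-file split still passes, so existing correct jobs keep running; only the corruption scenario is rejected.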



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Attachment: MAPREDUCE-2094-20140727.patch

This is the patch I created to throw an exception in the most common case where 
the described problem occurs.
So this is a "fail hard when things already went wrong" patch.

NOTE 1: This patch includes all the JavaDoc improvements created by [~gian].
NOTE 2: To create a unit test I needed to include a small gzipped file in the 
patch. I hope I did that the right way.



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Release Note: Throw an Exception in the most common error scenario present 
in many FileInputFormat derivatives that do not override isSplitable.   (was: 
Fixed splitting errors present in many FileInputFormat derivatives that do not 
override isSplitable. )
  Status: Patch Available  (was: Open)

This fix does not solve the problem in all scenarios, as that rabbit hole 
proved to be too deep.

It does, however, fix all scenarios where a new FileInputFormat is created that 
also uses the LineRecordReader (or a subclass) and forgets to implement the 
correct isSplitable. This is exactly what happens if someone follows the 
example from the Yahoo Hadoop tutorial ( 
https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat ) to create 
their own input format.

This patch makes the job fail when it detects the corruption scenario.

I haven't tried it yet, but with this patch, if you feed a large Gzipped file 
into the software from the mentioned tutorial, it should now fail the entire 
job hard the moment it creates a second split.

NOTE 1: This patch includes all the JavaDoc improvements created by [~gian].
NOTE 2: To create a unit test I needed to include a small gzipped file in the 
patch. I hope I did that the right way.



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Status: Open  (was: Patch Available)

There was a problem with the binary file in the patch file.

 org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
 unsafe default behaviour that is different from the documented behaviour.
 ---

 Key: MAPREDUCE-2094
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Reporter: Niels Basjes
Assignee: Niels Basjes
 Attachments: MAPREDUCE-2094-2011-05-19.patch, 
 MAPREDUCE-2094-20140727.patch, MAPREDUCE-2094-FileInputFormat-docs-v2.patch


 When implementing a custom derivative of FileInputFormat we ran into the 
 effect that a large Gzipped input file would be processed several times. 
 A near 1GiB file would be processed around 36 times in its entirety, thus 
 producing garbage results and taking up far more CPU time than needed.
 It took a while to figure out, and what we found is that the default 
 implementation of the isSplitable method in 
 [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
 http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
  ] is simply {{return true;}}. 
 This is a very unsafe default and contradicts the JavaDoc of the 
 method, which states: "Is the given filename splitable? Usually, true, but if 
 the file is stream compressed, it will not be." The actual implementation 
 effectively does "Is the given filename splitable? Always true, even if the 
 file is stream compressed using an unsplittable compression codec." 
 For our situation (where we always have Gzipped input) we took the easy way 
 out and simply implemented an isSplitable in our class that does {{return 
 false;}}. 
 Now there are essentially 3 ways I can think of to fix this (in order of 
 what I would find preferable):
 # Implement something that looks at the compression codec used for the file (i.e. 
 migrate the implementation from TextInputFormat to FileInputFormat). This 
 would make the method do what the JavaDoc describes.
 # Force developers to think about it by making this method abstract.
 # Use a safe default (i.e. return false).





[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Attachment: MAPREDUCE-2094-20140727-svn.patch

Created a new patch file via the svn route.
In this patch the binary file is now an ASCII file. This is possible because the 
content of the file is irrelevant; only the filename and the size ( 3 bytes) 
matter.



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2014-07-27 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Status: Patch Available  (was: Open)

This patch no longer contains the binary file.



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-18 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035604#comment-14035604
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

I changed some of the memory settings and now the job completes successfully.
This was with mapreduce.job.reduce.slowstart.completedmaps at its default 
value of 0.05 (5%).
Apparently, without the blacklisted node it works fine and the fractional memory 
shenanigans do not impact the job.
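To make the "fractional memory" remark concrete, here is a sketch of the arithmetic, under my (unverified) assumption that the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb:

```java
public class ContainerRounding {
    // Round a memory request up to the next multiple of the minimum allocation.
    static int roundUp(int requestMb, int minAllocMb) {
        return ((requestMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    public static void main(String[] args) {
        int minAllocMb = 250;   // yarn.scheduler.minimum-allocation-mb
        int nodeMemMb  = 2200;  // yarn.nodemanager.resource.memory-mb
        int taskMemMb  = 512;   // mapreduce.{map,reduce}.memory.mb

        int granted = roundUp(taskMemMb, minAllocMb);  // 512 MB rounds up to 750 MB
        int perNode = nodeMemMb / granted;             // only 2 containers fit per node
        System.out.println(granted + " MB per container, " + perNode + " per node");
    }
}
```

If this assumption holds, each 512 MB request actually consumes 750 MB, which is how a cluster that looks like it has room can still have every slot taken by reducers.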

 Deadlock allocating containers for mappers and reducers
 ---

 Key: MAPREDUCE-5928
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5928
 Project: Hadoop Map/Reduce
  Issue Type: Bug
 Environment: Hadoop 2.4.0 (as packaged by HortonWorks in HDP 2.1.2)
Reporter: Niels Basjes
 Attachments: AM-MR-syslog - Cleaned.txt.gz, Cluster fully 
 loaded.png.jpg, MR job stuck in deadlock.png.jpg


 I have a small cluster consisting of 8 desktop-class systems (1 master + 7 
 workers).
 Due to the small memory of these systems I configured yarn as follows:
 {quote}
 yarn.nodemanager.resource.memory-mb = 2200
 yarn.scheduler.minimum-allocation-mb = 250
 {quote}
 On my client I did
 {quote}
 mapreduce.map.memory.mb = 512
 mapreduce.reduce.memory.mb = 512
 {quote}
 Now I run a job with 27 mappers and 32 reducers.
 After a while I saw this deadlock occur:
 - All nodes had been filled to their maximum capacity with reducers.
 - 1 Mapper was waiting for a container slot to start in.
 I tried killing reducer attempts but that didn't help (new reducer attempts 
 simply took the existing container).
 *Workaround*:
 I set this value from my job. The default value is 0.05 (= 5%)
 {quote}
 mapreduce.job.reduce.slowstart.completedmaps = 0.99f
 {quote}





[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-18 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035819#comment-14035819
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

Reading through the description of YARN-1680, it sure seems to be the root cause of 
my problem.
So yes, go ahead and mark this one as a duplicate.



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-17 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033614#comment-14033614
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

I have trouble finding the spot where this 500 MB per container is defined.
Can you point me to where this is specified?




[jira] [Created] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)
Niels Basjes created MAPREDUCE-5928:
---

 Summary: Deadlock allocating containers for mappers and reducers
 Key: MAPREDUCE-5928
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5928
 Project: Hadoop Map/Reduce
  Issue Type: Bug
 Environment: Hadoop 2.4.0 (as packaged by HortonWorks in HDP 2.1.2)
Reporter: Niels Basjes




[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-5928:


Attachment: MR job stuck in deadlock.png.jpg



[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-5928:


Attachment: Cluster fully loaded.png.jpg

NOTE: Node2 had issues, so the system took it offline (0 containers). 
Perhaps this is what confused the MapReduce application?



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032476#comment-14032476
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

I'm not the only one who ran into this: 
http://hortonworks.com/community/forums/topic/mapreduce-race-condition-big-job/



[jira] [Updated] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-5928:


Attachment: AM-MR-syslog - Cleaned.txt.gz

I downloaded the Application Master log and attached it to this issue. (I 
changed the domain name of the nodes.) 



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032516#comment-14032516
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

I have not actively configured any scheduling,
so I guess it is running with the 'default' setting?




[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032564#comment-14032564
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

I took the 'dead' node (node2) offline (completely stopped all Hadoop/YARN 
related daemons) and ran the same job again after it had disappeared from all 
overviews.
Now it does complete all mappers.



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032766#comment-14032766
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

Where/how can I determine for sure whether the capacity scheduler is used?



[jira] [Commented] (MAPREDUCE-5928) Deadlock allocating containers for mappers and reducers

2014-06-16 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032831#comment-14032831
 ] 

Niels Basjes commented on MAPREDUCE-5928:
-

Confirmed: it is using the CapacityScheduler:
{code}
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <source>yarn-default.xml</source>
</property>
{code}

I'm going to fiddle with the memory settings tomorrow.



[jira] [Created] (MAPREDUCE-5925) NLineInputFormat silently produces garbage on gzipped input

2014-06-13 Thread Niels Basjes (JIRA)
Niels Basjes created MAPREDUCE-5925:
---

 Summary: NLineInputFormat silently produces garbage on gzipped 
input
 Key: MAPREDUCE-5925
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5925
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Niels Basjes
Priority: Critical


[ Found while investigating the impact of MAPREDUCE-2094 ]

The org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (and probably the mapred 
version too) only makes sense for splittable files.

This input format uses the isSplitable method from its superclass FileInputFormat 
(which always returns true) in combination with the LineRecordReader.

When you provide it with a gzipped file (non-splittable compression) it will create 
multiple splits (isSplitable == true), yet the LineRecordReader cannot handle 
a gzipped file in multiple splits because the GzipCodec does not support this.

Overall effect is that you get incorrect results.

Proposed solution: add detection for this kind of scenario and let the 
NLineInputFormat fail hard when someone tries this. 

I'm not sure whether this should go into the LineRecordReader or only into the 
NLineInputFormat.
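A rough sketch of what such a fail-hard check could look like (self-contained plain Java; checkSplit and its parameters are hypothetical, standing in for whatever hook in the record reader's initialization would carry the codec's splittability and the split offset):

```java
public class SplitGuard {
    // Hypothetical guard: fail hard instead of silently mis-reading a
    // non-splittable (e.g. gzip) file that was handed out as multiple splits.
    static void checkSplit(String filename, long splitStart, boolean codecSplittable) {
        if (!codecSplittable && splitStart != 0) {
            throw new IllegalStateException(
                "File " + filename + " is compressed with a non-splittable codec"
                + " but a split starting at offset " + splitStart + " was requested;"
                + " the results would be garbage.");
        }
    }

    public static void main(String[] args) {
        // First split of a gzip file starts at offset 0: allowed.
        checkSplit("data.gz", 0, false);
        // Any later split of the same gzip file: rejected loudly.
        try {
            checkSplit("data.gz", 67_108_864L, false);
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Failing loudly at initialization is the point: the current behaviour produces plausible-looking but incorrect output, which is far harder to notice.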









[jira] [Commented] (MAPREDUCE-346) Report Map-Reduce Framework Counters in pipeline order

2012-12-29 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540994#comment-13540994
 ] 

Niels Basjes commented on MAPREDUCE-346:


I just ran the wordcount example from the current trunk and the output looks 
like this:
{code}
File System Counters
FILE: Number of bytes read=1355162
FILE: Number of bytes written=4112351
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=1082
Map output records=3884
Map output bytes=53040
Map output materialized bytes=42292
Input split bytes=2785
Combine input records=3884
Combine output records=2263
Reduce input groups=990
Reduce shuffle bytes=0
Reduce input records=2263
Reduce output records=990
Spilled Records=5299
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=677
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=8783134720
File Input Format Counters 
Bytes Read=38869
File Output Format Counters 
Bytes Written=21444
{code}

To me the order of a few items in this list appears out of place: I would expect 
Shuffled Maps, Failed Shuffles and Merged Map outputs somewhere between 
the map and the reduce counters instead of after the reduce.

Other than that (and the trailing space at the end of Shuffled Maps) I 
propose we close this improvement as Fixed.


 Report Map-Reduce Framework Counters in pipeline order
 --

 Key: MAPREDUCE-346
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-346
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
 Attachments: 5216_v1.patch


 Currently there is no order in which counters are printed. It would be more 
 user friendly if Map-Reduce Framework counters are reported in the pipeline 
 order.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2011-05-19 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Attachment: MAPREDUCE-2094-2011-05-19.patch

I've created a patch that, in my mind, fixes this issue the correct way. Whether this 
is really the case, I am very open to discussion.

There are basically 4 current situations:
# People use the existing FileInputFormat derivatives. This patch ensures that 
all of those that are present in the existing code base (including the 
examples) still retain the same behavior as before.
# People have created a new derivative and HAVE overridden isSplitable with 
something that fits their needs. This patch does not change those situations.
# People have created a new derivative and have *NOT* overridden isSplitable. 
## If their input is in a splittable form (like LZO or uncompressed); then this 
patch will not affect them
## In the situation where they have big non-splittable input files (like gzip 
files) they will have run into unexpected errors. Possibly they will have spent 
a lot of time looking for performance issues and wrong results in production 
that did not occur during unit testing (we did!). This patch will fix this 
problem without any code changes in their code base.
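To illustrate the behaviour the patch aims for, here is a minimal, self-contained sketch of a codec-aware splittability check. The class name, the suffix-based lookup, and the codec table are my own simplifications: the real Hadoop code consults CompressionCodecFactory and checks for SplittableCompressionCodec instead of file suffixes.

```java
import java.util.Map;

public class SplitCheck {
    // Toy codec registry: file suffix -> does that codec support splitting?
    // gzip does not; bzip2 does. Anything without a listed suffix is treated
    // as uncompressed. Stands in for Hadoop's CompressionCodecFactory.
    private static final Map<String, Boolean> CODEC_SPLITTABLE =
            Map.of(".gz", false, ".bz2", true);

    // Mirrors the patched default: uncompressed input may be split,
    // compressed input may be split only when its codec supports it.
    public static boolean isSplitable(String fileName) {
        for (Map.Entry<String, Boolean> e : CODEC_SPLITTABLE.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return true; // no codec matched: plain file, safe to split
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("access_log"));     // true
        System.out.println(isSplitable("access_log.gz"));  // false
        System.out.println(isSplitable("access_log.bz2")); // true
    }
}
```

With this default in place, situation 3.2 above (a derivative that never overrides the method and reads gzipped input) stops splitting silently, without any change to the derivative itself.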

 org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
 unsafe default behaviour that is different from the documented behaviour.
 ---

 Key: MAPREDUCE-2094
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.20.1, 0.20.2, 0.21.0
Reporter: Niels Basjes
 Attachments: MAPREDUCE-2094-2011-05-19.patch


 When implementing a custom derivative of FileInputFormat we ran into the 
 effect that a large Gzipped input file would be processed several times. 
 A near 1GiB file would be processed around 36 times in its entirety. Thus 
 producing garbage results and taking up a lot more CPU time than needed.
 It took a while to figure out and what we found is that the default 
 implementation of the isSplittable method in 
 [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
 http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
  ] is simply return true;. 
 This is a very unsafe default and is in contradiction with the JavaDoc of the 
 method which states: Is the given filename splitable? Usually, true, but if 
 the file is stream compressed, it will not be.  . The actual implementation 
 effectively does Is the given filename splitable? Always true, even if the 
 file is stream compressed using an unsplittable compression codec. 
 For our situation (where we always have Gzipped input) we took the easy way 
 out and simply implemented an isSplittable in our class that does return 
 false; 
 Now there are essentially 3 ways I can think of for fixing this (in order of 
 what I would find preferable):
 # Implement something that looks at the used compression of the file (i.e. do 
 migrate the implementation from TextInputFormat to FileInputFormat). This 
 would make the method do what the JavaDoc describes.
 # Force developers to think about it and make this method abstract.
 # Use a safe default (i.e. return false)



[jira] [Updated] (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2011-05-19 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


 Assignee: Niels Basjes
Affects Version/s: (was: 0.20.2)
   (was: 0.20.1)
   (was: 0.21.0)
 Release Note: Fixed splitting errors present in many FileInputFormat 
derivatives that do not override isSplitable. 
   Status: Patch Available  (was: Open)

I've not marked this as an Incompatible change because the behaviour is only 
changed in the situations where there was an error condition.




[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2010-10-07 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918851#action_12918851
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-

I just noticed that the Yahoo Hadoop tutorial [Module 5: Advanced MapReduce 
Features|http://developer.yahoo.com/hadoop/tutorial/module5.html] shows a 
code example for defining your own 
[FileInputFormat|http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat].
The example code shown implements a derivative using FileInputFormat and 
LineRecordReader without overriding isSplittable ... I expect this tutorial 
code to lead people into this bug.

Since this bug only becomes apparent when using large non-splittable 
(gzipped) input files, it is also important to notice that almost no one will 
have a (unit) test that will trip on this bug.
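The pitfall in that tutorial pattern, and the defensive workaround mentioned in this report, can be sketched in a self-contained toy model (class names are illustrative, not the real Hadoop API):

```java
// The base class default answers "splittable", so a derivative that does
// not override it silently inherits unsafe behaviour for gzipped input.
abstract class ToyFileInputFormat {
    // Unsafe default, mirroring FileInputFormat before the fix.
    protected boolean isSplitable(String file) {
        return true;
    }
}

// Tutorial-style derivative: no override, so even .gz files get split.
class NaiveFormat extends ToyFileInputFormat {
}

// The defensive workaround from the bug report: hard-code false.
class GzipSafeFormat extends ToyFileInputFormat {
    @Override
    protected boolean isSplitable(String file) {
        return false;
    }
}

public class Pitfall {
    public static void main(String[] args) {
        System.out.println(new NaiveFormat().isSplitable("big.gz"));    // true: the bug
        System.out.println(new GzipSafeFormat().isSplitable("big.gz")); // false: safe
    }
}
```

Nothing fails at compile time or in small-file unit tests for NaiveFormat; the corruption only shows up once a large gzipped file is split into multiple map tasks.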


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2010-09-29 Thread Niels Basjes (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916076#action_12916076
 ] 

Niels Basjes commented on MAPREDUCE-2094:
-

bq. Changing the default implementation to return false would be an 
incompatible change, potentially breaking existing subclasses. 

If by breaking you mean some subclasses will see an unexpected performance 
degradation, then yes, that will most likely occur (the first one I think of is 
SequenceFileInputFormat).
I do, however, not expect any functional breaking of the output of these 
subclasses.

bq. Making the method abstract would also be incompatible and break subclasses, 
but in a way that they'd easily detect. 

Yes. 
The downside of this option is that if subclasses want to have detection 
depending on the compression I expect a lot of code duplication to occur. 
This code duplication is already occurring within the main code base in 
KeyValueTextInputFormat , TextInputFormat, CombineFileInputFormat and their old 
API counterparts (I found a total of 5 duplicates of the same isSplittable 
implementation).

bq. Perhaps the javadoc should just be clarified to better document this 
default?

Definitely an option. However this would not fix the effect in the existing 
subclasses.

I just did a quick manual code check of the current trunk and I found that the 
following classes are derived from FileInputFormat yet do not implement the 
isSplittable method (and thus use the default return true;).
* ./src/java/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java
* 
./src/contrib/streaming/src/java/org/apache/hadoop/streaming/AutoInputFormat.java
* ./src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java

I expect that NLineInputFormat and AutoInputFormat will be affected by 
this large-gzip bug.
So I expect that simply fixing the isSplittable documentation would lead to the 
need to fix *at least* these two classes.

As far as I understand the SequenceFileInputFormat can only be compressed using 
a splittable compression, so the return true; from FileInputFormat will 
work fine there.

Overall I still prefer the clean option of  returning the correct value 
depending on the compression. That would effectively leave the behavior in most 
use cases unchanged. Yet in those cases where splitting is known to cause 
problems it would avoid those problems. Thus avoiding major issues like the 
ones we had and described in HADOOP-6901.
For the SequenceFileInputFormat it may be necessary to implement isSplittable 
as return true;

Effectively the set of changes I propose (in both the old and new API versions 
of these classes):
1) FileInputFormat.isSplittable gets the implementation as seen in 
TextInputFormat
2) The isSplittable implementation is removed from  KeyValueTextInputFormat , 
TextInputFormat, CombineFileInputFormat (useless code duplication)
3) The isSplittable implementation return true is added to 
SequenceFileInputFormat. Given the fact that you cannot gzip a sequencefile I 
expect this to be an optional fix.
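The three proposed changes can be modelled in one self-contained sketch. The shared check lives once in the base class (replacing the five duplicated copies), and a SequenceFile-style format opts back in to splitting. Class names and the suffix check are my own simplifications of the real Hadoop codec lookup.

```java
abstract class BaseFormat {
    // Change 1: a single shared default, replacing duplicated overrides.
    // Split unless the file uses a non-splittable codec (only gzip modelled).
    protected boolean isSplitable(String file) {
        return !file.endsWith(".gz");
    }
}

// Change 2: text-style formats simply inherit; no duplicated override.
class ToyTextFormat extends BaseFormat {
}

// Change 3: sequence files use internal block compression, so they are
// always splittable regardless of any file-level codec.
class ToySequenceFileFormat extends BaseFormat {
    @Override
    protected boolean isSplitable(String file) {
        return true;
    }
}

public class Proposal {
    public static void main(String[] args) {
        System.out.println(new ToyTextFormat().isSplitable("input.gz"));         // false
        System.out.println(new ToySequenceFileFormat().isSplitable("data.seq")); // true
    }
}
```

In this shape, most use cases see no behavioural change; only the cases where splitting is known to corrupt results behave differently.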



[jira] Created: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2010-09-28 Thread Niels Basjes (JIRA)
org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
unsafe default behaviour that is different from the documented behaviour.
---

 Key: MAPREDUCE-2094
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.21.0, 0.20.2, 0.20.1
Reporter: Niels Basjes


When implementing a custom derivative of FileInputFormat we ran into the effect 
that a large Gzipped input file would be processed several times. A near 1GiB 
file would be processed around 36 times in its entirety. Thus producing garbage 
results and taking up a lot more CPU time than needed.

It took a while to figure out and what we found is that the default 
implementation of the isSplittable method in 
[org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
 ] is simply return true;. This is a very unsafe default and is in 
contradiction with the JavaDoc of the method which states: Is the given 
filename splitable? Usually, true, but if the file is stream compressed, it 
will not be.  

For our situation (where we always have Gzipped input) we took the easy way out 
and simply implemented an isSplittable in our class that does return false; 

Now there are essentially 3 ways I can think of for fixing this (in order of 
what I would find preferable):
# Implement something that looks at the used compression of the file (i.e. do 
migrate the implementation from TextInputFormat to FileInputFormat). This would 
make the method do what the JavaDoc describes.
# Force developers to think about it and make this method (and therefore the 
entire FileInputFormat class) abstract.
# Use a safe default (i.e. return false)




[jira] Updated: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2010-09-28 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Issue Type: Bug  (was: Improvement)





[jira] Updated: (MAPREDUCE-2094) org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements unsafe default behaviour that is different from the documented behaviour.

2010-09-28 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes updated MAPREDUCE-2094:


Description: 
When implementing a custom derivative of FileInputFormat we ran into the effect 
that a large Gzipped input file would be processed several times. 

A near 1GiB file would be processed around 36 times in its entirety. Thus 
producing garbage results and taking up a lot more CPU time than needed.

It took a while to figure out and what we found is that the default 
implementation of the isSplittable method in 
[org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
 ] is simply return true;. 

This is a very unsafe default and is in contradiction with the JavaDoc of the 
method which states: Is the given filename splitable? Usually, true, but if 
the file is stream compressed, it will not be.  . The actual implementation 
effectively does Is the given filename splitable? Always true, even if the 
file is stream compressed using an unsplittable compression codec. 

For our situation (where we always have Gzipped input) we took the easy way out 
and simply implemented an isSplittable in our class that does return false; 

Now there are essentially 3 ways I can think of for fixing this (in order of 
what I would find preferable):
# Implement something that looks at the used compression of the file (i.e. do 
migrate the implementation from TextInputFormat to FileInputFormat). This would 
make the method do what the JavaDoc describes.
# Force developers to think about it and make this method abstract.
# Use a safe default (i.e. return false)

  was:
When implementing a custom derivative of FileInputFormat we ran into the effect 
that a large Gzipped input file would be processed several times. A near 1GiB 
file would be processed around 36 times in its entirety. Thus producing garbage 
results and taking up a lot more CPU time than needed.

It took a while to figure out and what we found is that the default 
implementation of the isSplittable method in 
[org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
 ] is simply return true;. This is a very unsafe default and is in 
contradiction with the JavaDoc of the method which states: Is the given 
filename splitable? Usually, true, but if the file is stream compressed, it 
will not be.  

For our situation (where we always have Gzipped input) we took the easy way out 
and simply implemented an isSplittable in our class that does return false; 

Now there are essentially 3 ways I can think of for fixing this (in order of 
what I would find preferable):
# Implement something that looks at the used compression of the file (i.e. do 
migrate the implementation from TextInputFormat to FileInputFormat). This would 
make the method do what the JavaDoc describes.
# Force developers to think about it and make this method (and therefore the 
entire FileInputFormat class) abstract.
# Use a safe default (i.e. return false)


 org.apache.hadoop.mapreduce.lib.input.FileInputFormat: isSplitable implements 
 unsafe default behaviour that is different from the documented behaviour.
 ---

 Key: MAPREDUCE-2094
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2094
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 0.20.1, 0.20.2, 0.21.0
Reporter: Niels Basjes

 When implementing a custom derivative of FileInputFormat we ran into the 
 effect that a large Gzipped input file would be processed several times. 
 A near 1GiB file would be processed around 36 times in its entirety. Thus 
 producing garbage results and taking up a lot more CPU time than needed.
 It took a while to figure out and what we found is that the default 
 implementation of the isSplittable method in 
 [org.apache.hadoop.mapreduce.lib.input.FileInputFormat | 
 http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java?view=markup
  ] is simply return true;. 
 This is a very unsafe default and is in contradiction with the JavaDoc of the 
 method which states: Is the given filename splitable? Usually, true, but if 
 the file is stream compressed, it will not be.  . The actual implementation 
 effectively does Is the given filename splitable? Always true, even if the 
 file is stream compressed using an unsplittable compression codec. 
 For our situation (where we always have Gzipped input) we took the easy way 
 out and