[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2013-02-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585713#comment-13585713
 ] 

Hadoop QA commented on MAPREDUCE-4502:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12570735/MAPREDUCE-4502.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

  {color:red}-1 one of tests included doesn't have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//console

This message is automatically generated.

 Multi-level aggregation with combining the result of maps per node/rack
 ---

 Key: MAPREDUCE-4502
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4502
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: applicationmaster, mrv2
Affects Versions: 3.0.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: design_v2.pdf, MAPREDUCE-4502.1.patch, 
 MAPREDUCE-4502.2.patch, MAPREDUCE-4502.3.patch, MAPREDUCE-4502.4.patch, 
 MAPREDUCE-4502.5.patch, MAPREDUCE-4525-pof.diff, speculative_draft.pdf


 The shuffle cost is expensive in Hadoop in spite of the existence of the 
 combiner, because the scope of combining is limited to a single MapTask. To 
 solve this problem, a good approach is to aggregate the results of maps per 
 node/rack by launching combiners.
 This JIRA is to implement the multi-level aggregation infrastructure, 
 including combining per container (MAPREDUCE-3902 is related) and 
 coordinating containers by the application master without breaking the fault 
 tolerance of jobs.
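
For context, a minimal sketch of how a combiner is wired up today, using the
stock WordCount classes from org.apache.hadoop.examples (an illustration, not
part of this JIRA's patches); the combine function runs over a single
MapTask's output only, which is exactly the scope this JIRA proposes to widen
to node/rack level:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerScopeExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combiner scope example");
    job.setJarByClass(CombinerScopeExample.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // Today this combiner is applied to each MapTask's buffer in isolation;
    // MAPREDUCE-4502 proposes re-running the same function across the
    // outputs of all maps on a node or rack before the shuffle.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}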

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2013-02-25 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated MAPREDUCE-4502:
--

Attachment: MAPREDUCE-4502.6.patch

Oops, I attached the wrong patch. This is the correct one.

 Multi-level aggregation with combining the result of maps per node/rack
 ---

 Key: MAPREDUCE-4502
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4502
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: applicationmaster, mrv2
Affects Versions: 3.0.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: design_v2.pdf, MAPREDUCE-4502.1.patch, 
 MAPREDUCE-4502.2.patch, MAPREDUCE-4502.3.patch, MAPREDUCE-4502.4.patch, 
 MAPREDUCE-4502.5.patch, MAPREDUCE-4502.6.patch, MAPREDUCE-4525-pof.diff, 
 speculative_draft.pdf


 The shuffle cost is expensive in Hadoop in spite of the existence of the 
 combiner, because the scope of combining is limited to a single MapTask. To 
 solve this problem, a good approach is to aggregate the results of maps per 
 node/rack by launching combiners.
 This JIRA is to implement the multi-level aggregation infrastructure, 
 including combining per container (MAPREDUCE-3902 is related) and 
 coordinating containers by the application master without breaking the fault 
 tolerance of jobs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce

2013-02-25 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated MAPREDUCE-5025:
--

Attachment: (was: MAPREDUCE-5025.patch)

 Key Distribution and Management for supporting crypto codec in Map Reduce
 -

 Key: MAPREDUCE-5025
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5025
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: security
Affects Versions: trunk
Reporter: Jerry Chen
Assignee: Jerry Chen
   Original Estimate: 504h
  Remaining Estimate: 504h

 This task defines the work to enable Map Reduce to utilize the Crypto Codec 
 framework to support encryption and decryption of data during a MapReduce job.
 According to some real use cases and discussions from the community, for 
 encrypting and decrypting files in Map Reduce we have the following 
 requirements:
   1. Different stages (input, output, intermediate output) should have the 
 flexibility to choose whether to encrypt or not, as well as which crypto 
 codec to use.
   2. Different stages may have different schemes of providing the keys.
   3. Different files (for example, different input files) may have or use 
 different keys.
   4. Support a flexible way of retrieving keys for encryption or decryption.
 So this task defines and provides the framework for supporting these 
 requirements, as well as the implementations for common use and key-retrieval 
 scenarios.
 The design document for this part is included in the Hadoop Crypto Design 
 attached to HADOOP-9331.
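
For illustration only, a sketch of what per-stage codec selection
(requirement 1) and pluggable key retrieval (requirement 4) might look like;
every property name below is hypothetical, invented for this sketch rather
than taken from any committed API:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class CryptoStageConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical keys: each stage opts in to encryption independently and
    // names its codec (requirements 1 and 2).
    conf.setBoolean("mapreduce.job.crypto.input.enabled", false);
    conf.setBoolean("mapreduce.job.crypto.intermediate.enabled", true);
    conf.setBoolean("mapreduce.job.crypto.output.enabled", true);
    conf.set("mapreduce.job.crypto.output.codec", "org.example.AesCodec");
    // Per-file keys (requirement 3) would come from a pluggable key provider
    // (requirement 4) rather than from inline configuration values.
    conf.set("mapreduce.job.crypto.key.provider",
        "org.example.KeyStoreKeyProvider");
  }
}
{code}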

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce

2013-02-25 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated MAPREDUCE-5025:
--

Attachment: MAPREDUCE-5025.patch

 Key Distribution and Management for supporting crypto codec in Map Reduce
 -

 Key: MAPREDUCE-5025
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5025
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: security
Affects Versions: trunk
Reporter: Jerry Chen
Assignee: Jerry Chen
 Attachments: MAPREDUCE-5025.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 This task defines the work to enable Map Reduce to utilize the Crypto Codec 
 framework to support encryption and decryption of data during a MapReduce job.
 According to some real use cases and discussions from the community, for 
 encrypting and decrypting files in Map Reduce we have the following 
 requirements:
   1. Different stages (input, output, intermediate output) should have the 
 flexibility to choose whether to encrypt or not, as well as which crypto 
 codec to use.
   2. Different stages may have different schemes of providing the keys.
   3. Different files (for example, different input files) may have or use 
 different keys.
   4. Support a flexible way of retrieving keys for encryption or decryption.
 So this task defines and provides the framework for supporting these 
 requirements, as well as the implementations for common use and key-retrieval 
 scenarios.
 The design document for this part is included in the Hadoop Crypto Design 
 attached to HADOOP-9331.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack

2013-02-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585765#comment-13585765
 ] 

Hadoop QA commented on MAPREDUCE-4502:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12570748/MAPREDUCE-4502.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 tests included appear to have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3358//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3358//console

This message is automatically generated.

 Multi-level aggregation with combining the result of maps per node/rack
 ---

 Key: MAPREDUCE-4502
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4502
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: applicationmaster, mrv2
Affects Versions: 3.0.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: design_v2.pdf, MAPREDUCE-4502.1.patch, 
 MAPREDUCE-4502.2.patch, MAPREDUCE-4502.3.patch, MAPREDUCE-4502.4.patch, 
 MAPREDUCE-4502.5.patch, MAPREDUCE-4502.6.patch, MAPREDUCE-4525-pof.diff, 
 speculative_draft.pdf


 The shuffle cost is expensive in Hadoop in spite of the existence of the 
 combiner, because the scope of combining is limited to a single MapTask. To 
 solve this problem, a good approach is to aggregate the results of maps per 
 node/rack by launching combiners.
 This JIRA is to implement the multi-level aggregation infrastructure, 
 including combining per container (MAPREDUCE-3902 is related) and 
 coordinating containers by the application master without breaking the fault 
 tolerance of jobs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4974) Optimising the LineRecordReader initialize() method

2013-02-25 Thread Arun A K (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585797#comment-13585797
 ] 

Arun A K commented on MAPREDUCE-4974:
-

As [~gelesh] has mentioned, we had in mind the elimination of repeated null 
checks while trying to optimize the code. If it is not of much significance, 
please go ahead with the latest available patch containing the rest of the 
changes.

 Optimising the LineRecordReader initialize() method
 ---

 Key: MAPREDUCE-4974
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4974
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv1, mrv2, performance
Affects Versions: 2.0.2-alpha, 0.23.5
 Environment: Hadoop Linux
Reporter: Arun A K
Assignee: Gelesh
  Labels: patch, performance
 Attachments: MAPREDUCE-4974.1.patch, MAPREDUCE-4974.2.patch, 
 MAPREDUCE-4974.3.patch, MAPREDUCE-4974.4.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I found there is scope for optimizing the code in initialize() if we have 
 compressionCodecs and codec instantiated only when the input is compressed.
 Meanwhile, Gelesh George Omathil added that we could avoid the null check of 
 key and value. This would save time, since the null check is done for every 
 next key/value generation. The intention is to instantiate only once and 
 avoid NPE as well. Hopefully both could be met if we initialize key and value 
 in the initialize() method. We both have worked on it.
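
For reference, a hedged sketch of that idea (a simplified stand-in, not the
actual LineRecordReader or the attached patches): create the key and value
objects once during initialization, so the per-record path needs no null
checks:

{code:java}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Sketch only: the real reader lazily creates key/value inside
// nextKeyValue() behind null checks; this variant creates them up front.
public class EagerInitReaderSketch {
  private LongWritable key;
  private Text value;
  private long pos;

  public void initialize() {
    key = new LongWritable();   // created once, reused for every record
    value = new Text();
    pos = 0;
  }

  public boolean nextKeyValue(String line) {
    if (line == null) {
      return false;             // end of input
    }
    key.set(pos);               // no null check: initialize() ran first
    value.set(line);
    pos += line.length() + 1;   // assume a one-byte newline for this sketch
    return true;
  }
}
{code}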

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5006) streaming tests failing

2013-02-25 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585963#comment-13585963
 ] 

Tom White commented on MAPREDUCE-5006:
--

It's the current behaviour - in branch-1 as well - so if we want to change it, 
we should do it in a compatible way, e.g. with mapred.local.job.maps as you 
suggested. That should be a different JIRA though, and this one should fix the 
tests by reverting the relevant part of MAPREDUCE-4994.

 streaming tests failing
 ---

 Key: MAPREDUCE-5006
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5006
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 2.0.4-beta
Reporter: Alejandro Abdelnur
Assignee: Sandy Ryza
 Attachments: MAPREDUCE-5006.patch


 The following 2 tests are failing in trunk
 * org.apache.hadoop.streaming.TestStreamReduceNone
 * org.apache.hadoop.streaming.TestStreamXmlRecordReader

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Moved] (MAPREDUCE-5026) For shortening the time of TaskTracker heartbeat, decouple the statics collection operations

2013-02-25 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang moved HDFS-4527 to MAPREDUCE-5026:
--

          Component/s: (was: performance)
                       tasktracker
                       performance
        Fix Version/s: (was: 1.1.1)
                       1.1.1
     Target Version/s: (was: 1.1.1)
    Affects Version/s: (was: 1.1.1)
                       1.1.1
                  Key: MAPREDUCE-5026  (was: HDFS-4527)
              Project: Hadoop Map/Reduce  (was: Hadoop HDFS)

 For shortening the time of TaskTracker heartbeat, decouple the statics 
 collection operations
 

 Key: MAPREDUCE-5026
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5026
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: performance, tasktracker
Affects Versions: 1.1.1
Reporter: sam liu
  Labels: patch
 Fix For: 1.1.1

 Attachments: HDFS-4527.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In each heartbeat of the TaskTracker, it calculates some system statistics, 
 like the free disk space, available virtual/physical memory, CPU usage, etc. 
 However, it's not necessary to calculate all the statistics in every 
 heartbeat, and doing so consumes many system resources and impacts the 
 performance of the TaskTracker heartbeat. Furthermore, the characteristics of 
 the system properties (disk, memory, CPU) are different, and it's better to 
 collect their statistics at different intervals.
 To reduce the latency of the TaskTracker heartbeat, one solution is to 
 decouple all the system statistics collection operations from it, and issue 
 separate threads to do the collection work when the TaskTracker starts. There 
 could be three threads: the first collects CPU-related statistics at a short 
 interval; the second collects memory-related statistics at a normal interval; 
 the third collects disk-related statistics at a long interval. All the 
 intervals could be customized by the parameter 
 mapred.stats.collection.interval in mapred-site.xml. The heartbeat could then 
 get the values of the system statistics directly from memory.
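
For illustration, a minimal sketch of the proposed decoupling (class, method,
and stub names are assumptions for this sketch): background threads refresh
cached values at per-resource intervals, and the heartbeat path only reads
fields:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: collectors run on their own schedules, so the
// heartbeat reads cached values instead of recomputing statistics inline.
public class ResourceStatsCacheSketch {
  private volatile float cpuUsage;        // refreshed at a short interval
  private volatile long availableMemory;  // refreshed at a normal interval
  private volatile long freeDiskSpace;    // refreshed at a long interval

  public void start(long cpuMs, long memMs, long diskMs) {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(3);
    pool.scheduleAtFixedRate(new Runnable() {
      public void run() { cpuUsage = sampleCpu(); }
    }, 0, cpuMs, TimeUnit.MILLISECONDS);
    pool.scheduleAtFixedRate(new Runnable() {
      public void run() { availableMemory = sampleMemory(); }
    }, 0, memMs, TimeUnit.MILLISECONDS);
    pool.scheduleAtFixedRate(new Runnable() {
      public void run() { freeDiskSpace = sampleDisk(); }
    }, 0, diskMs, TimeUnit.MILLISECONDS);
  }

  // Heartbeat path: cheap volatile reads, no system probing.
  public float getCpuUsage()       { return cpuUsage; }
  public long getAvailableMemory() { return availableMemory; }
  public long getFreeDiskSpace()   { return freeDiskSpace; }

  // Stubs standing in for the probes the TaskTracker runs inline today.
  private float sampleCpu()   { return 0.0f; }
  private long sampleMemory() { return 0L; }
  private long sampleDisk()   { return 0L; }
}
{code}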

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5026) For shortening the time of TaskTracker heartbeat, decouple the statics collection operations

2013-02-25 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586136#comment-13586136
 ] 

Andrew Wang commented on MAPREDUCE-5026:


Hi Sam,

Thanks for the patch. I moved your issue to MAPREDUCE, since the TaskTracker 
isn't a component of HDFS.

A few minor comments:

* Please rename Statics to Statistics in the code.
* Could you provide some performance numbers to quantify the before-and-after 
improvement?

 For shortening the time of TaskTracker heartbeat, decouple the statics 
 collection operations
 

 Key: MAPREDUCE-5026
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5026
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: performance, tasktracker
Affects Versions: 1.1.1
Reporter: sam liu
  Labels: patch
 Fix For: 1.1.1

 Attachments: HDFS-4527.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 In each heartbeat of the TaskTracker, it calculates some system statistics, 
 like the free disk space, available virtual/physical memory, CPU usage, etc. 
 However, it's not necessary to calculate all the statistics in every 
 heartbeat, and doing so consumes many system resources and impacts the 
 performance of the TaskTracker heartbeat. Furthermore, the characteristics of 
 the system properties (disk, memory, CPU) are different, and it's better to 
 collect their statistics at different intervals.
 To reduce the latency of the TaskTracker heartbeat, one solution is to 
 decouple all the system statistics collection operations from it, and issue 
 separate threads to do the collection work when the TaskTracker starts. There 
 could be three threads: the first collects CPU-related statistics at a short 
 interval; the second collects memory-related statistics at a normal interval; 
 the third collects disk-related statistics at a long interval. All the 
 intervals could be customized by the parameter 
 mapred.stats.collection.interval in mapred-site.xml. The heartbeat could then 
 get the values of the system statistics directly from memory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections

2013-02-25 Thread Jason Lowe (JIRA)
Jason Lowe created MAPREDUCE-5027:
-

 Summary: Shuffle does not limit number of outstanding connections
 Key: MAPREDUCE-5027
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5027
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.23.5, 2.0.3-alpha
Reporter: Jason Lowe


The ShuffleHandler does not have any configurable limit on the number of 
outstanding connections allowed.  Therefore a node with many map outputs, and 
many reducers in the cluster trying to fetch those outputs, can exhaust a 
nodemanager's file descriptors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections

2013-02-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586267#comment-13586267
 ] 

Jason Lowe commented on MAPREDUCE-5027:
---

AFAIK there is no built-in way to have Netty automatically limit the number of 
active client connections.  A quick search on the net indicates that one way 
this is handled is to simply close extra connections as soon as they are 
created, once a specified number of active connections has been reached.
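
For illustration, a hedged sketch of that close-over-limit approach against
the Netty 3 API in use at the time (the class name and limit plumbing are
assumptions, not the eventual patch). The handler would sit first in the
pipeline, ahead of the shuffle's request handlers:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

import org.jboss.netty.channel.ChannelHandler.Sharable;
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ChannelStateEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

// Sketch: count open channels and immediately close any accepted past the
// limit; channelClosed fires for those too, keeping the counter balanced.
@Sharable
public class ConnectionLimitHandlerSketch extends SimpleChannelUpstreamHandler {
  private final AtomicInteger open = new AtomicInteger();
  private final int limit;

  public ConnectionLimitHandlerSketch(int limit) {
    this.limit = limit;
  }

  @Override
  public void channelOpen(ChannelHandlerContext ctx, ChannelStateEvent e)
      throws Exception {
    if (open.incrementAndGet() > limit) {
      e.getChannel().close();   // shed the excess connection right away
    } else {
      super.channelOpen(ctx, e);
    }
  }

  @Override
  public void channelClosed(ChannelHandlerContext ctx, ChannelStateEvent e)
      throws Exception {
    open.decrementAndGet();
    super.channelClosed(ctx, e);
  }
}
{code}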

 Shuffle does not limit number of outstanding connections
 

 Key: MAPREDUCE-5027
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5027
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Jason Lowe

 The ShuffleHandler does not have any configurable limit on the number of 
 outstanding connections allowed.  Therefore a node with many map outputs, and 
 many reducers in the cluster trying to fetch those outputs, can exhaust a 
 nodemanager's file descriptors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections

2013-02-25 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586431#comment-13586431
 ] 

Alejandro Abdelnur commented on MAPREDUCE-5027:
---

The following may be handy: https://issues.jboss.org/browse/NETTY-311

 Shuffle does not limit number of outstanding connections
 

 Key: MAPREDUCE-5027
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5027
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Jason Lowe

 The ShuffleHandler does not have any configurable limit on the number of 
 outstanding connections allowed.  Therefore a node with many map outputs, and 
 many reducers in the cluster trying to fetch those outputs, can exhaust a 
 nodemanager's file descriptors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-5006) streaming tests failing

2013-02-25 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated MAPREDUCE-5006:
--

Attachment: MAPREDUCE-5006-1.patch

 streaming tests failing
 ---

 Key: MAPREDUCE-5006
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5006
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 2.0.4-beta
Reporter: Alejandro Abdelnur
Assignee: Sandy Ryza
 Attachments: MAPREDUCE-5006-1.patch, MAPREDUCE-5006.patch


 The following 2 tests are failing in trunk
 * org.apache.hadoop.streaming.TestStreamReduceNone
 * org.apache.hadoop.streaming.TestStreamXmlRecordReader

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5006) streaming tests failing

2013-02-25 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586551#comment-13586551
 ] 

Sandy Ryza commented on MAPREDUCE-5006:
---

Ok, uploaded a patch that reverts the relevant part of MAPREDUCE-4994.

 streaming tests failing
 ---

 Key: MAPREDUCE-5006
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5006
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 2.0.4-beta
Reporter: Alejandro Abdelnur
Assignee: Sandy Ryza
 Attachments: MAPREDUCE-5006-1.patch, MAPREDUCE-5006.patch


 The following 2 tests are failing in trunk
 * org.apache.hadoop.streaming.TestStreamReduceNone
 * org.apache.hadoop.streaming.TestStreamXmlRecordReader

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5006) streaming tests failing

2013-02-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586572#comment-13586572
 ] 

Hadoop QA commented on MAPREDUCE-5006:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12570900/MAPREDUCE-5006-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 tests included appear to have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-tools/hadoop-streaming.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3359//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3359//console

This message is automatically generated.

 streaming tests failing
 ---

 Key: MAPREDUCE-5006
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5006
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming
Affects Versions: 2.0.4-beta
Reporter: Alejandro Abdelnur
Assignee: Sandy Ryza
 Attachments: MAPREDUCE-5006-1.patch, MAPREDUCE-5006.patch


 The following 2 tests are failing in trunk
 * org.apache.hadoop.streaming.TestStreamReduceNone
 * org.apache.hadoop.streaming.TestStreamXmlRecordReader

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce

2013-02-25 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586660#comment-13586660
 ] 

Jerry Chen commented on MAPREDUCE-5025:
---

[~owen.omalley]
The reason we proposed file formats extended with encryption support, instead 
of something at a lower layer, is so that users can selectively apply 
encryption only where they feel it necessary, only for those MR jobs that 
require it. This also helps keep the resulting files, and the usage of 
encryption, filesystem agnostic.

Adding transparent encryption to a filesystem is an interesting idea, and 
something that we also prototyped as part of this work. Perhaps a Common JIRA 
for an encrypting filesystem derived from FileSystem would be appropriate? Or 
an HDFS JIRA for plugging compression and crypto codecs into block storage 
and transfer? We could look at something like those for follow-on work.


 Key Distribution and Management for supporting crypto codec in Map Reduce
 -

 Key: MAPREDUCE-5025
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5025
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: security
Affects Versions: trunk
Reporter: Jerry Chen
Assignee: Jerry Chen
 Attachments: MAPREDUCE-5025.patch

   Original Estimate: 504h
  Remaining Estimate: 504h

 This task defines the work to enable Map Reduce to utilize the Crypto Codec 
 framework to support encryption and decryption of data during MapReduce Job.
 According to the some real use case and discussions from the community, for 
 encryption and decryption files in Map Reduce, we have the following 
 requirements:
   1. Different stages (input, output, intermediate output) should have the 
 flexibility to choose whether encrypt or not, as well as which crypto codec 
 to use.
   2. Different stages may have different scheme of providing the keys.
   3. Different Files (for example, different input files) may have or use 
 different keys. 
   4. Support a flexible way of retrieving keys for encryption or decryption.
 So this task defines and provides the framework for supporting these 
 requirements as well as the implementations for common use and key retrieving 
 scenarios.
 The design document of this part is included in the Hadoop Crypto Design 
 attached in HADOOP-9331.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value

2013-02-25 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created MAPREDUCE-5028:
---

 Summary: Maps fail when io.sort.mb is set to high value
 Key: MAPREDUCE-5028
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.23.5, 2.0.3-alpha, 1.1.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical


Verified the problem exists on branch-1 with the following configuration:

Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, 
io.sort.mb=1280, dfs.block.size=2147483648

Run teragen to generate 4 GB of data.
Maps fail when you run wordcount on this configuration with the following 
error: 
{noformat}
java.io.IOException: Spill failed
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45)
at 
org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at 
org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
at 
org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
at 
org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346)
{noformat}

Marked branch-0.23 and branch-2 also because the offending code seems to exist 
there too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value

2013-02-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586708#comment-13586708
 ] 

Karthik Kambatla commented on MAPREDUCE-5028:
-

http://comments.gmane.org/gmane.comp.java.hadoop.mapreduce.user/2485 looks like 
the same issue.

 Maps fail when io.sort.mb is set to high value
 --

 Key: MAPREDUCE-5028
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1.1.1, 2.0.3-alpha, 0.23.5
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Verified the problem exists on branch-1 with the following configuration:
 Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, 
 io.sort.mb=1280, dfs.block.size=2147483648
 Run teragen to generate 4 GB of data.
 Maps fail when you run wordcount on this configuration with the following 
 error: 
 {noformat}
 java.io.IOException: Spill failed
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031)
   at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
   at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at 
 org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45)
   at 
 org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at 
 org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
   at 
 org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
   at 
 org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346)
 {noformat}
 Marked branch-0.23 and branch-2 also because the offending code seems to 
 exist there too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value

2013-02-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated MAPREDUCE-5028:


Attachment: mr-5028-branch1.patch

Uploading a patch that fixes the issue.

Turns out buggy use of {{DataInputBuffer.reset()}} and 
{{DataInputBuffer.getLength()}} was leading to integer overflow at these 
considerably large values of io.sort.mb.

{{getLength()}} returns the size of the entire buffer, not just the remaining 
part of the buffer. The offending code assumes otherwise, leading to this 
issue.

For instance, suppose a key is at position 1,224,906,830 and extends to 
1,224,906,868. The length of this key should be set to 38; the code instead 
sets it to 1,224,906,868. The data buffer then interprets this key to end at 
2,449,813,698, which is bigger than 2^31 - 1, leading to a negative value!

For something like this to happen, the starting position of the key must be 
large, roughly (2^31 - 1 - key_size)/2 = 1,073,741,823.5 - key_size/2.

The reported io.sort.mb of 1280 MB is larger than this, and hence we see the 
issue.

The patch fixes the use of reset() and getLength() as required, and also 
updates the javadoc of getLength().

Verified that the patch fixes the problem for the same cases as in the 
description. Not sure how to write a test for this.
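
The arithmetic is easy to reproduce standalone (an illustration of the
overflow only, not the MapTask code):

{code:java}
// Treating an end offset as a length makes start + "length" exceed
// 2^31 - 1 and wrap negative in 32-bit arithmetic.
public class KeyLengthOverflowDemo {
  public static void main(String[] args) {
    int keyStart = 1224906830;                 // where the key begins
    int keyEnd   = 1224906868;                 // one past the key's last byte

    int correctLength = keyEnd - keyStart;     // 38, as it should be
    int buggyLength   = keyEnd;                // the getLength() misuse

    long impliedEnd = (long) keyStart + buggyLength;   // 2,449,813,698
    int wrapped     = keyStart + buggyLength;          // int addition wraps

    System.out.println("correct length = " + correctLength);  // 38
    System.out.println("implied end    = " + impliedEnd);     // > 2^31 - 1
    System.out.println("wrapped int    = " + wrapped);        // negative
  }
}
{code}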

 Maps fail when io.sort.mb is set to high value
 --

 Key: MAPREDUCE-5028
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1.1.1, 2.0.3-alpha, 0.23.5
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: mr-5028-branch1.patch


 Verified the problem exists on branch-1 with the following configuration:
 Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, 
 io.sort.mb=1280, dfs.block.size=2147483648
 Run teragen to generate 4 GB of data.
 Maps fail when you run wordcount on this configuration with the following 
 error: 
 {noformat}
 java.io.IOException: Spill failed
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031)
   at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
   at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at 
 org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45)
   at 
 org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at 
 org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
   at 
 org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
   at 
 org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855)
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346)
 {noformat}
 Marked branch-0.23 and branch-2 also because the offending code seems to 
 exist there too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5029) Recursively take all files in the directories of a root directory

2013-02-25 Thread Abhilash S R (JIRA)
Abhilash S R created MAPREDUCE-5029:
---

 Summary: Recursively take all files in the directories of a root 
directory
 Key: MAPREDUCE-5029
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5029
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 0.20.2
Reporter: Abhilash S R


Suppose we have a root directory with thousands of subdirectories, and each 
directory can contain hundreds of files. When the root directory is specified 
as the input path of a map-reduce job, the program crashes due to the 
subdirectories in the root directory. If this feature is included in the 
latest version, it will be very helpful for programmers.
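
Until such a feature exists, a common workaround is to walk the tree and add
each file explicitly; a minimal sketch against the FileSystem API of that era
(the class name is an assumption):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a workaround: collect every file under the root ourselves, then
// add each one as an input path instead of pointing the job at the root.
public class RecursiveInputListing {
  public static List<Path> listFiles(FileSystem fs, Path root)
      throws IOException {
    List<Path> files = new ArrayList<Path>();
    for (FileStatus stat : fs.listStatus(root)) {
      if (stat.isDir()) {                             // 0.20-era accessor
        files.addAll(listFiles(fs, stat.getPath()));  // descend
      } else {
        files.add(stat.getPath());
      }
    }
    return files;
  }
}
{code}

Each returned path can then be registered with FileInputFormat.addInputPath()
when setting up the job.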

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4974) Optimising the LineRecordReader initialize() method

2013-02-25 Thread Surenkumar Nihalani (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586823#comment-13586823
 ] 

Surenkumar Nihalani commented on MAPREDUCE-4974:


The one line of code that seems to be missing is key.set(pos). Where is that 
being handled?

 Optimising the LineRecordReader initialize() method
 ---

 Key: MAPREDUCE-4974
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4974
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv1, mrv2, performance
Affects Versions: 2.0.2-alpha, 0.23.5
 Environment: Hadoop Linux
Reporter: Arun A K
Assignee: Gelesh
  Labels: patch, performance
 Attachments: MAPREDUCE-4974.1.patch, MAPREDUCE-4974.2.patch, 
 MAPREDUCE-4974.3.patch, MAPREDUCE-4974.4.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I found there is scope for optimizing the code in initialize() if we have 
 compressionCodecs and codec instantiated only when the input is compressed.
 Meanwhile, Gelesh George Omathil added that we could avoid the null check of 
 key and value. This would save time, since the null check is done for every 
 next key/value generation. The intention is to instantiate only once and 
 avoid NPE as well. Hopefully both could be met if we initialize key and value 
 in the initialize() method. We both have worked on it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira