[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack
[ https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585713#comment-13585713 ] Hadoop QA commented on MAPREDUCE-4502:
--
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12570735/MAPREDUCE-4502.5.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files.
{color:red}-1 one of the tests included doesn't have a timeout.{color}
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3357//console
This message is automatically generated.
Multi-level aggregation with combining the result of maps per node/rack
---
Key: MAPREDUCE-4502
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4502
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: applicationmaster, mrv2
Affects Versions: 3.0.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: design_v2.pdf, MAPREDUCE-4502.1.patch, MAPREDUCE-4502.2.patch, MAPREDUCE-4502.3.patch, MAPREDUCE-4502.4.patch, MAPREDUCE-4502.5.patch, MAPREDUCE-4525-pof.diff, speculative_draft.pdf

The shuffle cost in Hadoop is expensive in spite of the combiner, because the scope of combining is limited to a single MapTask. To solve this problem, it is effective to aggregate the results of maps per node/rack by launching combiners. This JIRA is to implement the multi-level aggregation infrastructure, including combining per container (MAPREDUCE-3902 is related) and coordinating containers by the application master without breaking the fault tolerance of jobs.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
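The saving described above can be illustrated with a minimal, self-contained sketch (plain Java, not the actual MapReduce API or the patch): when several map tasks on the same node have already run their local combiners, a node-level combine merges duplicate keys across those maps, so fewer records cross the network during the shuffle.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: node-level combining merges per-MapTask combiner
// outputs before the shuffle, so a key produced by several maps on one
// node is shuffled once instead of once per map.
public class NodeCombineSketch {
    static Map<String, Integer> combine(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> perNode = new HashMap<>();
        for (Map<String, Integer> out : mapOutputs) {
            for (Map.Entry<String, Integer> e : out.entrySet()) {
                perNode.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return perNode;
    }

    public static void main(String[] args) {
        // Two map tasks on the same node, each already combined locally.
        Map<String, Integer> map1 = Map.of("hadoop", 3, "shuffle", 1);
        Map<String, Integer> map2 = Map.of("hadoop", 2, "combine", 4);
        Map<String, Integer> perNode = combine(List.of(map1, map2));
        // Per-map shuffling would send 4 records; node-level combining sends 3.
        System.out.println(perNode.size());        // 3
        System.out.println(perNode.get("hadoop")); // 5
    }
}
```

The same idea extends one level up to rack-level aggregation, at the cost of the coordination the JIRA describes.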
[jira] [Updated] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack
[ https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated MAPREDUCE-4502:
--
Attachment: MAPREDUCE-4502.6.patch

Oops, I attached the wrong patch. This is the correct one.
[jira] [Updated] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Chen updated MAPREDUCE-5025:
--
Attachment: (was: MAPREDUCE-5025.patch)

Key Distribution and Management for supporting crypto codec in Map Reduce
-
Key: MAPREDUCE-5025
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5025
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: security
Affects Versions: trunk
Reporter: Jerry Chen
Assignee: Jerry Chen
Original Estimate: 504h
Remaining Estimate: 504h

This task defines the work to enable Map Reduce to utilize the Crypto Codec framework to support encryption and decryption of data during a MapReduce job. According to some real use cases and discussions from the community, encryption and decryption of files in Map Reduce have the following requirements:
1. Different stages (input, output, intermediate output) should have the flexibility to choose whether to encrypt or not, as well as which crypto codec to use.
2. Different stages may have different schemes of providing the keys.
3. Different files (for example, different input files) may use different keys.
4. Support a flexible way of retrieving keys for encryption or decryption.
So this task defines and provides the framework for supporting these requirements as well as the implementations for common use and key retrieval scenarios. The design document for this part is included in the Hadoop Crypto Design attached to HADOOP-9331.
[jira] [Updated] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Chen updated MAPREDUCE-5025:
--
Attachment: MAPREDUCE-5025.patch
[jira] [Commented] (MAPREDUCE-4502) Multi-level aggregation with combining the result of maps per node/rack
[ https://issues.apache.org/jira/browse/MAPREDUCE-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585765#comment-13585765 ] Hadoop QA commented on MAPREDUCE-4502:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12570748/MAPREDUCE-4502.6.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files.
{color:green}+1 tests included appear to have a timeout.{color}
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3358//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3358//console
This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4974) Optimising the LineRecordReader initialize() method
[ https://issues.apache.org/jira/browse/MAPREDUCE-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585797#comment-13585797 ] Arun A K commented on MAPREDUCE-4974:
-
As [~gelesh] has mentioned, we had in mind the elimination of repeated null checks while trying to optimize the code. If it is not of much significance, please go ahead with the latest available patch containing the rest of the changes.

Optimising the LineRecordReader initialize() method
---
Key: MAPREDUCE-4974
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4974
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv1, mrv2, performance
Affects Versions: 2.0.2-alpha, 0.23.5
Environment: Hadoop Linux
Reporter: Arun A K
Assignee: Gelesh
Labels: patch, performance
Attachments: MAPREDUCE-4974.1.patch, MAPREDUCE-4974.2.patch, MAPREDUCE-4974.3.patch, MAPREDUCE-4974.4.patch
Original Estimate: 1h
Remaining Estimate: 1h

I found there is scope for optimizing the code in initialize(): compressionCodecs should be instantiated only if the input is compressed. Meanwhile, Gelesh George Omathil added that we could avoid the null check of key and value. This would save time, since the null check is done for every next key/value generation. The intention is to instantiate only once and avoid NPE as well. Hope both could be met by initializing key and value in the initialize() method. We both have worked on it.
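The pattern under discussion can be sketched in a self-contained way (this is not the actual LineRecordReader code; the class and fields below are stand-ins): allocate the key and value objects once in initialize(), so nextKeyValue() performs no per-record null checks or allocations.

```java
// Sketch of the optimization being discussed (hypothetical stand-in class,
// not Hadoop's LineRecordReader): key/value are created once in
// initialize(), so the hot nextKeyValue() path has no null checks.
public class RecordReaderSketch {
    private StringBuilder key;    // stand-ins for LongWritable / Text
    private StringBuilder value;
    private int pos = 0;
    private final String[] lines;

    public RecordReaderSketch(String[] lines) { this.lines = lines; }

    public void initialize() {
        // Instantiate exactly once, instead of
        // "if (key == null) key = new ..." inside every nextKeyValue() call.
        key = new StringBuilder();
        value = new StringBuilder();
    }

    public boolean nextKeyValue() {
        if (pos >= lines.length) return false;
        key.setLength(0);      // reuse, don't reallocate
        key.append(pos);
        value.setLength(0);
        value.append(lines[pos]);
        pos++;
        return true;
    }

    public String getCurrentValue() { return value.toString(); }

    public static void main(String[] args) {
        RecordReaderSketch r = new RecordReaderSketch(new String[] {"a", "b"});
        r.initialize();
        int n = 0;
        while (r.nextKeyValue()) n++;
        System.out.println(n); // 2
    }
}
```

As the comment thread suggests, the per-record saving from a single null check is small; the clearer win is moving one-time setup (like instantiating compressionCodecs only for compressed input) out of the record loop.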
[jira] [Commented] (MAPREDUCE-5006) streaming tests failing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585963#comment-13585963 ] Tom White commented on MAPREDUCE-5006:
--
It's the current behaviour - in branch-1 as well - so if we want to change it, we should do it in a compatible way, e.g. with mapred.local.job.maps as you suggested. That should be a different JIRA though, and this one should fix the tests by reverting the relevant part of MAPREDUCE-4994.

streaming tests failing
---
Key: MAPREDUCE-5006
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5006
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 2.0.4-beta
Reporter: Alejandro Abdelnur
Assignee: Sandy Ryza
Attachments: MAPREDUCE-5006.patch

The following 2 tests are failing in trunk:
* org.apache.hadoop.streaming.TestStreamReduceNone
* org.apache.hadoop.streaming.TestStreamXmlRecordReader
[jira] [Moved] (MAPREDUCE-5026) For shortening the time of TaskTracker heartbeat, decouple the statics collection operations
[ https://issues.apache.org/jira/browse/MAPREDUCE-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang moved HDFS-4527 to MAPREDUCE-5026:
--
Component/s: (was: performance) tasktracker performance
Fix Version/s: (was: 1.1.1) 1.1.1
Target Version/s: (was: 1.1.1)
Affects Version/s: (was: 1.1.1) 1.1.1
Key: MAPREDUCE-5026 (was: HDFS-4527)
Project: Hadoop Map/Reduce (was: Hadoop HDFS)

For shortening the time of TaskTracker heartbeat, decouple the statics collection operations
Key: MAPREDUCE-5026
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5026
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: performance, tasktracker
Affects Versions: 1.1.1
Reporter: sam liu
Labels: patch
Fix For: 1.1.1
Attachments: HDFS-4527.patch
Original Estimate: 24h
Remaining Estimate: 24h

In each heartbeat, the TaskTracker calculates some system statistics, such as free disk space, available virtual/physical memory, and CPU usage. However, it is not necessary to calculate all of these statistics in every heartbeat; doing so consumes system resources and impacts the performance of the TaskTracker heartbeat. Furthermore, the characteristics of the system properties (disk, memory, CPU) differ, so it is better to collect their statistics at different intervals. To reduce the latency of the TaskTracker heartbeat, one solution is to decouple all the statistics collection operations from it and issue separate threads to do the collection work when the TaskTracker starts. There could be three threads: the first collects CPU-related statistics at a short interval; the second collects memory-related statistics at a normal interval; the third collects disk-related statistics at a long interval. All the intervals could be customized by the parameter mapred.stats.collection.interval in mapred-site.xml. Finally, the heartbeat could read the statistics values directly from memory.
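The design described above can be sketched as follows. This is a minimal, self-contained illustration, not the attached patch: three scheduled tasks refresh cached statistics at different intervals, and the heartbeat path only reads the cached values. The sampling methods are hypothetical stand-ins for the real probes.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustration of the proposed decoupling: collectors run on their own
// schedules; the heartbeat reads cached values from memory, never sampling.
public class StatsCollectorSketch {
    private final AtomicLong cpuStat = new AtomicLong();
    private final AtomicLong memStat = new AtomicLong();
    private final AtomicLong diskStat = new AtomicLong();
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(3);

    // Intervals would be read from configuration (the issue proposes
    // mapred.stats.collection.interval in mapred-site.xml).
    public void start(long cpuMs, long memMs, long diskMs) {
        scheduler.scheduleAtFixedRate(() -> cpuStat.set(sampleCpu()), 0, cpuMs, TimeUnit.MILLISECONDS);
        scheduler.scheduleAtFixedRate(() -> memStat.set(sampleMem()), 0, memMs, TimeUnit.MILLISECONDS);
        scheduler.scheduleAtFixedRate(() -> diskStat.set(sampleDisk()), 0, diskMs, TimeUnit.MILLISECONDS);
    }

    // Hypothetical stand-ins for the real probes (e.g. /proc reads, df).
    private long sampleCpu()  { return System.nanoTime() % 100; }
    private long sampleMem()  { return Runtime.getRuntime().freeMemory(); }
    private long sampleDisk() { return 42; }

    // Heartbeat path: no sampling, just cached reads from memory.
    public long[] heartbeatSnapshot() {
        return new long[] { cpuStat.get(), memStat.get(), diskStat.get() };
    }

    public void stop() { scheduler.shutdownNow(); }

    public static void main(String[] args) throws InterruptedException {
        StatsCollectorSketch s = new StatsCollectorSketch();
        s.start(100, 500, 2000); // short, normal, long intervals
        Thread.sleep(300);       // let each collector run at least once
        long[] snap = s.heartbeatSnapshot();
        s.stop();
        System.out.println(snap.length); // 3
    }
}
```

The heartbeat may read a value that is up to one interval stale, which is the trade-off the proposal accepts in exchange for a faster heartbeat.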
[jira] [Commented] (MAPREDUCE-5026) For shortening the time of TaskTracker heartbeat, decouple the statics collection operations
[ https://issues.apache.org/jira/browse/MAPREDUCE-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586136#comment-13586136 ] Andrew Wang commented on MAPREDUCE-5026:
Hi Sam,
Thanks for the patch. I moved your issue to MAPREDUCE, since the TaskTracker isn't a component of HDFS. A few minor comments:
* Please rename Statics to Statistics in the code.
* Could you provide some performance numbers, to quantify the before and after improvement?
[jira] [Created] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections
Jason Lowe created MAPREDUCE-5027:
-
Summary: Shuffle does not limit number of outstanding connections
Key: MAPREDUCE-5027
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5027
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.23.5, 2.0.3-alpha
Reporter: Jason Lowe

The ShuffleHandler does not have any configurable limit on the number of outstanding connections allowed. Therefore a node with many map outputs, and many reducers in the cluster trying to fetch those outputs, can exhaust a nodemanager's file descriptors.
[jira] [Commented] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections
[ https://issues.apache.org/jira/browse/MAPREDUCE-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586267#comment-13586267 ] Jason Lowe commented on MAPREDUCE-5027:
---
AFAIK there is no built-in way to have Netty automatically limit the number of active client connections. A quick search on the net indicates one way this is handled is to simply close the extra connections as soon as they are created once we get past a specified number of active connections.
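The approach described in the comment above can be sketched without Netty (a plain-Java illustration, not the eventual fix): keep a counter of active connections and immediately reject any connection accepted past the cap. In a real Netty handler this logic would run in the channel-connected callback, with rejection meaning closing the channel right away; here accept()/release() are hypothetical stand-ins.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Netty-free sketch of the approach described above: track active
// connections and immediately drop any connection over the cap.
public class ConnectionLimiterSketch {
    private final int maxConnections;
    private final AtomicInteger active = new AtomicInteger();

    public ConnectionLimiterSketch(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    // Stand-in for the channel-connected callback; returning false
    // stands in for closing the extra connection right away.
    public boolean accept() {
        if (active.incrementAndGet() > maxConnections) {
            active.decrementAndGet(); // over the cap: reject/close
            return false;
        }
        return true;
    }

    // Stand-in for the channel-closed callback.
    public void release() { active.decrementAndGet(); }

    public static void main(String[] args) {
        ConnectionLimiterSketch limiter = new ConnectionLimiterSketch(2);
        System.out.println(limiter.accept()); // true
        System.out.println(limiter.accept()); // true
        System.out.println(limiter.accept()); // false: over the cap
        limiter.release();                    // one connection finishes
        System.out.println(limiter.accept()); // true again
    }
}
```

Closing excess connections makes fetchers retry rather than queue, but it bounds the file descriptors the ShuffleHandler can consume.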
[jira] [Commented] (MAPREDUCE-5027) Shuffle does not limit number of outstanding connections
[ https://issues.apache.org/jira/browse/MAPREDUCE-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586431#comment-13586431 ] Alejandro Abdelnur commented on MAPREDUCE-5027:
---
The following may be handy: https://issues.jboss.org/browse/NETTY-311
[jira] [Updated] (MAPREDUCE-5006) streaming tests failing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated MAPREDUCE-5006:
--
Attachment: MAPREDUCE-5006-1.patch
[jira] [Commented] (MAPREDUCE-5006) streaming tests failing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586551#comment-13586551 ] Sandy Ryza commented on MAPREDUCE-5006:
---
Ok, uploaded a patch that reverts the relevant part of MAPREDUCE-4994.
[jira] [Commented] (MAPREDUCE-5006) streaming tests failing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586572#comment-13586572 ] Hadoop QA commented on MAPREDUCE-5006:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12570900/MAPREDUCE-5006-1.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 tests included appear to have a timeout.{color}
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-tools/hadoop-streaming.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3359//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3359//console
This message is automatically generated.
[jira] [Commented] (MAPREDUCE-5025) Key Distribution and Management for supporting crypto codec in Map Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586660#comment-13586660 ] Jerry Chen commented on MAPREDUCE-5025:
---
[~owen.omalley] The reason we proposed file formats extended with encryption support instead of something at a lower layer is so the user can selectively apply encryption only where they feel necessary, only for those MR jobs that require it. This also helps keep the resulting files and the usage of encryption filesystem agnostic. Adding transparent encryption to a filesystem is an interesting idea and something that we also prototyped as part of this work. Perhaps a Common JIRA for an encrypting filesystem derived from FileSystem would be appropriate? Or an HDFS JIRA for plugging in compression and crypto codecs to block storage and transfer? We could look at something like those for follow on work.
[jira] [Created] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value
Karthik Kambatla created MAPREDUCE-5028:
---
Summary: Maps fail when io.sort.mb is set to high value
Key: MAPREDUCE-5028
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5028
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.23.5, 2.0.3-alpha, 1.1.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

Verified the problem exists on branch-1 with the following configuration:
Pseudo-dist mode: 2 maps/ 1 reduce, mapred.child.java.opts=-Xmx2048m, io.sort.mb=1280, dfs.block.size=2147483648
Run teragen to generate 4 GB data.
Maps fail when you run wordcount on this configuration with the following error:
{noformat}
java.io.IOException: Spill failed
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1031)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
	at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:45)
	at org.apache.hadoop.examples.WordCount$TokenizerMapper.map(WordCount.java:34)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:375)
	at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:38)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
	at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1505)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1438)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:855)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1346)
{noformat}
Marked branch-0.23 and branch-2 also because the offending code seems to exist there too.
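The repro configuration listed above, written out as a mapred/hdfs site configuration fragment (these are the branch-1-era property names as given in the report; only the values from the report are used):

```xml
<!-- Repro settings from the report (branch-1 era property names). -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>1280</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>2147483648</value>
  </property>
</configuration>
```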
[jira] [Commented] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value
[ https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586708#comment-13586708 ]

Karthik Kambatla commented on MAPREDUCE-5028:
---------------------------------------------

http://comments.gmane.org/gmane.comp.java.hadoop.mapreduce.user/2485 looks like the same issue.
[jira] [Updated] (MAPREDUCE-5028) Maps fail when io.sort.mb is set to high value
[ https://issues.apache.org/jira/browse/MAPREDUCE-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated MAPREDUCE-5028:
----------------------------------------

    Attachment: mr-5028-branch1.patch

Uploading a patch that fixes the issue.

Turns out buggy use of {{DataInputBuffer.reset()}} and {{DataInputBuffer.getLength()}} was leading to integer overflow at these considerably large values of io.sort.mb. {{getLength()}} returns the size of the entire buffer, not just the remaining part of the buffer; the offending code assumes otherwise, leading to this issue.

For instance, suppose a key starts at position 1,224,906,830 and extends to 1,224,906,868. The length of this key should be set to 38; the code instead sets it to 1,224,906,868. The data buffer then interprets the key as ending at 2,449,813,698, which is bigger than 2^31 - 1, leading to a negative value!!

For something like this to happen, the starting position of the key must be large, roughly (2^31 - 1 - key_size)/2 = 1,073,741,823.5 - key_size/2. The reported io.sort.mb of 1280 MB is larger than this, and hence we see the issue.

The patch fixes the use of reset() and getLength() as required, and also updates the javadoc of getLength(). Verified that the patch fixes the problem for the same cases as in the description. Not sure how to write a test for this.
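The overflow arithmetic described above can be reproduced with a few lines of plain Java. This is an illustrative sketch only, not the actual MapTask/DataInputBuffer code; it uses the positions from the example to show how treating getLength() (an absolute end offset) as a remaining length makes the computed key end wrap past 2^31 - 1:

```java
// Illustrative sketch only -- not the actual Hadoop code.
// Demonstrates the int overflow described above.
public class SortMbOverflow {
    public static void main(String[] args) {
        int keyStart = 1224906830; // key begins here in the sort buffer
        int keyEnd   = 1224906868; // key ends here; the true length is 38

        // Buggy assumption: add the "length" (actually the absolute end
        // offset returned by getLength()) to the start position.
        int buggyEnd = keyStart + keyEnd;  // 2,449,813,698 does not fit in an int
        System.out.println(buggyEnd);      // prints -1845153598 (wrapped negative)

        // Correct computation: the key length is end minus start.
        int correctLength = keyEnd - keyStart;
        System.out.println(correctLength); // prints 38
    }
}
```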
[jira] [Created] (MAPREDUCE-5029) Recursively take all files in the directories of a root directory
Abhilash S R created MAPREDUCE-5029:
---------------------------------------

             Summary: Recursively take all files in the directories of a root directory
                 Key: MAPREDUCE-5029
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5029
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2
            Reporter: Abhilash S R

Suppose we have a root directory with thousands of sub-directories, and each sub-directory can contain hundreds of files. When the root directory is specified as the input path of a map-reduce job, the program crashes because of the sub-directories inside it. If this feature is included in the latest version, it will be a great help for programmers.
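As an illustration of the requested behavior, a recursive walk that collects every regular file under a root directory could look like the sketch below. This is plain Java with a hypothetical helper name, not an existing Hadoop API; it only shows the descent-through-sub-directories semantics the request asks for:

```java
// Illustrative sketch (hypothetical helper, not a Hadoop API): collect all
// regular files under a root directory, descending into every sub-directory.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class RecursiveFileLister {
    public static List<File> listFilesRecursively(File root) {
        List<File> result = new ArrayList<>();
        File[] entries = root.listFiles();
        if (entries == null) {
            return result; // root is not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                result.addAll(listFilesRecursively(entry)); // descend
            } else {
                result.add(entry); // regular file: keep it
            }
        }
        return result;
    }
}
```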
[jira] [Commented] (MAPREDUCE-4974) Optimising the LineRecordReader initialize() method
[ https://issues.apache.org/jira/browse/MAPREDUCE-4974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586823#comment-13586823 ]

Surenkumar Nihalani commented on MAPREDUCE-4974:
------------------------------------------------

The one line of code that seems to be missing is key.set(pos). Where is that being handled?

Optimising the LineRecordReader initialize() method
---------------------------------------------------

                 Key: MAPREDUCE-4974
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4974
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv1, mrv2, performance
    Affects Versions: 2.0.2-alpha, 0.23.5
         Environment: Hadoop Linux
            Reporter: Arun A K
            Assignee: Gelesh
              Labels: patch, performance
         Attachments: MAPREDUCE-4974.1.patch, MAPREDUCE-4974.2.patch, MAPREDUCE-4974.3.patch, MAPREDUCE-4974.4.patch
   Original Estimate: 1h
  Remaining Estimate: 1h

I found there is scope for optimizing the code in initialize(): the compressionCodecs codec should be instantiated only if the input is compressed. Meanwhile, Gelesh George Omathil added that we could avoid the null check of key and value. This would save time, since the null check is done for every next key/value generation. The intention is to instantiate only once and avoid an NPE as well. Both goals could be met if key and value are initialized in the initialize() method. We have both worked on it.
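The pattern under discussion can be sketched in a simplified, self-contained form. This is not the actual LineRecordReader source; it uses plain Java stand-ins for the Hadoop key/value types, and merely illustrates allocating the key and value once in initialize() so that nextKeyValue() only updates them (including the key.set(pos) step the comment asks about) with no per-record null check:

```java
// Simplified sketch, not the Hadoop LineRecordReader: key and value are
// allocated once in initialize(); nextKeyValue() only updates them.
import java.util.Iterator;

public class SketchRecordReader {
    private long[] key;            // stand-in for a mutable LongWritable
    private StringBuilder value;   // stand-in for a mutable Text
    private long pos;              // byte offset of the current line
    private Iterator<String> lines;

    public void initialize(Iterator<String> input, long startPos) {
        this.lines = input;
        this.pos = startPos;
        this.key = new long[1];            // allocated once, up front
        this.value = new StringBuilder();  // so nextKeyValue() needs no null check
    }

    public boolean nextKeyValue() {
        if (!lines.hasNext()) {
            return false;
        }
        key[0] = pos;                 // the key.set(pos) step
        String line = lines.next();
        value.setLength(0);
        value.append(line);
        pos += line.length() + 1;     // +1 for the newline separator
        return true;
    }

    public long currentKey()     { return key[0]; }
    public String currentValue() { return value.toString(); }
}
```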