[jira] [Updated] (MAPREDUCE-4882) Error in estimating the length of the output file in Spill Phase
[ https://issues.apache.org/jira/browse/MAPREDUCE-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-4882:
------------------------------------
          Resolution: Duplicate
       Fix Version/s: 2.6.0
    Target Version/s: (was: )
              Status: Resolved (was: Patch Available)

Fixed in MAPREDUCE-6063. Sorry Jerry; didn't see this.

Error in estimating the length of the output file in Spill Phase
----------------------------------------------------------------
                  Key: MAPREDUCE-4882
                  URL: https://issues.apache.org/jira/browse/MAPREDUCE-4882
              Project: Hadoop Map/Reduce
           Issue Type: Bug
    Affects Versions: 0.20.2, 1.0.3
          Environment: Any Environment
             Reporter: Lijie Xu
             Assignee: Jerry Chen
               Labels: BB2015-05-TBR, patch
              Fix For: 2.6.0
          Attachments: MAPREDUCE-4882.patch
    Original Estimate: 1h
   Remaining Estimate: 1h

The sortAndSpill() method in MapTask.java has an error in estimating the length of the output file. When bufend < bufstart, the long size should be (bufvoid - bufstart) + bufend, not (bufvoid - bufend) + bufstart.

Here is the original code in MapTask.java:

    private void sortAndSpill() throws IOException, ClassNotFoundException,
                                       InterruptedException {
      //approximate the length of the output file to be the length of the
      //buffer + header lengths for the partitions
      long size = (bufend >= bufstart
          ? bufend - bufstart
          : (bufvoid - bufend) + bufstart) +
                  partitions * APPROX_HEADER_LENGTH;
      FSDataOutputStream out = null;

I ran a test on TeraSort. A snippet from the mapper's log is as follows:

    MapTask: Spilling map output: record full = true
    MapTask: bufstart = 157286200; bufend = 10485460; bufvoid = 199229440
    MapTask: kvstart = 262142; kvend = 131069; length = 655360
    MapTask: Finished spill 3

In this case, the spilled bytes should be (199229440 - 157286200) + 10485460 = 52428700 (about 52 MB), because the number of spilled records is 524287 and each record costs 100 B.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
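The arithmetic in the report can be reproduced with a small standalone sketch. This is not the code from MapTask.java or from the attached patch: the class name SpillSizeSketch and both method names are hypothetical, and the partitions * APPROX_HEADER_LENGTH term is left out so that only the buffer part of the estimate is compared.

```java
public class SpillSizeSketch {

    // Bytes between bufstart and bufend in a circular buffer of length
    // bufvoid, using the expression proposed in the report.
    static long proposedBytes(int bufstart, int bufend, int bufvoid) {
        return bufend >= bufstart
            ? (long) bufend - bufstart                // contiguous region
            : (long) (bufvoid - bufstart) + bufend;   // wrapped: tail plus head
    }

    // The expression quoted from the original sortAndSpill(), for comparison.
    static long originalBytes(int bufstart, int bufend, int bufvoid) {
        return bufend >= bufstart
            ? (long) bufend - bufstart
            : (long) (bufvoid - bufend) + bufstart;
    }

    public static void main(String[] args) {
        // Values taken from the TeraSort mapper log in the report.
        int bufstart = 157286200;
        int bufend = 10485460;
        int bufvoid = 199229440;

        System.out.println(proposedBytes(bufstart, bufend, bufvoid)); // 52428700
        System.out.println(originalBytes(bufstart, bufend, bufvoid)); // 346030180
    }
}
```

With the logged values, the proposed expression matches 524287 records of 100 B each, while the original expression yields a figure larger than bufvoid itself.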
[jira] [Updated] (MAPREDUCE-4882) Error in estimating the length of the output file in Spill Phase
[ https://issues.apache.org/jira/browse/MAPREDUCE-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated MAPREDUCE-4882:
---------------------------------------
    Labels: BB2015-05-TBR patch (was: patch)
[jira] [Updated] (MAPREDUCE-4882) Error in estimating the length of the output file in Spill Phase
[ https://issues.apache.org/jira/browse/MAPREDUCE-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry Chen updated MAPREDUCE-4882:
---------------------------------
    Attachment: MAPREDUCE-4882.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4882) Error in estimating the length of the output file in Spill Phase
[ https://issues.apache.org/jira/browse/MAPREDUCE-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jerry Chen updated MAPREDUCE-4882:
---------------------------------
    Target Version/s: trunk (was: 0.20.2, 1.0.3)
              Status: Patch Available (was: Open)

A patch fixing the problem is attached. It changes (bufvoid - bufend) + bufstart to (bufvoid - bufstart) + bufend and adds a test case that detects an invalid size estimate: when bufend < bufstart, (bufvoid - bufend) + bufstart is greater than bufvoid. Please kindly help review the patch.
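The bound mentioned in the comment above (an estimate larger than bufvoid cannot be valid) can be checked exhaustively on a small buffer. The following is only a sketch of that idea, not the test case from the attached patch: the class SpillEstimateCheck, its methods, and the buffer size of 100 are assumptions made for illustration, and positions are assumed to lie in [0, bufvoid).

```java
public class SpillEstimateCheck {

    // Expression proposed in the patch description.
    static long proposed(int bufstart, int bufend, int bufvoid) {
        return bufend >= bufstart
            ? (long) bufend - bufstart
            : (long) (bufvoid - bufstart) + bufend;
    }

    // Expression quoted from the original code.
    static long original(int bufstart, int bufend, int bufvoid) {
        return bufend >= bufstart
            ? (long) bufend - bufstart
            : (long) (bufvoid - bufend) + bufstart;
    }

    // A buffered byte count can never be negative or exceed the buffer length.
    static boolean plausible(long estimate, int bufvoid) {
        return estimate >= 0 && estimate <= bufvoid;
    }

    // Counts (bufstart, bufend) pairs for which the chosen expression gives
    // an implausible estimate.
    static int countViolations(int bufvoid, boolean useProposed) {
        int violations = 0;
        for (int start = 0; start < bufvoid; start++) {
            for (int end = 0; end < bufvoid; end++) {
                long estimate = useProposed
                    ? proposed(start, end, bufvoid)
                    : original(start, end, bufvoid);
                if (!plausible(estimate, bufvoid)) {
                    violations++;
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        int bufvoid = 100; // hypothetical small buffer for an exhaustive sweep
        System.out.println("proposed violations: " + countViolations(bufvoid, true));
        System.out.println("original violations: " + countViolations(bufvoid, false));
    }
}
```

Under these assumptions the proposed expression never leaves the valid range, whereas the original expression violates the bound for every wrapped pair (bufend < bufstart).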