[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799370#action_12799370 ]

Ying He commented on PIG-480:

I ran some tests with a larger data set, and the results are consistent with what we saw before. I didn't run skewed data with no combiner, because it kept running out of space.

1. skewed data

combiner
         job 1          job 2          total
patch    46min          3min 38sec     49min 38sec
trunk    24min 32sec    6min 53sec     31min 25sec

combiner and skewed join
         job 1          job 2          total
patch    6min 40sec     3min 58sec     10min 38sec
trunk    8min 41sec     8min 32sec     17min 13sec

2. uniform data

combiner
         job 1          job 2          total
patch    13min 18sec    7min 9sec      20min 27sec
trunk    19min 1sec     13min 25sec    32min 26sec

no combiner
         job 1          job 2          total
patch    18min 21sec    37min 4sec     55min 25sec
trunk    16min 31sec    40min 3sec     56min 34sec

PERFORMANCE: Use identity mapper in a chain of M-R jobs
---
Key: PIG-480
URL: https://issues.apache.org/jira/browse/PIG-480
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Ying He
Attachments: PIG_480.patch, PIG_480.patch, PIG_480.patch

For jobs with two or more MR jobs, use identity mapper wherever possible in second and subsequent MR jobs. Identity mapper is about 50% faster than Pig's empty map job because it doesn't parse the data.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
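The timings in the comment above are easier to compare as patch-to-trunk ratios of total run time. The following is a minimal sketch of that arithmetic; the dictionary layout and the function names (`total_seconds`, `patch_vs_trunk`) are illustrative and not part of Pig or the patch. Only the numbers come from the comment.

```python
# Timings copied from the comment above, as (minutes, seconds) for job 1 and job 2.
# The structure and names here are illustrative only.
RUNS = {
    "skewed, combiner": {
        "patch": [(46, 0), (3, 38)],
        "trunk": [(24, 32), (6, 53)],
    },
    "skewed, combiner + skewed join": {
        "patch": [(6, 40), (3, 58)],
        "trunk": [(8, 41), (8, 32)],
    },
    "uniform, combiner": {
        "patch": [(13, 18), (7, 9)],
        "trunk": [(19, 1), (13, 25)],
    },
    "uniform, no combiner": {
        "patch": [(18, 21), (37, 4)],
        "trunk": [(16, 31), (40, 3)],
    },
}


def total_seconds(jobs):
    """Sum a list of (minutes, seconds) job timings into seconds."""
    return sum(minutes * 60 + seconds for minutes, seconds in jobs)


def patch_vs_trunk(run):
    """Patch total divided by trunk total; below 1.0 means the patch is faster."""
    return total_seconds(run["patch"]) / total_seconds(run["trunk"])


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(f"{name}: patch/trunk = {patch_vs_trunk(run):.2f}")
```

With these figures the ratios come out to about 1.58 for skewed data with a combiner, 0.62 for skewed data with a combiner and skewed join, 0.63 for uniform data with a combiner, and 0.98 for uniform data without a combiner.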
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799373#action_12799373 ]

Alan Gates commented on PIG-480:

So this code definitely wins in some instances and loses in others. I propose that we do include the functionality, but that we define a property that will turn it off (something like -Dpig.exec.noidentitymap) and clearly document the case where users would want to turn it off. Thoughts?
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12799376#action_12799376 ]

Ying He commented on PIG-480:

The option to turn it off is already there. Use -Dopt.identitymap=false to turn it off.
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797412#action_12797412 ]

Ying He commented on PIG-480:

I did more performance tests. They show that performance depends on the nature of the data. If the data is skewed, performance is very bad for the combiner case. If the data is uniform, the combiner case gets the most performance gain.

The test uses a join followed by a group by statement. For skewed data, if I use skewed join, the result is much better.

I think the reason for the bad performance on skewed data is that the map plan of the second job is moved into the reducer of the first job. If the data is skewed, a single reducer has to execute the extra logic for all its tuples, whereas without this patch that logic would be executed inside multiple mappers. So we lose parallelism. The more skewed the data is, the worse the performance would be.

1. skewed data

combiner
         job 1          job 2          total
patch    7min 53sec     1min 1sec      8min 54sec
trunk    4min 43sec     1min 37sec     6min 20sec

combiner and using skewed join
         job 1          job 2          total
patch    1min 55sec     1min 1sec      2min 56sec
trunk    1min 44sec     1min 40sec     3min 24sec

no combiner
         job 1          job 2          total
patch    2min 26sec     2min 28sec     4min 54sec
trunk    1min 25sec     3min 24sec     4min 49sec

no combiner and using skewed join
         job 1          job 2          total
patch    1min 17sec     3min 5sec      4min 22sec
trunk    59sec          3min 7sec      4min 6sec

2. uniform data

combiner
         job 1          job 2          total
patch    6min 48sec     3min 43sec     10min 31sec
trunk    7min 32sec     7min 3sec      14min 35sec

no combiner
         job 1          job 2          total
patch    1min 25sec     2min 25sec     3min 50sec
trunk    1min 24sec     2min 28sec     3min 52sec

Each group of tests may use different data, so don't make cross-group comparisons.
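The lost-parallelism argument in this comment can be illustrated with a toy model. This is only a sketch under strong assumptions, not Pig's or Hadoop's actual scheduling: keys are assigned to reducers round-robin, mappers split their input evenly, every tuple costs the same, and the I/O savings of the identity mapper are ignored. The names `reducer_loads` and `extra_logic_time` are hypothetical.

```python
def reducer_loads(key_counts, n_reducers):
    """Tuples per reducer when each key goes to exactly one reducer.

    Assumption: keys are assigned round-robin by index, standing in for
    a hash partitioner.
    """
    loads = [0] * n_reducers
    for index, count in enumerate(key_counts):
        loads[index % n_reducers] += count
    return loads


def extra_logic_time(key_counts, n_tasks, in_reducer, cost_per_tuple=1.0):
    """Wall time of the second job's map logic in this toy model.

    in_reducer=True  models the patch: the logic runs inside the first
                     job's reducers, so the busiest reducer sets the time.
    in_reducer=False models trunk: the logic runs in the second job's
                     mappers, which are assumed to split the input evenly.
    """
    if in_reducer:
        busiest = max(reducer_loads(key_counts, n_tasks))
    else:
        busiest = -(-sum(key_counts) // n_tasks)  # ceiling division
    return busiest * cost_per_tuple


if __name__ == "__main__":
    uniform = [100] * 8          # 800 tuples spread evenly over 8 keys
    skewed = [730] + [10] * 7    # 800 tuples, one hot key
    for name, data in (("uniform", uniform), ("skewed", skewed)):
        patch = extra_logic_time(data, 4, in_reducer=True)
        trunk = extra_logic_time(data, 4, in_reducer=False)
        print(f"{name}: patch={patch} trunk={trunk}")
```

In this model the uniform case costs the same either way (200 units with 4 tasks), while the skewed case costs 740 units inside the reducers against 200 in evenly split mappers, which matches the direction of the reported results but says nothing about their magnitude.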
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797518#action_12797518 ]

Hadoop QA commented on PIG-480:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12429598/PIG_480.patch
against trunk revision 896606.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 230 javac compiler warnings (more than the trunk's current 212 warnings).
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    -1 release audit. The applied patch generated 482 release audit warnings (more than the trunk's current 481 warnings).
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/168/console

This message is automatically generated.
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12792569#action_12792569 ]

Alan Gates commented on PIG-480:

What kind of performance gain do we get from this? The only PigMix query that looks like it would be directly affected is PigMix_3. It would be interesting to run that and a few other queries that we expect would benefit from this to measure the performance improvements.
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787060#action_12787060 ]

Ying He commented on PIG-480:

The javac warnings are caused by references to deprecated Hadoop APIs. The release audit warning is for an HTML file.
[jira] Commented: (PIG-480) PERFORMANCE: Use identity mapper in a chain of M-R jobs
[ https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785801#action_12785801 ]

Hadoop QA commented on PIG-480:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12426804/PIG_480.patch
against trunk revision 887049.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The applied patch generated 217 javac compiler warnings (more than the trunk's current 213 warnings).
    -1 findbugs. The patch appears to cause Findbugs to fail.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/88/testReport/
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/88/console

This message is automatically generated.