[ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762425#action_12762425 ]
Hadoop QA commented on PIG-983:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12421322/PIG-983.patch
  against trunk revision 821101.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/13/console

This message is automatically generated.
> PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-983
>                 URL: https://issues.apache.org/jira/browse/PIG-983
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-983.patch
>
>
> The current multi-query optimizer works well with Pig scripts like this one:
> {code}
> data = LOAD 'input' AS (a:chararray, b:int, c:int);
> A = GROUP data BY b;
> B = GROUP data BY c;
> C = FOREACH A GENERATE group, COUNT(data);
> D = FOREACH B GENERATE group, SUM(data.b);
> STORE C INTO 'output1';
> STORE D INTO 'output2';
> {code}
> In this case the optimizer merges the original three Map-Reduce jobs into one MR job.
> The current optimizer, however, won't reduce the number of MR jobs for scripts
> in which multiple group bys follow a join or a cogroup, such as this one:
> {code}
> data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
> data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
> A = JOIN data1 BY a1, data2 BY a2;
> B = GROUP A BY data1::b1;
> C = GROUP A BY data2::c2;
> D = FOREACH B GENERATE group, COUNT(A);
> E = FOREACH C GENERATE group, SUM(A.data2::b2);
> STORE D INTO 'output1';
> STORE E INTO 'output2';
> {code}
> Three MR jobs are still needed to run this script.
> The multi-query optimizer should handle this kind of script by merging the
> group bys and reducing the overall number of MR jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
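
For readers unfamiliar with the request, the following is a rough Python model of what merging the two group bys after the join would achieve. It is not Pig's implementation and says nothing about how the attached patch works. It only assumes that both aggregates from the second script (a count grouped on data1::b1 and a sum of data2::b2 grouped on data2::c2) can be computed in one pass over the join output by tagging each key with the branch it belongs to. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def join_on_first_field(left, right):
    """Inner-join two lists of (a, b, c) tuples on their first field."""
    by_key = defaultdict(list)
    for row in right:
        by_key[row[0]].append(row)
    # Each output row is (a1, b1, c1, a2, b2, c2).
    return [l + r for l in left for r in by_key.get(l[0], [])]


def merged_group_bys(joined):
    """Compute both aggregates in a single pass over the joined rows.

    Keys are tagged with a branch index so that one accumulator can
    serve both group bys, loosely mimicking a single shared job.
    """
    acc = defaultdict(int)
    for a1, b1, c1, a2, b2, c2 in joined:
        acc[(0, b1)] += 1   # branch 0: group on data1::b1, COUNT
        acc[(1, c2)] += b2  # branch 1: group on data2::c2, SUM of data2::b2
    output1 = {key: val for (branch, key), val in acc.items() if branch == 0}
    output2 = {key: val for (branch, key), val in acc.items() if branch == 1}
    return output1, output2


# Hypothetical sample data standing in for 'input1' and 'input2'.
data1 = [("x", 1, 10), ("y", 2, 20), ("x", 1, 30)]
data2 = [("x", 5, 100), ("y", 7, 100), ("x", 9, 200)]

joined = join_on_first_field(data1, data2)
output1, output2 = merged_group_bys(joined)
print(output1)
print(output2)
```

The point of the sketch is only that the join result is scanned once for both outputs, instead of being written out and re-read by a separate job per group by.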