[jira] Commented: (PIG-920) optimizing diamond queries

2009-10-30 Thread Richard Ding (JIRA)

[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772066#action_12772066 ]

Richard Ding commented on PIG-920:
--

Added the additional comments requested in the review.

> optimizing diamond queries
> --
>
> Key: PIG-920
> URL: https://issues.apache.org/jira/browse/PIG-920
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Richard Ding
> Attachments: PIG-920.patch, PIG-920.patch
>
>
> The following query
> A = load 'foo';
> B = filter A by $0 > 1;
> C = filter A by $1 == 'foo';
> D = COGROUP C by $0, B by $0;
> ..
> is not executed efficiently. Currently, it runs a map-only job that
> basically reads and writes the same data before doing the query processing.
> A query in which the data is loaded twice actually executes more efficiently.
> This is not an uncommon query, and we should fix this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-920) optimizing diamond queries

2009-10-29 Thread Richard Ding (JIRA)

[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771727#action_12771727 ]

Richard Ding commented on PIG-920:
--

bq. It would be good to add some comments in the following code on why the plan 
size should be 2 or 3 and what the POForEach is

Will do.

bq. Just to be safe it might be better to check that there is only 1 successor 
before this code:

The load operator can have only one successor (supportsMultipleOutputs = false).

bq. Is the following by design even in the case where multiple successors are 
present for splitter?

The return value is the number of MR operators merged, that is, removed from
the plan by this method. For this method, it can be either 0 or 1.
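
To illustrate the convention described above, here is a minimal sketch. The class and method names and the string-based plan are hypothetical stand-ins, not Pig code. It assumes the method removes at most one operator (the splitter) and reports that count, regardless of how many successors the splitter has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeCountSketch {
    // Hypothetical stand-in for a merge step: removes the splitter from the
    // plan when it is mergeable and returns how many operators were removed.
    // The result is 0 or 1, independent of the number of successors.
    static int mergeSplitter(List<String> plan, String splitter, boolean mergeable) {
        if (!mergeable || !plan.contains(splitter)) {
            return 0;
        }
        plan.remove(splitter);
        return 1;
    }

    public static void main(String[] args) {
        // A splitter with two successors still counts as one removed operator.
        List<String> plan = new ArrayList<>(Arrays.asList("splitter", "succ1", "succ2"));
        int merged = mergeSplitter(plan, "splitter", true);
        System.out.println(merged + " merged, " + plan.size() + " left");
    }
}
```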





[jira] Commented: (PIG-920) optimizing diamond queries

2009-10-29 Thread Pradeep Kamath (JIRA)

[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771712#action_12771712 ]

Pradeep Kamath commented on PIG-920:


In MultiQueryOptimizer.java (the numbers in the code blocks below are line 
numbers):

It would be good to add some comments in the following code on why the plan
size should be 2 or 3 and what the POForEach is:
{noformat}
 223     if (pl.size() == 2 || pl.size() == 3) {
 224         PhysicalOperator root = pl.getRoots().get(0);
 225         PhysicalOperator leaf = pl.getLeaves().get(0);
 226         if (root instanceof POLoad && leaf instanceof POStore) {
 227             if (pl.size() == 3) {
 228                 PhysicalOperator mid = pl.getSuccessors(root).get(0);
 229                 if (mid instanceof POForEach) {
 230                     rtn = true;
 231                 }
 232             } else {
 233                 rtn = true;
 234             }
 235         }
 236     }
 237 }
{noformat}
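
For readers without the source at hand, the condition in the snippet above can be restated as a self-contained sketch. The `Op` enum and `isLoadStoreOnly` are hypothetical stand-ins for Pig's operator classes, and the sketch assumes a linear plan given as a list from root to leaf, which simplifies the real plan graph. It only restates what the code checks and does not explain why the patch chose these sizes.

```java
import java.util.Arrays;
import java.util.List;

public class DiamondCheckSketch {
    // Hypothetical stand-ins for POLoad, POForEach, POStore and any other operator.
    enum Op { LOAD, FOREACH, STORE, OTHER }

    // True when the plan is load -> store, or load -> foreach -> store.
    static boolean isLoadStoreOnly(List<Op> plan) {
        int n = plan.size();
        if (n != 2 && n != 3) {
            return false;
        }
        if (plan.get(0) != Op.LOAD || plan.get(n - 1) != Op.STORE) {
            return false;
        }
        // With three operators, the middle one must be a foreach.
        return n == 2 || plan.get(1) == Op.FOREACH;
    }

    public static void main(String[] args) {
        System.out.println(isLoadStoreOnly(Arrays.asList(Op.LOAD, Op.STORE)));
        System.out.println(isLoadStoreOnly(Arrays.asList(Op.LOAD, Op.FOREACH, Op.STORE)));
        System.out.println(isLoadStoreOnly(Arrays.asList(Op.LOAD, Op.OTHER, Op.STORE)));
    }
}
```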


Just to be safe it might be better to check that there is only 1 successor 
before this code:
{noformat}
 265 PhysicalOperator opSucc = succ.mapPlan.getSuccessors(op).get(0);
{noformat}
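
The suggested guard could look roughly like the sketch below. `onlySuccessor` is a hypothetical helper, not part of Pig. Treating a null list as "no successors" is an assumption about how the plan reports an operator without successors.

```java
import java.util.Arrays;
import java.util.List;

public class SingleSuccessorSketch {
    // Returns the single successor, or null when there is not exactly one.
    // The null check on the list itself is a defensive assumption.
    static <T> T onlySuccessor(List<T> successors) {
        if (successors == null || successors.size() != 1) {
            return null;
        }
        return successors.get(0);
    }

    public static void main(String[] args) {
        System.out.println(onlySuccessor(Arrays.asList("filter")));
        System.out.println(onlySuccessor(Arrays.asList("filter", "foreach")));
    }
}
```

A caller would skip the merge when the helper returns null instead of calling `get(0)` unconditionally.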

Is the following by design even in the case where multiple successors are 
present for splitter?
{noformat}
 309 return 1;
{noformat}





[jira] Commented: (PIG-920) optimizing diamond queries

2009-10-27 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/PIG-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770800#action_12770800 ]

Hadoop QA commented on PIG-920:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423365/PIG-920.patch
  against trunk revision 830335.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/118/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/118/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/118/console

This message is automatically generated.
