[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777589#action_12777589
 ] 

Daniel Dai commented on PIG-1038:
-

Continuing from my previous comment:

4. Strip secondary keys from the value

5. Write a byte version of OutputKeyComparator

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, 
 PIG-1038-4.patch, PIG-1038-5.patch


 If a nested foreach plan contains sort/distinct, it is possible to use Hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = order A by $1;
     generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
     D = A.$1;
     E = distinct D;
     generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify "D = A.$1; E = distinct 
 D" to a special version of distinct that does not do the sorting.
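
 The effect described in Eg1 and Eg2 can be illustrated outside Pig. The 
 following is a minimal stand-alone Java sketch, not Pig or Hadoop code: the 
 class SecondarySortSketch, its Row type, and both method names are hypothetical. 
 It sorts rows by ($0, $1) the way a shuffle with a compound sort key would, 
 groups them by $0, and shows that each group's values then arrive already 
 ordered by $1, so a distinct only needs to drop adjacent duplicates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the idea; none of these names exist in Pig.
class SecondarySortSketch {

    // One input row: key stands in for $0, value for $1.
    static final class Row {
        final String key;
        final int value;

        Row(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sort by (key, value), then group by key alone. Because the sort already
    // ordered rows by value within each key, every group's list is sorted.
    static Map<String, List<Integer>> groupWithSecondarySort(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((Row r) -> r.key)
                .thenComparingInt(r -> r.value));
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Row r : sorted) {
            groups.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        return groups;
    }

    // Distinct over an already sorted list: compare each value with the last
    // one kept, so no additional sorting or hashing is needed.
    static List<Integer> distinctSorted(List<Integer> sortedValues) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : sortedValues) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(v)) {
                out.add(v);
            }
        }
        return out;
    }
}
```

 In this model, Eg1 corresponds to reading a group's list as is, and Eg2 to 
 calling distinctSorted on it.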

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777318#action_12777318
 ] 

Daniel Dai commented on PIG-1038:
-

A couple of limitations of the current implementation, which will be addressed 
later:
1. If the sort is not nested inside a foreach plan, it will not be optimized. 
That is the merge join indexing issue that Ashutosh found.

2. All distinct keys are assumed to be sorted in ascending order. For a distinct 
key, however, the sort order is flexible: if a descending sort is cheaper, we 
should use a descending sort. E.g.:
{code}
C = foreach B { C1 = order A by $0 desc; C2 = C1.$0; C3 = distinct C2; generate group, C3; }
{code}
Both the order by and the distinct are on the same key, A.$0; however, the order 
by uses descending order. If we used descending A.$0 as the secondary key, we 
would be able to remove both the order by and the distinct. This is not the case 
now: we can only remove the order by and must leave the distinct. 

3. The main key has the same issue. The main key is the group key, so its sort 
order does not matter. However, the current implementation assumes ascending 
order. E.g.:
{code}
B = group A by (a0, a1);
C = foreach B { C1 = order A by a0 desc; generate group, C1; }
{code}
We use (a0, a1) as the main key, and the nested order by cannot be removed. 
However, if we reverse the sort order of the main key, then we can remove the 
order by.
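
To illustrate why the sort direction is flexible for a distinct key (limitation 
2), here is a small stand-alone Java sketch. It is not Pig code; the class and 
method names are made up for the illustration. Equal values end up adjacent in 
any sorted list, ascending or descending, so dropping adjacent duplicates gives 
a correct distinct either way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration; not part of Pig.
class DescendingDistinctSketch {

    // Drops adjacent duplicates. Equal values are adjacent in any sorted
    // list, ascending or descending, so the sort direction does not matter.
    static <T> List<T> dedupAdjacent(List<T> sorted) {
        List<T> out = new ArrayList<>();
        for (T v : sorted) {
            if (out.isEmpty() || !Objects.equals(out.get(out.size() - 1), v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Reference semantics of "order by desc, then distinct" on one bag.
    static List<Integer> orderDescThenDistinct(List<Integer> bag) {
        List<Integer> sorted = new ArrayList<>(bag);
        sorted.sort(Comparator.reverseOrder());
        return dedupAdjacent(sorted);
    }
}
```

If the values already arrive in descending order from the shuffle, only the 
dedupAdjacent step would remain, which is the saving described in limitation 2.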




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776729#action_12776729
 ] 

Pradeep Kamath commented on PIG-1038:
-

In JobControlCompiler:
==
{code}
jobConf.set("pig.secondarySortOrder",
        ObjectSerializer.serialize(mro.getSecondarySortOrder()));
}
{code}
It looks like the above should be set in the case of a non-order-by Mro that 
uses a secondary key.

{code}
valuea = ((Tuple)wa.getValueAsPigType()).get(0);
{code}
We should add a comment explaining that we extract the first field since it 
represents the actual group-by key.

In SecondaryKeyOptimizer:

{code}
} else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
{code}

The above should not contain POSplit, since POSplit would only occur after 
multi-query optimization, which happens later.

{code}
} else if (plan.getRoots().size() != 1) {
    // POLocalRearrange plan contains more than 1 root.
    // Probably there is an Expression operator in the local
    // rearrangement plan,
    // skip secondary key optimizing
    return null;
{code}
Should we do "continue nextPlan" instead of "return null" here, since this is 
similar to the udf or constant in local rearrange case?

{code}
columnChainInfo.insert(false, columns, DataType.TUPLE);
{code}
It would be useful to add a comment explaining that this is put into the 
ColumnChainInfo only to check that the different components of SortKeyInfo all 
come from the same input index. Also, should the datatype be BAG?

{code}
log.debug(node + " have more than 1 predecessor");
{code}
"predecessor" should change to "successor".

{code}
217 if (currentNode instanceof POPackage
218         || currentNode instanceof POFilter
219         || currentNode instanceof POLimit) {
{code}
In line 217 we should ensure that we don't optimize when we encounter a 
POJoinPackage, using something like
{code}
if ((currentNode instanceof POPackage && !(currentNode instanceof POJoinPackage))
{code}
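
The reason the extra check matters can be shown with stand-in classes. This is 
only a sketch: the operator types below are hypothetical placeholders, and it 
assumes POJoinPackage is a subclass of POPackage, which the suggested check 
implies. Since instanceof also matches subclasses, the plain test would accept 
the join variant unless it is excluded explicitly.

```java
// Stand-in operator types; the real Pig classes are not used here.
class InstanceofSketch {
    static class Operator {}
    static class PackageOp extends Operator {}
    // Assumption: mirrors POJoinPackage being a subclass of POPackage.
    static class JoinPackageOp extends PackageOp {}
    static class FilterOp extends Operator {}

    // Plain check: also accepts JoinPackageOp, because instanceof
    // matches subclasses.
    static boolean acceptsNaive(Operator op) {
        return op instanceof PackageOp || op instanceof FilterOp;
    }

    // Refined check in the shape suggested above: excludes the subclass.
    static boolean acceptsRefined(Operator op) {
        return (op instanceof PackageOp && !(op instanceof JoinPackageOp))
                || op instanceof FilterOp;
    }
}
```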

{code}
307 int errorCode = 1000;
327 int errorCode = 1000;
526 int errorCode = 1000;
{code}
This error code is already in use.

{code}
336 } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
337     List<PhysicalOperator> preds = mr.mapPlan
338             .getPredecessors(mapLeaf);
339     for (PhysicalOperator pred : preds) {
340         POLocalRearrange rearrange = (POLocalRearrange) pred;
341         rearrange.setUseSecondaryKey(true);
342         if (rearrange.getIndex() == indexOfRearrangeToChange) // Try to find the POLocalRearrange for the secondary key
351             setSecondaryPlan(mr.mapPlan, rearrange,
352                     secondarySortKeyInfo);
353     }
354 }
{code}
The above should not contain POSplit, since POSplit would only occur after 
multi-query optimization, which happens later.

Also in the if statement on line 342, what if the condition evaluates to false 
- should we throw an Exception like earlier in the same
method?

{code}
530 if (r)
531     sawInvalidPhysicalOper = true;
..
557 if (r) // if we saw physical operator other than project in sort plan
559     return;
{code}
At line 559, should we be setting sawInvalidPhysicalOper?

General comments:
=
A comment on ColumnChainInfo and SortKeyInfo explaining how they correspond to 
the POProjects in the plan would be useful.

POMultiQueryPackage should not change since SecondaryKeyOptimizer runs before
MultiQueryOptimizer.





[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776821#action_12776821
 ] 

Hadoop QA commented on PIG-1038:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424677/PIG-1038-4.patch
  against trunk revision 835005.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 7 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 209 javac compiler warnings (more 
than the trunk's current 199 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 320 release audit warnings 
(more than the trunk's current 318 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/console

This message is automatically generated.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-11 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776853#action_12776853
 ] 

Pradeep Kamath commented on PIG-1038:
-

Changes look good. One observation in SecondaryKeyOptimizer.java:
{code}
if (r) // if we saw physical operator other than project in sort plan
    return;
{code}
Should we be setting sawInvalidPhysicalOper here?

Other than that, +1. Please commit after making any change required for the 
above.





[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775969#action_12775969
 ] 

Ashutosh Chauhan commented on PIG-1038:
---

Another place where Hadoop's secondary sort is useful in Pig is sorting the 
index entries for Merge Join. In the indexing job of Merge Join, index entries 
sampled from the map tasks are grouped in one reduce task, where they are sorted 
before being written to disk. Currently Pig does the sorting, but Hadoop's 
secondary sort can be used instead. This may not yield much of a performance 
gain, since the index is small in any case, but it may be a good test case for 
the secondary key optimization. This depends on how you are discovering the 
pattern, as I asked in my previous question. If there is a POSort immediately 
following a POPackage or POJoinPackage in the reducer, and some other conditions 
are met, we can apply the secondary key sorting optimization.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775982#action_12775982
 ] 

Ashutosh Chauhan commented on PIG-1038:
---

Note that in the use case I mentioned, sorting index entries (which, by the way, 
can appear in a user query as well), there is no POForEach, but secondary sort 
can still be applied and will be useful.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-10 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776175#action_12776175
 ] 

Daniel Dai commented on PIG-1038:
-

Hi, Ashutosh,
Good point. However, it will be hard to get this in for this release due to the 
schedule. You can open a separate Jira and we can address it afterward.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774926#action_12774926
 ] 

Hadoop QA commented on PIG-1038:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424332/PIG-1038-2.patch
  against trunk revision 833549.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 205 javac compiler warnings (more 
than the trunk's current 199 warnings).

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

-1 release audit.  The applied patch generated 319 release audit warnings 
(more than the trunk's current 317 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/console

This message is automatically generated.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774753#action_12774753
 ] 

Hadoop QA commented on PIG-1038:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424289/PIG-1038-1.patch
  against trunk revision 833549.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 207 javac compiler warnings (more 
than the trunk's current 199 warnings).

-1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.

-1 release audit.  The applied patch generated 319 release audit warnings 
(more than the trunk's current 317 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/console

This message is automatically generated.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772562#action_12772562
 ] 

Daniel Dai commented on PIG-1038:
-

Hi, Ashutosh,
I will look into POForeach, find the first nested sort or distinct, and use 
that sort/distinct key as the secondary sort key for this map-reduce job, so 
that I can remove or simplify the nested sort/distinct.

Yes, we definitely need a framework for the map-reduce layer as well. We will 
work on that, and we welcome any suggestions and comments.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772774#action_12772774
 ] 

Alan Gates commented on PIG-1038:
-

I agree that we need a framework for optimizations in the backend.  I'm hoping 
we can reuse the framework from the front end.  However, there's some cleanup 
we'd still like to do on the LogicalOptimizer before we use it as a template 
for a MapReduceOptimizer.  But I agree that's where we need to go.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404
 ] 

Ashutosh Chauhan commented on PIG-1038:
---

I think it's a useful optimization. I presume this will be implemented as a 
visitor in MapReduceLauncher that visits the compiled MR plan. The design looks 
good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.
How are you planning to discover this? By relying on some pattern, such as an LR 
in the map plan followed by POPackage, POForeach, POSort in the reduce plan?

This is somewhat orthogonal but related to this issue. We have a rule-based 
optimizer framework in the front end; it seems to me that a similar optimizer 
framework is required in the backend too, both to refactor all the optimizer 
visitors we currently have and to make it easy to add similar optimizations in 
the future. There are seven optimizations in the front end expressed through 
rules. On the other hand, after the addition of this one we will have nine 
optimization visitors in the backend. Maybe we can think about this to avoid a 
lot of rework every time such an optimization is added.




[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-10-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772085#action_12772085
 ] 

Daniel Dai commented on PIG-1038:
-

Here is the design for this optimization:
1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will
1.1 Discover if we use sort/distinct in nested foreach plan. 
1.2 For the first such sort/distinct, use the sort/distinct key as the 
secondary key
1.3 Once SecondaryKeyOptimizer discovers a secondary key, it will call 
POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct

2. Change POLocalRearrange
2.1 Add setSecondaryPlan so that SecondaryKeyOptimizer has a way to set the 
secondary plan
2.2 Change constructLROutput to make a compound key, which is a tuple: (key, 
secondaryKey)
2.3 We need to duplicate the logic that strips the key from the values for the 
secondary key as well

3. Change POPackageAnnotator to patch POPackage with the keyinfo from both key 
and secondaryKey

4. Change POPackage to stitch the secondary key back into the value

5. Change MapReduceOper to indicate that the map-reduce operator needs a 
secondary key; JobControlCompiler will then set OutputValueGroupingComparator 
to use the mainKeyComparator

6. Add mainKeyComparator, which inherits PigNullableWritable and compares only 
the main key. We need that for the OutputValueGroupingComparator
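
The interplay of steps 2.2, 5 and 6 can be modeled in a few lines of plain 
Java. This is a rough sketch under stated assumptions, not the patch itself: 
CompoundKeySketch, CompoundKey, SORT_ORDER, GROUPING and reduceGroups are all 
hypothetical names, and a String/int pair stands in for Pig's key types. The 
sort comparator orders by (key, secondaryKey), while the grouping comparator 
looks at the main key only, so one reduce group holds all secondary keys for a 
main key, already in sorted order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the design; these are not Pig or Hadoop classes.
class CompoundKeySketch {

    // Models the (key, secondaryKey) tuple from step 2.2.
    static final class CompoundKey {
        final String mainKey;
        final int secondaryKey;

        CompoundKey(String mainKey, int secondaryKey) {
            this.mainKey = mainKey;
            this.secondaryKey = secondaryKey;
        }
    }

    // Sort order for the shuffle: main key first, then secondary key.
    static final Comparator<CompoundKey> SORT_ORDER =
            Comparator.comparing((CompoundKey k) -> k.mainKey)
                    .thenComparingInt(k -> k.secondaryKey);

    // Grouping order: main key only, the role of mainKeyComparator in step 6.
    static final Comparator<CompoundKey> GROUPING =
            Comparator.comparing((CompoundKey k) -> k.mainKey);

    // Sorts with SORT_ORDER, then starts a new group whenever GROUPING says
    // the main key changed, as a reducer would see the key stream.
    static List<List<CompoundKey>> reduceGroups(List<CompoundKey> keys) {
        List<CompoundKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        List<List<CompoundKey>> groups = new ArrayList<>();
        CompoundKey previous = null;
        for (CompoundKey k : sorted) {
            if (previous == null || GROUPING.compare(previous, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            previous = k;
        }
        return groups;
    }
}
```

If the grouping comparator compared the full compound key instead, every 
distinct (key, secondaryKey) pair would form its own reduce group, which is why 
a separate main-key comparator is needed.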
