[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777589#action_12777589 ]

Daniel Dai commented on PIG-1038:

Continuing from the last comment:
4. Strip secondary keys from the value.
5. Write a byte version of OutputKeyComparator.

Optimize nested distinct/sort to use secondary key

Key: PIG-1038
URL: https://issues.apache.org/jira/browse/PIG-1038
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
Fix For: 0.6.0
Attachments: PIG-1038-1.patch, PIG-1038-2.patch, PIG-1038-3.patch, PIG-1038-4.patch, PIG-1038-5.patch

If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Eg1:
    A = load 'mydata';
    B = group A by $0;
    C = foreach B {
        D = order A by $1;
        generate group, D;
    }
    store C into 'myresult';

We can specify a secondary sort on A.$1, and drop order A by $1.

Eg2:
    A = load 'mydata';
    B = group A by $0;
    C = foreach B {
        D = A.$1;
        E = distinct D;
        generate group, E;
    }
    store C into 'myresult';

We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct D to a special version of distinct, which does not do the sorting.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777318#action_12777318 ]

Daniel Dai commented on PIG-1038:

A couple of limitations of the current implementation, which will be addressed later:

1. If the sort is not nested inside a foreach plan, it will not be optimized. That is the issue with merge join indexing that Ashutosh found.

2. All distinct keys are assumed to be sorted in ascending order. For a distinct key the sort order is actually flexible; if a descending sort is cheaper, we should use a descending sort. Eg:
{code}
C = foreach B {
    C1 = order A by $0 desc;
    C2 = C1.$0;
    C3 = distinct C2;
    generate group, C3;
}
{code}
Both order by and distinct are on the same key, A.$0; however, order by uses descending order. If we used descending A.$0 as the secondary key, we should be able to remove both order by and distinct. This is not the case now: we can only remove order by and have to leave distinct.

3. The main key has the same issue. The main key is the group key, so its order does not matter. However, the current implementation assumes ascending order. Eg:
{code}
B = group A by (a0, a1);
C = foreach B {
    C1 = order A by a0 desc;
    generate group, C1;
}
{code}
We use (a0, a1) as the main key, and the nested order by cannot be removed. However, if we reverse the sort order of the main key, then we can remove the order by.
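To illustrate why the sort direction does not matter for distinct (limitation 2 above), here is a minimal sketch, not taken from the patch. It assumes only that the values reach the reducer already sorted on the distinct key, in either direction, so that equal values are adjacent. The class and method names (SortedDistinctSketch, distinctAdjacent) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration; these names do not come from the Pig code base.
class SortedDistinctSketch {
    // Removes duplicates from input in which equal elements are adjacent.
    // The input may be sorted ascending or descending: only adjacency matters,
    // so no additional sorting or hashing is needed.
    static <T> List<T> distinctAdjacent(Iterable<T> sortedInput) {
        List<T> result = new ArrayList<>();
        T previous = null;
        boolean first = true;
        for (T value : sortedInput) {
            // Emit a value only when it differs from the one just before it.
            if (first || !Objects.equals(previous, value)) {
                result.add(value);
            }
            previous = value;
            first = false;
        }
        return result;
    }
}
```

Because the check only looks at the previous element, a descending secondary key would serve the nested distinct just as well as an ascending one.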
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776729#action_12776729 ]

Pradeep Kamath commented on PIG-1038:

In JobControlCompiler:
======================
{code}
583 jobConf.set("pig.secondarySortOrder",
584     ObjectSerializer.serialize(mro.getSecondarySortOrder()));
585 }
{code}
It looks like the above should be set in the case of a non order-by Mro which uses a secondary key.

{code}
638 valuea = ((Tuple)wa.getValueAsPigType()).get(0);
{code}
We should put a comment explaining that we extract the first field since that represents the actual group by key.

In SecondaryKeyOptimizer:
=========================
{code}
} else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
{code}
The above should not contain POSplit since POSplit would only occur after multi query optimization, which happens later.

{code}
94 } else if (plan.getRoots().size() != 1) {
95     // POLocalRearrange plan contains more than 1 root.
96     // Probably there is an Expression operator in the local
97     // rearrangement plan,
98     // skip secondary key optimizing
99     return null;
{code}
Should we do "continue nextPlan" instead of "return null" here, since this is similar to the udf or constant in local rearrange case?

{code}
105 columnChainInfo.insert(false, columns, DataType.TUPLE);
{code}
It would be useful to put a comment explaining that this is put into the ColumnChainInfo only for checking that the different components of SortKeyInfo all come from the same input index. Also, should the datatype be BAG?

{code}
118 log.debug(node + " have more than 1 predecessor");
{code}
"predecessor" should change to "successor".

{code}
217 if (currentNode instanceof POPackage
218         || currentNode instanceof POFilter
219         || currentNode instanceof POLimit) {
{code}
In line 217 we should ensure that we don't optimize when we encounter POJoinPackage, using something like
{code}
if (currentNode instanceof POPackage && !(currentNode instanceof POJoinPackage))
{code}

{code}
307 int errorCode = 1000;
327 int errorCode = 1000;
526 int errorCode = 1000;
{code}
This error code is already in use.

{code}
336 } else if (mapLeaf instanceof POUnion || mapLeaf instanceof POSplit) {
337     List<PhysicalOperator> preds = mr.mapPlan
338             .getPredecessors(mapLeaf);
339     for (PhysicalOperator pred : preds) {
340         POLocalRearrange rearrange = (POLocalRearrange) pred;
341         rearrange.setUseSecondaryKey(true);
342         if (rearrange.getIndex() == indexOfRearrangeToChange) // Try
343                                                               // to
344                                                               // find
345                                                               // the
346                                                               // POLocalRearrange
347                                                               // for
348                                                               // the
349                                                               // secondary
350                                                               // key
351             setSecondaryPlan(mr.mapPlan, rearrange,
352                     secondarySortKeyInfo);
353     }
354 }
{code}
The above should not contain POSplit since POSplit would only occur after multi query optimization, which happens later. Also, in the if statement on line 342, what if the condition evaluates to false? Should we throw an Exception like earlier in the same method?

{code}
530 if (r)
531     sawInvalidPhysicalOper = true;
..
557 if (r) // if we saw physical operator other than project in sort
558        // plan
559     return;
{code}
At line 559, should we be setting sawInvalidPhysicalOper?

General comments:
=================
A comment on ColumnChainInfo and SortKeyInfo explaining how they track to POProjects in the plan would be useful.

POMultiQueryPackage should not change since SecondaryKeyOptimizer runs before MultiQueryOptimizer.
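As a side note on the POJoinPackage remark in the review above: a plain instanceof test also matches subclasses, which is why the suggested condition excludes the subclass explicitly. The sketch below uses hypothetical stand-in classes rather than the real Pig operators, and assumes (as the suggested check implies) that POJoinPackage is a subclass of POPackage.

```java
// Hypothetical stand-ins for the Pig physical operators; only the
// inheritance relationship matters for this illustration.
class PackageCheckSketch {
    static class PhysicalOperator {}
    static class POPackage extends PhysicalOperator {}
    // Assumption: POJoinPackage extends POPackage.
    static class POJoinPackage extends POPackage {}
    static class POFilter extends PhysicalOperator {}

    // True only for a POPackage that is not a POJoinPackage.
    // "op instanceof POPackage" alone would also accept POJoinPackage.
    static boolean isPlainPackage(PhysicalOperator op) {
        return op instanceof POPackage && !(op instanceof POJoinPackage);
    }
}
```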
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776821#action_12776821 ]

Hadoop QA commented on PIG-1038:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424677/PIG-1038-4.patch
against trunk revision 835005.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 7 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 209 javac compiler warnings (more than the trunk's current 199 warnings).
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 320 release audit warnings (more than the trunk's current 318 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/44/console

This message is automatically generated.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776853#action_12776853 ]

Pradeep Kamath commented on PIG-1038:

Changes look good. One observation, in SecondaryKeyOptimizer.java:
{code}
if (r) // if we saw physical operator other than project in sort
       // plan
    return;
{code}
Should we be setting sawInvalidPhysicalOper here?

Other than that, +1. Please commit after making any change required for the above.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775969#action_12775969 ]

Ashutosh Chauhan commented on PIG-1038:

Another place where Hadoop's secondary sort is useful in Pig is sorting the index entries for Merge Join. In the indexing job of Merge Join, index entries sampled from the map tasks are grouped in one reduce task, where they are sorted before being written to disk. Currently Pig does the sorting, but Hadoop's secondary sort can be used instead. This may not yield much performance gain, since the index is small in any case, but it may be a good test case for the secondary key optimization.

This depends on how you are discovering the pattern, as I asked in my previous question. If there is a POSort immediately following POPackage or POJoinPackage in the reducer, and some other conditions are met, we can apply the secondary key sorting optimization.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775982#action_12775982 ]

Ashutosh Chauhan commented on PIG-1038:

Note that in the use case I mentioned, sorting index entries (btw, it can appear in a user query as well), there is no POForEach, but secondary sort can still be applied and will be useful.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776175#action_12776175 ]

Daniel Dai commented on PIG-1038:

Hi, Ashutosh,

Good point. However, it will be hard to get this in for this release because of the schedule. Please open a separate Jira and we can address it afterward.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774926#action_12774926 ]

Hadoop QA commented on PIG-1038:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424332/PIG-1038-2.patch
against trunk revision 833549.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 205 javac compiler warnings (more than the trunk's current 199 warnings).
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
-1 release audit. The applied patch generated 319 release audit warnings (more than the trunk's current 317 warnings).
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/145/console

This message is automatically generated.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774753#action_12774753 ]

Hadoop QA commented on PIG-1038:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424289/PIG-1038-1.patch
against trunk revision 833549.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 207 javac compiler warnings (more than the trunk's current 199 warnings).
-1 findbugs. The patch appears to introduce 3 new Findbugs warnings.
-1 release audit. The applied patch generated 319 release audit warnings (more than the trunk's current 317 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/143/console

This message is automatically generated.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772562#action_12772562 ]

Daniel Dai commented on PIG-1038:

Hi, Ashutosh,

I will look into POForeach, find the first nested sort or distinct, and use that sort/distinct key as the secondary sort key for this map-reduce job, so that I can remove or simplify the nested sort/distinct.

Yes, we definitely need a framework for the map-reduce layer as well. We will work on that, and welcome any suggestions and comments.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772774#action_12772774 ]

Alan Gates commented on PIG-1038:

I agree that we need a framework for optimizations in the backend. I'm hoping we can reuse the framework from the front end. However, there's some cleanup we'd still like to do on the LogicalOptimizer before we use it as a template for a MapReduceOptimizer. But I agree that's where we need to go.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772404#action_12772404 ]

Ashutosh Chauhan commented on PIG-1038:

I think it's a useful optimization. I presume this will be implemented as a visitor in MapReduceLauncher which visits the compiled MR plan. The design looks good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.

How are you planning to discover this? By depending on some pattern, like LR in the map plan followed by POPackage, POForeach, POSort in the reduce plan?

Somewhat orthogonal, but related to this issue: we have a rule-based optimizer framework in the front end, and it seems to me that a similar optimizer framework is required in the backend too, both to refactor all the optimizer visitors we currently have and to make it easy to add similar optimizations in the future. There are seven optimizations in the front end expressed through rules. On the other hand, after the addition of this one we will have nine optimization visitors in the backend. Maybe we can think about this, to avoid a lot of rework every time such an optimization is added.
[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772085#action_12772085 ]

Daniel Dai commented on PIG-1038:

Here is the design for this optimization:

1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will:
1.1 Discover whether we use sort/distinct in a nested foreach plan.
1.2 For the first such sort/distinct, use the sort/distinct key as the secondary key.
1.3 Once SecondaryKeyOptimizer discovers the secondary key, it will call POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct.

2. Change POLocalRearrange:
2.1 Add setSecondaryPlan to give SecondaryKeyOptimizer a way to set the secondary plan.
2.2 Change constructLROutput to make a compound key, which is a tuple: (key, secondaryKey).
2.3 We need to duplicate the logic that strips the key from the values for the secondary key as well.

3. Change POPackageAnnotator to patch POPackage with the keyinfo from both key and secondaryKey.

4. Change POPackage to stitch the secondary key to the value.

5. Change MapReduceOper to indicate that the map-reduce operator needs a secondary key; JobControlCompiler will then set OutputValueGroupingComparator to use the mainKeyComparator.

6. Add mainKeyComparator, which inherits PigNullableWritable and compares only the main key. We need that for the OutputValueGroupingComparator.
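The interplay between the compound key (step 2.2) and the main-key-only grouping comparator (steps 5 and 6) can be sketched in plain Java. This is a simplified simulation under stated assumptions, not the Pig or Hadoop implementation: the shuffle is modelled as an in-memory sort, and all names (SecondarySortSketch, CompoundKey, SORT, GROUP, shuffle) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from the Pig patch.
class SecondarySortSketch {
    // Compound key emitted on the map side: (main key, secondary key).
    static final class CompoundKey {
        final String main;
        final int secondary;
        CompoundKey(String main, int secondary) {
            this.main = main;
            this.secondary = secondary;
        }
    }

    // Sort comparator: orders by main key first, then by secondary key.
    static final Comparator<CompoundKey> SORT =
        Comparator.comparing((CompoundKey k) -> k.main)
                  .thenComparingInt(k -> k.secondary);

    // Grouping comparator: compares the main key only, so records that
    // differ only in the secondary key land in the same reduce group.
    static final Comparator<CompoundKey> GROUP =
        Comparator.comparing((CompoundKey k) -> k.main);

    // Simulates the shuffle: sort with SORT, then cut groups with GROUP.
    // Within each group the secondary keys come out already sorted.
    static Map<String, List<Integer>> shuffle(List<CompoundKey> mapOutput) {
        List<CompoundKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompoundKey previous = null;
        List<Integer> current = null;
        for (CompoundKey key : sorted) {
            // A new group starts whenever the main key changes.
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.main, current);
            }
            current.add(key.secondary);
            previous = key;
        }
        return groups;
    }
}
```

Because each group's values arrive in secondary-key order, the reducer no longer has to sort them itself, which is what allows the nested order by to be dropped in Eg1.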