[ 
https://issues.apache.org/jira/browse/PIG-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759815#action_12759815
 ] 

Alan Gates commented on PIG-980:
--------------------------------

A common pattern for Pig Latin scripts is:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    ...
}
{code}

Currently Pig executes this by using POSort on the reduce side, which collects 
all of the records out of the bag produced by POPackage into
a SortedBag.  If this bag is large, it will spill to disk both while POPackage 
collects it and while POSort sorts it.

None of this is necessary, however.  Hadoop allows users to specify a sort order 
for data going to the reducer in addition to a partition
key.  This can be done by defining the Comparator for the job to compare all 
the fields you want sorted, and the Partitioner to only look
at the field you want to partition on.  So in this case the partitioner would 
be set to look at $0, and the comparator at $0 and $1.
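
As a rough illustration of the idea, the sketch below simulates the shuffle in 
plain Java with no Hadoop dependency.  It is not Pig or Hadoop code: the class 
SecondarySortSketch, the Rec type, and the partition and shuffle helpers are 
made-up names for this example.  The only point is that partitioning on $0 
while sorting on ($0, $1) delivers each group's records already ordered by $1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Hypothetical map output record: field0 stands in for $0, field1 for $1.
    static final class Rec {
        final String field0;
        final int field1;

        Rec(String field0, int field1) {
            this.field0 = field0;
            this.field1 = field1;
        }

        @Override
        public String toString() {
            return "(" + field0 + "," + field1 + ")";
        }
    }

    // Partition on $0 only, so every record of a group reaches the same reducer.
    static int partition(Rec r, int numReducers) {
        return (r.field0.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort on ($0, $1), so within a group the records are ordered by $1.
    static final Comparator<Rec> SORT_ORDER =
            Comparator.comparing((Rec r) -> r.field0).thenComparingInt(r -> r.field1);

    // Simulated shuffle: route each record to a reducer, then sort each reducer's input.
    static List<List<Rec>> shuffle(List<Rec> mapOutput, int numReducers) {
        List<List<Rec>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (Rec r : mapOutput) {
            reducers.get(partition(r, numReducers)).add(r);
        }
        for (List<Rec> input : reducers) {
            input.sort(SORT_ORDER);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = Arrays.asList(
                new Rec("a", 3), new Rec("b", 2), new Rec("a", 1),
                new Rec("b", 1), new Rec("a", 2));
        List<List<Rec>> reducers = shuffle(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

In a real job the same split would presumably be expressed through the job's 
partitioner and sort comparator settings rather than an in-memory list.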

Beyond avoiding unnecessary sorts and spills, this will also allow us to use 
the proposed Accumulator interface (see PIG-979) for these types
of scripts.
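
To illustrate why pre-sorted input matters here, the sketch below feeds a 
group's values one at a time to a function that keeps only constant state.  
The ValueAccumulator interface is a hypothetical stand-in invented for this 
example and is not the interface proposed in PIG-979; MaxGap and consume are 
likewise illustrative names.  MaxGap gives the right answer only if the values 
arrive in sorted order, which is the property the optimization would provide.

```java
import java.util.Arrays;
import java.util.Iterator;

public class StreamingGroupSketch {
    // Hypothetical interface (not the PIG-979 API): values are fed one at a time.
    interface ValueAccumulator {
        void accumulate(int value);

        int result();
    }

    // Largest difference between consecutive values; assumes ascending input.
    static final class MaxGap implements ValueAccumulator {
        private boolean seenAny = false;
        private int previous;
        private int maxGap = 0;

        @Override
        public void accumulate(int value) {
            if (seenAny) {
                maxGap = Math.max(maxGap, value - previous);
            }
            previous = value;
            seenAny = true;
        }

        @Override
        public int result() {
            return maxGap;
        }
    }

    // Streams the values through the accumulator without building a bag.
    static int consume(Iterator<Integer> sortedValues, ValueAccumulator acc) {
        while (sortedValues.hasNext()) {
            acc.accumulate(sortedValues.next());
        }
        return acc.result();
    }

    public static void main(String[] args) {
        Iterator<Integer> values = Arrays.asList(1, 2, 7, 9).iterator();
        System.out.println(consume(values, new MaxGap()));
    }
}
```

Because the accumulator never holds more than the previous value, nothing in 
this style of processing needs to be collected into a bag or spilled.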


> Optimizing nested order bys
> ---------------------------
>
>                 Key: PIG-980
>                 URL: https://issues.apache.org/jira/browse/PIG-980
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
>            Assignee: Ying He
>
> Pig needs to take advantage of secondary sort in Hadoop to optimize nested 
> order bys.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
