[ https://issues.apache.org/jira/browse/PIG-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759815#action_12759815 ]
Alan Gates commented on PIG-980:
--------------------------------

A common pattern for Pig Latin scripts is:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    ...
}
{code}

Currently Pig executes this by using POSort on the reduce side, which collects all of the records out of the bag produced by POPackage into a SortedBag. If this bag is large, it will spill twice: once as part of POPackage collecting it and again as part of POSort sorting it.

None of this is necessary, however. Hadoop allows users to specify a sort order for data going to the reducer in addition to a partition key. This can be done by defining the Comparator for the job to compare all the fields you want sorted, and the Partitioner to look only at the field you want to partition on. So in this case the partitioner would be set to look at $0, and the comparator at $0 and $1.

Beyond avoiding unnecessary sorts and spills, this will also allow us to use the proposed Accumulator interface (see PIG-979) for these types of scripts.

> Optimizing nested order bys
> ---------------------------
>
>                 Key: PIG-980
>                 URL: https://issues.apache.org/jira/browse/PIG-980
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Alan Gates
>            Assignee: Ying He
>
> Pig needs to take advantage of secondary sort in Hadoop to optimize nested
> order bys.
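
The partitioner/comparator split described in the comment can be illustrated with a small simulation. This is only a sketch in plain Python, not Pig or Hadoop code: the function names (partition, shuffle, reduce_partition, nested_order_by), the CRC32 hash, and the sample records are all assumptions made for illustration. The reducer-side grouping on $0 alone is also an assumption of the sketch, since the comment does not say how that step would be done.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(record, num_reducers):
    """Partitioner: looks only at $0, so every record of a group
    is routed to the same (simulated) reducer."""
    key_bytes = str(record[0]).encode("utf-8")
    return zlib.crc32(key_bytes) % num_reducers


def shuffle(records, num_reducers):
    """Stand-in for the shuffle: route by $0, then sort each
    partition with a comparator over ($0, $1)."""
    partitions = [[] for _ in range(num_reducers)]
    for record in records:
        partitions[partition(record, num_reducers)].append(record)
    for part in partitions:
        part.sort(key=itemgetter(0, 1))  # comparator: $0 first, then $1
    return partitions


def reduce_partition(part):
    """Group on $0 only. Each group's records arrive already ordered
    by $1, so no reduce-side sort (and no SortedBag) is needed."""
    for key, group in groupby(part, key=itemgetter(0)):
        yield key, [record[1] for record in group]


def nested_order_by(records, num_reducers=2):
    """Simulate 'group A by $0' followed by a nested 'order A by $1'."""
    result = {}
    for part in shuffle(records, num_reducers):
        for key, values in reduce_partition(part):
            result[key] = values
    return result


# Hypothetical sample data: ($0, $1) pairs.
records = [("x", 3), ("y", 2), ("x", 1), ("y", 5), ("x", 2)]
print(nested_order_by(records))
```

Because each partition is sorted before grouping, the values for a key can be consumed one at a time in $1 order, which is the property that would let a streaming interface such as the proposed Accumulator process them without first materializing the whole bag.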