[ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pradeep Kamath updated PIG-841:
-------------------------------

    Summary: PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort  (was: PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort)

> PERFORMANCE: The sample MR job in order by (or joins which require sampling)
> implementation can use Hadoop sorting instead of doing a POSort
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-841
>                 URL: https://issues.apache.org/jira/browse/PIG-841
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>             Fix For: 0.3.0
>
>
> Currently the sample map reduce job in order by implementation does the
> following:
> - sample 100 records from each map
> - group all on the above output
> - sort the output bag from the above grouping on keys of the order by
> - give the sorted bag to FindQuantiles udf
> The steps 2 and 3 above can be replaced by
> - group the sample output by the order by key and set parallelism of the
> group to 1 so that output of the group goes to one reducer. Since Hadoop
> ensures the output of the group is sorted by key we get sorting for free
> without using POSort
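
The two plans described in the issue can be compared with a small simulation. What follows is a minimal Python sketch, not Pig's actual implementation. It assumes the sampled records are bare sort keys. `simulated_shuffle` stands in for Hadoop's sort-by-key on the way to a single reducer, and `find_quantiles` is a hypothetical stand-in for the FindQuantiles udf, whose real logic is not described in this issue. All function names here are made up for illustration.

```python
import random
from itertools import groupby


def sample_per_map(splits, n=100, seed=0):
    """Step 1: take up to n sample keys from each map's input split."""
    rng = random.Random(seed)
    samples = []
    for split in splits:
        samples.extend(rng.sample(split, min(n, len(split))))
    return samples


def current_plan(samples):
    """Current plan: 'group all' into one bag, then sort that bag explicitly."""
    bag = list(samples)   # one bag holding every sample, in one reducer
    return sorted(bag)    # models the explicit POSort step


def simulated_shuffle(keys):
    """Stand-in for the shuffle to ONE reducer (parallelism 1).

    Assumption: the framework delivers keys to a reducer in sorted order,
    so the sort below models the framework, not an operator in the plan.
    """
    for key, group in groupby(sorted(keys)):
        yield key, list(group)


def proposed_plan(samples):
    """Proposed plan: group by the order-by key; keys arrive already sorted."""
    ordered = []
    for _key, values in simulated_shuffle(samples):
        ordered.extend(values)   # no explicit sort in the reducer
    return ordered


def find_quantiles(sorted_keys, num_partitions):
    """Hypothetical stand-in for FindQuantiles: evenly spaced boundary keys."""
    if num_partitions < 2 or not sorted_keys:
        return []
    step = len(sorted_keys) / num_partitions
    return [sorted_keys[int(i * step)] for i in range(1, num_partitions)]


if __name__ == "__main__":
    splits = [[random.randint(0, 10_000) for _ in range(1_000)] for _ in range(4)]
    samples = sample_per_map(splits)
    print(find_quantiles(current_plan(samples), 4))
    print(find_quantiles(proposed_plan(samples), 4))
```

Both plans hand the same sorted key list to the quantile step. The difference is where the ordering comes from: in the proposed plan it comes from the framework's shuffle, so the reducer does no explicit sort.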