[ 
https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704815#action_12704815
 ] 

Dick King commented on PIG-791:
-------------------------------

I am considering a modification to Hadoop that would allow users to designate 
that a map/reduce output (a) is sorted, (b) is likely to be the input to some 
other map/reduce in which selected keys are re-emitted unchanged by the second 
mapper, with a selection probability that is not correlated with the ordering, 
and (c) uses the same sort order as the second map/reduce.

It would work by writing a sample file as a secondary output of the mapper in 
the first map/reduce.
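
As a rough illustration only (this is not Hadoop or Pig code, and the names 
reservoir_sample, split_points and partition are hypothetical), the sketch 
below shows how a sample of keys might be collected alongside the main output 
and later turned into range boundaries for a balanced sort. It assumes keys 
are comparable and that the sample is non-empty.

```python
import bisect
import random


def reservoir_sample(keys, k, seed=0):
    """Keep a uniform random sample of up to k keys from a stream (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, key in enumerate(keys):
        if i < k:
            sample.append(key)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = key
    return sample


def split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundaries at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)  # assumes a non-empty sample
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, boundaries):
    """Return the reducer index for key: the number of boundaries <= key."""
    return bisect.bisect_right(boundaries, key)
```

In this sketch the second job would only need to read the small sample, call 
split_points once, and route each key with partition, so no separate sampling 
pass over the full data is required.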

This proposal is on my back burner, but could potentially be moved up.

Would that functionality be generally useful here?




> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> When an order by is not the only operation in a pig script, it is done in two 
> additional MR jobs. The first job samples using a sampling loader, the second 
> does the sort. The sample is used to construct a partitioner that equally 
> balances the data in the sort. The sampler needs to be changed to be an 
> EvalFunc instead of a loader. This way a split can be put in the preceding 
> MR job, with the main data being written out and the other part flowing to 
> the sampler func, which can then write out the sample. The final MR job can 
> then be the sort. 
> This change depends on multiquery code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
