[ https://issues.apache.org/jira/browse/PIG-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704815#action_12704815 ]
Dick King commented on PIG-791:
-------------------------------

I am considering a modification to Hadoop that would allow users to designate that a map/reduce output is
  a: sorted,
  b: likely to be the input to some other map/reduce where selected keys are re-emitted by the second mapper unchanged, with a probability that is not correlated with the ordering, and
  c: sorted in the same order that is used in the second map/reduce.

It would work by writing a sample file as a secondary output of the mapper in the first map/reduce.

This proposal is on my back burner, but could potentially be moved up. Would that functionality be generally useful here?

> reducing number of MR stages with ORDER BY
> ------------------------------------------
>
>                 Key: PIG-791
>                 URL: https://issues.apache.org/jira/browse/PIG-791
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>
> When an order by is not the only operation in a pig script, it is done in two
> additional MR jobs. The first job samples using a sampling loader, the second
> does the sort. The sample is used to construct a partitioner that equally
> balances the data in the sort. The sampler needs to be changed to be an
> EvalFunc instead of a loader. This way a split can be put in the preceding
> MR job, with the main data being written out and the other part flowing to
> the sampler func, which can then write out the sample. The final MR job can
> then be the sort.
> This change depends on multiquery code.
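
For readers following the thread, here is a rough sketch of the idea in the issue description: one pass writes out the main data while also collecting a sample, and the sample is then used to pick cut points for a partitioner that balances the sort. This is not Pig or Hadoop code. The function names (`split_and_sample`, `choose_cut_points`, `partition`) are hypothetical, and reservoir sampling is an assumption about how the sampler might work; the issue does not say which sampling method is used.

```python
import bisect
import random


def split_and_sample(records, k, rng):
    """Single pass over the records: keep every record as the main output
    and, on the side, maintain a uniform sample of at most k records
    (reservoir sampling, an assumed choice of sampler)."""
    main_output = []
    sample = []
    for i, rec in enumerate(records):
        main_output.append(rec)          # the "main data being written out"
        if i < k:
            sample.append(rec)           # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = rec          # replace a random reservoir slot
    return main_output, sample


def choose_cut_points(sample, num_reducers):
    """Pick num_reducers - 1 quantile keys from the sample."""
    ordered = sorted(sample)
    if not ordered or num_reducers < 2:
        return []
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def partition(key, cut_points):
    """Map a key to a reducer index so that reducer i only receives keys
    that sort no later than those sent to reducer i + 1."""
    return bisect.bisect_right(cut_points, key)


if __name__ == "__main__":
    rng = random.Random(0)
    keys = list(range(10000))
    rng.shuffle(keys)
    main_output, sample = split_and_sample(keys, 200, rng)
    cuts = choose_cut_points(sample, 4)
    counts = [0, 0, 0, 0]
    for key in main_output:
        counts[partition(key, cuts)] += 1
    print(cuts, counts)
```

Because each reducer receives a contiguous key range, concatenating the sorted reducer outputs in index order gives a total order; how evenly the ranges are filled depends on how well the sample represents the data.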