POSort is only used for sorting bags in memory (such as a sort inside a foreach), not for top-level sorts. In both cases the physical operators capture only part of the actual operation, since much of the work is done by the Hadoop framework.

Very briefly, order by works by taking a sample of the input, building a partitioner that will produce a balanced total ordering of the data (that is, each part file will be approximately the same size), and then running an MR job that uses the order by key as the grouping key along with the just-built partitioner. Limit works by applying the limit in each mapper and then running a reduce pass in a single reducer, again applying the limit.
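To make the two steps above concrete, here is a minimal in-memory sketch in Python. It only simulates the idea (sample, build range boundaries, partition, sort each partition; limit per mapper, then limit again in one reducer). It is not Pig's or Hadoop's actual code, and the function names, the sample size, and the quantile-based choice of boundaries are all assumptions made for illustration.

```python
import bisect
import random


def sample_boundaries(records, key, num_reducers, sample_size=100, seed=0):
    """Pick up to num_reducers - 1 split points from a random sample of keys."""
    rng = random.Random(seed)
    picked = rng.sample(records, min(sample_size, len(records)))
    sample = sorted(key(r) for r in picked)
    if not sample:
        return []
    # Evenly spaced quantiles of the sample become the partition boundaries
    # (an assumed strategy, chosen so partitions come out roughly equal).
    return [sample[len(sample) * i // num_reducers]
            for i in range(1, num_reducers)]


def total_order_sort(records, key, num_reducers=4):
    """Range-partition records by key, then sort each partition on its own."""
    boundaries = sample_boundaries(records, key, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for r in records:
        # "Partitioner": route each record to the reducer owning its key range.
        partitions[bisect.bisect_right(boundaries, key(r))].append(r)
    # "Reduce side": each reducer sees its keys in sorted order, so the
    # part files read in sequence form one total ordering.
    return [sorted(p, key=key) for p in partitions]


def limit(map_splits, n):
    """Apply the limit in every mapper, then again in a single reducer."""
    per_mapper = [split[:n] for split in map_splits]      # map-side limit
    merged = [r for part in per_mapper for r in part]     # single reducer input
    return merged[:n]                                     # reduce-side limit


if __name__ == "__main__":
    data = [random.Random(1).randint(0, 1000) for _ in range(10)]
    parts = total_order_sort(data, key=lambda x: x, num_reducers=3)
    print(parts)
    print(limit([[1, 2, 3], [4, 5], [6]], 2))
```

Concatenating the partitions in order gives the fully sorted output, which is why the sort needs no final merge step. The map-side limit means the single reducer receives at most n records per mapper rather than the whole input.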

Are these questions purely academic, or are there applications where you'd like to use Pig's order and limit but can't do the other processing in Pig? If the latter, I'd recommend checking out the new mapreduce command introduced in 0.8 (which we'll release here in a week or two, I hope), which allows you to invoke MR jobs from Pig. You can learn more about this at https://issues.apache.org/jira/browse/PIG-506 . You can also see the documentation for this feature in http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup (search on MAPREDUCE). Sorry, this is the Forrest version. You can also see it in HTML by checking out the code and building it yourself.

Alan.

On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:

Hi Josh

AFAIR, all relational operators reside in the PO*.java sources under o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
Alternatively check POLimit, POSort under http://pig.apache.org/docs/r0.7.0/api/

PigServer is the starting point; internally it builds the logical and physical plans for the jobs. The execution engine then executes the job. Refer to the files under o.a.p.backend.hadoop.executionengine.
More details under http://wiki.apache.org/pig/PigExecutionModel

Thanks & Regards,
/Rekha.

On 11/14/10 7:59 PM, "Josh Devins" <[email protected]> wrote:

Hi all,

I'm happily using Pig to ORDER BY and LIMIT some large relations quite effectively. However, I'm curious about how these are/would be implemented in "raw" MapReduce. Can anyone shed some light/point to some details, examples or pseudo-code somewhere?

Cheers,

Josh

