POSort is only used for sorting bags in memory (such as a sort inside a
foreach), not for top-level sorts. In both cases the physical operators
capture only part of the actual operation, since much of the work is
done by the Hadoop framework.
Very briefly, order by works by taking a sample of the input, building
a partitioner that will produce a balanced total ordering of the data
(that is, each part file will be approximately the same size), and then
running an MR job that uses the order by key as the grouping key along
with the just-built partitioner. Limit works by applying the limit in
each mapper and then running a reduce pass in a single reducer, again
applying the limit.
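
To make the two flows above a bit more concrete, here is a minimal
single-process Python sketch. It is not Pig's or Hadoop's actual code:
the function names (build_cut_points, partition, order_by, limit) are
made up for illustration, "mappers" and "reducers" are simulated with
plain lists, and a call to sorted() stands in for the sorting that
Hadoop's shuffle does within each reducer.

```python
import bisect
import random


def build_cut_points(sample, num_reducers):
    """Pick num_reducers - 1 cut points (quantiles) from a sample of keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, cut_points):
    """Range partitioner: index of the reducer that owns this key."""
    return bisect.bisect_right(cut_points, key)


def order_by(splits, num_reducers, sample_size=100, seed=0):
    """Simulated ORDER BY: sample, partition by key range, sort each part."""
    rng = random.Random(seed)
    all_keys = [k for split in splits for k in split]
    sample = rng.sample(all_keys, min(sample_size, len(all_keys)))
    cut_points = build_cut_points(sample, num_reducers)

    # "Map + shuffle": route every key to a reducer by range.
    parts = [[] for _ in range(num_reducers)]
    for split in splits:
        for key in split:
            parts[partition(key, cut_points)].append(key)

    # "Reduce": each reducer's keys end up sorted; reading the part
    # files in reducer order then yields a total ordering.
    return [sorted(p) for p in parts]


def limit(splits, n):
    """Simulated LIMIT: cap each mapper's output, then cap again in one reducer."""
    mapper_outputs = [split[:n] for split in splits]
    merged = [row for out in mapper_outputs for row in out]
    return merged[:n]
```

Because the partitioner is monotone in the key, concatenating the parts
in reducer order is globally sorted no matter how good the sample is; the
sample only affects how evenly the parts are sized. The per-mapper limit
is just an optimization so that the single reducer sees at most n rows
per mapper.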
Are these questions purely academic, or are there applications where
you'd like to use Pig's order and limit but you can't do the other
processing in Pig? If the latter, I'd recommend checking out the new
mapreduce command introduced in 0.8 (which we'll release here in a
week or two I hope) which allows you to invoke MR jobs from Pig. You
can learn more about this in PIG-506:
https://issues.apache.org/jira/browse/PIG-506
You can also see the documentation for this feature in
http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
(search on MAPREDUCE). Sorry, this is the Forrest version. You can
also see it in HTML by checking out the code and building it yourself.
Alan.
On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:
Hi Josh
AFAIR, all relational operators reside in the PO*.java sources under
o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
Alternatively, check POLimit and POSort under http://pig.apache.org/docs/r0.7.0/api/
PigServer is the starting point; internally it builds the logical and
physical plans for the jobs. The execution engine then executes the
job. Refer to the files under o.a.p.backend.hadoop.executionengine.
More details under http://wiki.apache.org/pig/PigExecutionModel
Thanks & Regards,
/Rekha.
On 11/14/10 7:59 PM, "Josh Devins" <[email protected]> wrote:
Hi all,
I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However, I'm curious about how these are/would be
implemented in "raw" MapReduce. Can anyone shed some light/point to
some details, examples or pseudo-code somewhere?
Cheers,
Josh