Re: Implementation of ORDER and LIMIT

Alan Gates Mon, 15 Nov 2010 12:57:59 -0800

POSort is only used for sorts of bags in memory (such as a sort inside a foreach), not top-level sorts. In both cases the physical operators only capture part of the actual operations, since much of the work is done by the Hadoop framework.

Very briefly, order by works by taking a sample of the input, building a partitioner that will produce a balanced total ordering of the data (that is, each part file will be approximately the same size), and then running an MR job that uses the order by key as the grouping key along with the just-built partitioner. Limit works by applying the limit to each mapper and then running a reduce pass in a single reduce, again applying the limit.
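As a rough illustration of the two strategies described above, here is a small in-memory Python simulation. It is only a sketch under stated assumptions: the function names (build_cut_points, partition_for, order_by, limit) are hypothetical and are not Pig or Hadoop APIs, the sample is a simple uniform random sample, and plain lists stand in for map splits and reducer outputs. Pig's actual sampler and partitioner may differ in detail.

```python
import bisect
import random


def build_cut_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sorted sample."""
    ordered = sorted(sample)
    if not ordered:
        return []  # no sample: every key falls into partition 0
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, cut_points):
    """Range partitioner: index = number of cut points <= key."""
    return bisect.bisect_right(cut_points, key)


def order_by(records, num_partitions, sample_size=100, seed=0):
    """Simulate: sample the input, range-partition, sort each partition."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    cut_points = build_cut_points(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_for(key, cut_points)].append(key)
    # Each "reducer" sorts only its own key range. In Hadoop the shuffle
    # does this sort; reading the part files in order gives a total order.
    return [sorted(p) for p in partitions]


def limit(map_splits, n):
    """Simulate LIMIT: truncate in each mapper, then in a single reducer."""
    per_mapper = [split[:n] for split in map_splits]
    merged = [rec for part in per_mapper for rec in part]
    return merged[:n]


if __name__ == "__main__":
    data = list(range(1000))
    random.Random(1).shuffle(data)
    parts = order_by(data, num_partitions=4)
    print([len(p) for p in parts])  # partition sizes, roughly balanced
    print(limit([[1, 2, 3], [4, 5, 6], [7]], 2))
```

Because every key in partition i is smaller than every key in partition i + 1, no final merge step is needed. The per-mapper truncation in limit only reduces how much data reaches the single reducer; the reducer still applies the limit itself.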

Are these questions purely academic, or are there applications where you'd like to use Pig's order and limit but you can't do the other processing in Pig? If the latter, I'd recommend checking out the new mapreduce command introduced in 0.8 (which we'll release here in a week or two, I hope), which allows you to invoke MR jobs from Pig. You can learn more about this at https://issues.apache.org/jira/browse/PIG-506. You can also see the documentation for this feature in http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup (search on MAPREDUCE). Sorry, this is the Forrest version. You can also see it in HTML by checking out the code and building it yourself.


Alan.

On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:

Hi Josh
AFAIR, all relational operators reside in source PO*.java under o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
Alternatively check POLimit, POSort under http://pig.apache.org/docs/r0.7.0/api/
PigServer is the starting point, and internally it forms the logical/physical plans of jobs. The execution engine executes the job. Refer to the files under o.a.p.backend.hadoop.executionengine.
More details under http://wiki.apache.org/pig/PigExecutionModel

Thanks & Regards,
/Rekha.

On 11/14/10 7:59 PM, "Josh Devins" <[email protected]> wrote:

Hi all,

I'm happily using Pig to ORDER BY and LIMIT some large relations quite
effectively. However I'm curious about how these are/would be implemented in "raw" MapReduce. Can anyone shed some light/point to some details, examples
or pseudo-code somewhere?

Cheers,

Josh
