Thanks Alan,

It was indeed a purely academic question. I've had no issues at all with the limits or order by not working in Pig. I'm a happy Pig user ;)

Cheers,
Josh

On 15 November 2010 21:56, Alan Gates <[email protected]> wrote:

> POSort is only used for sorts of bags in memory (such as a sort inside a
> foreach), not top-level sorts. In both cases the physical operators only
> capture part of the actual operations, since much of the work is done by the
> Hadoop framework.
>
> Very briefly, order by works by taking a sample of the input, building a
> partitioner that will produce a balanced total ordering of the data (that
> is, each part file will be approximately the same size), and then running an
> MR job that uses the order by key as the grouping key along with the
> just-built partitioner. Limit works by applying the limit in each mapper and
> then running a reduce pass in a single reducer, again applying the limit.
>
> Are these questions purely academic, or are there applications where you'd
> like to use Pig's order and limit but you can't do the other processing in
> Pig? If the latter, I'd recommend checking out the new mapreduce command
> introduced in 0.8 (which we'll release here in a week or two, I hope), which
> allows you to invoke MR jobs from Pig. You can learn more about this at
> https://issues.apache.org/jira/browse/PIG-506. You can also see the
> documentation for this feature in
> http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?view=markup
> (search on MAPREDUCE). Sorry, this is the forrest version. You can also
> see it in html by checking out the code and building it yourself.
>
> Alan.
>
>
> On Nov 15, 2010, at 12:50 AM, Rekha Joshi wrote:
>
>> Hi Josh
>>
>> AFAIR, all relational operators reside in source PO*.java under
>> o.a.p.backend.hadoop.executionengine.physicalLayer.relationalOperators.
>> Alternatively check POLimit, POSort under
>> http://pig.apache.org/docs/r0.7.0/api/
>>
>> PigServer is the starting point, and internally will have formations of
>> logical/physical plan of jobs. The executionengine executes the job. Refer
>> files under o.a.p.backend.hadoop.executionengine.
>> More details under http://wiki.apache.org/pig/PigExecutionModel
>>
>> Thanks & Regards,
>> /Rekha.
>>
>> On 11/14/10 7:59 PM, "Josh Devins" <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'm happily using Pig to ORDER BY and LIMIT some large relations quite
>> effectively. However I'm curious about how these are/would be implemented
>> in "raw" MapReduce. Can anyone shed some light/point to some details,
>> examples or pseudo-code somewhere?
>>
>> Cheers,
>>
>> Josh
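
For anyone reading this thread later who wants the two strategies Alan describes in concrete form, here is a minimal single-process sketch in Python. It is not Pig's or Hadoop's code: the function names (sample_cut_points, order_by, limit), the sample size, and the in-memory lists standing in for input splits and part files are all assumptions made for illustration. It only mirrors the shape of the approach: sample the keys, derive range cut points, route each record to a partition, sort within each partition; and for limit, truncate per mapper and then truncate again in a single reducer.

```python
import bisect
import random


def sample_cut_points(records, key, num_partitions, sample_size=100, seed=0):
    """Sample the keys and pick cut points that split the sample evenly.

    Hypothetical stand-in for the sampling job that precedes the sort.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(records))
    sample = sorted(key(r) for r in rng.sample(records, k))
    if not sample:
        return []
    # One cut point between each pair of adjacent partitions.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def order_by(records, key, num_partitions=4):
    """Return a list of sorted partitions whose concatenation is fully sorted."""
    cuts = sample_cut_points(records, key, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        # Range partitioner: the number of cut points <= key picks the partition.
        partitions[bisect.bisect_right(cuts, key(r))].append(r)
    # Each "reducer" sorts only its own partition (its own "part file").
    return [sorted(p, key=key) for p in partitions]


def limit(splits, n):
    """Apply the limit in each "mapper", then once more in one "reducer"."""
    per_mapper = [split[:n] for split in splits]            # map side
    merged = [r for part in per_mapper for r in part]       # shuffle to one reducer
    return merged[:n]                                       # reduce side


if __name__ == "__main__":
    data = list(range(1000))
    random.Random(1).shuffle(data)
    parts = order_by(data, key=lambda x: x, num_partitions=4)
    print([len(p) for p in parts])   # partition sizes, roughly balanced
    print(limit(parts, 5))           # at most 5 records survive
```

Because every key in one partition is less than or equal to every key in the next, reading the partitions in order yields a total ordering without a final merge step. How balanced the partitions are depends on how well the sample represents the data, which is the point of sampling first.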
