This isn't going to be very efficient -- Pig will figure out that it can do COUNT in a distributed fashion (a count produced on each mapper, then summed at the reducer).
Normally, TOP can be distributed as well (the top 3 of 100 items is the top 3 of (top 3 of the first 20, top 3 of the next 20, etc.)). But since in this case Pig won't know how many of the top items to keep on a mapper until it's done the count, it won't kick into this optimization.

If you are dealing with large datasets, calculating the count in a separate group-all, as in the example in the jira I linked to, is going to be much better.

D

On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh wrote:

> Thank you guys! It worked for me:
>
> This is to get top 20%:
>
> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
> B = GROUP A BY category;
>
> topResults = FOREACH B {
>     count = COUNT(A);
>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>     GENERATE FLATTEN(result);
> }
>
> dump topResults;
>
> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger wrote:
> > Hi Dmitriy -- great info, thanks.
> >
> > On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy wrote:
> >> You could also do it with TOP as Norbert suggests, but that has a bit of
> >> extra cost due to the sort TOP does.
> >
> > Just for my understanding, doesn't the ORDER BY in the PIG-1926
> > example impose the same sort cost? Seems that you'd have to pay for a
> > sort as long as the requirement is top N.
> >
> > Norbert
> >
> >> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger wrote:
> >>
> >>> Hi Ruslan -- no need to write your own UDF. There is a built-in
> >>> function TOP() which will extract for you the top N tuples of a
> >>> relation, where N is a configurable parameter. Take a look at:
> >>>
> >>> http://pig.apache.org/docs/r0.9.0/func.html#topx
> >>>
> >>> Norbert
> >>>
> >>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh wrote:
> >>> > Hey guys,
> >>> >
> >>> > How can I LIMIT a relation by percentage?
> >>> > What I need is to sort a relation by a numeric column and then take
> >>> > top 5% of tuples.
> >>> > As far as I understand I cannot use an expression in the LIMIT
> >>> > operator. Do I have to write my own UDF? What type of UDF should I use
> >>> > then?
> >>> >
> >>> > --
> >>> > Best Regards,
> >>> > Ruslan Al-Fakikh
>
> --
> Best Regards,
> Ruslan Al-Fakikh
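
P.S. For anyone who wants to check the two claims at the top of this message, here is a small Python sketch. It is not Pig and not how Pig implements its combiner. The function names, the 5 chunks of 20 items, and the random data are all made up for illustration. It shows that a fixed-size top N can be computed per chunk and merged, and that a percentage-based N can only be worked out after the total count is known.

```python
import heapq
import random


def top_k(items, k):
    """Return the k largest items, largest first."""
    return heapq.nlargest(k, items)


def distributed_top_k(chunks, k):
    """Top k of each chunk (the "mapper" step), then top k of the
    combined partial results (the "reducer" step)."""
    partials = [top_k(chunk, k) for chunk in chunks]
    merged = [x for partial in partials for x in partial]
    return top_k(merged, k)


def distributed_count(chunks):
    """Count each chunk separately, then sum the partial counts."""
    return sum(len(chunk) for chunk in chunks)


random.seed(0)
values = random.sample(range(1000), 100)                 # 100 distinct items
chunks = [values[i:i + 20] for i in range(0, 100, 20)]   # 5 hypothetical mappers

# Fixed N: each chunk can be trimmed to 3 items before anything is merged.
fixed = distributed_top_k(chunks, 3)

# Percentage: N depends on the total count, so the count pass has to
# finish before any chunk knows how many items it is allowed to keep.
total = distributed_count(chunks)
k = int(total * (20 / 100.0))
top_fraction = distributed_top_k(chunks, k)

print(fixed)
print(total, k, len(top_fraction))
```

The second half is the situation in the nested FOREACH quoted above: N comes out of COUNT, so nothing can be trimmed early. Running the count as its own step first makes N a known value by the time the top-N step starts.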
