Just replace the call to TOP with a call to LIMIT. In trunk, LIMIT takes expressions as arguments (it only took constants before).

On Sep 9, 2011, at 4:20 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hello Dmitriy,
>
> I guess you mean this example:
> a = LOAD 'a.txt';
> b = GROUP a all;
> c = FOREACH b GENERATE COUNT(a) AS sum;
> d = ORDER a BY $0;
> e = LIMIT d c.sum/100;
>
> But here they group all tuples.
>
> In my example:
> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
> B = GROUP A BY category;
>
> topResults = FOREACH B {
>     count = COUNT(A);
>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>     GENERATE FLATTEN(result);
> }
>
> I group by category. Actually what I need in the end is to take top
> 20% visitors (visitors with the biggest numbers of impressions) per
> category.
> So, probably it can't be optimized, or am I missing something?
>
> Thanks in advance!
>
> On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[email protected]> wrote:
>> This isn't going to be very efficient -- Pig will figure out that it can do
>> COUNT in a distributed fashion (count produced on each mapper, and summed at
>> the reducer)
>>
>> Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of
>> (top 3 of first 20, top 3 of next 20, etc)). But since in this case Pig
>> won't know how many of the top items to keep on a mapper until it's done the
>> count, it won't kick into this optimization. If you are dealing with large
>> datasets, calculating the count in a separate group-all, as in the example
>> in the jira I linked to, is going to be much better.
>>
>> D
>>
>> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <[email protected]> wrote:
>>
>>> Thank you guys! It worked for me:
>>>
>>> This is to get top 20%:
>>>
>>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>>> B = GROUP A BY category;
>>>
>>> topResults = FOREACH B {
>>>     count = COUNT(A);
>>>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>>>     GENERATE FLATTEN(result);
>>> }
>>>
>>> dump topResults;
>>>
>>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[email protected]> wrote:
>>>> Hi Dmitriy -- great info, thanks.
>>>>
>>>> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> You could also do it with TOP as Norbert suggests, but that has a bit of
>>>>> extra cost due to the sort TOP does.
>>>>
>>>> Just for my understanding, doesn't the ORDER BY in the PIG-1926
>>>> example impose the same sort cost? Seems that you'd have to pay for a
>>>> sort as long as the requirement is top N.
>>>>
>>>> Norbert
>>>>
>>>>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[email protected]> wrote:
>>>>>
>>>>>> Hi Ruslan -- no need to write your own UDF. There is a built-in
>>>>>> function TOP() which will extract for you the top N tuples of a
>>>>>> relation, where N is a configurable parameter. Take a look at:
>>>>>>
>>>>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>>>>
>>>>>> Norbert
>>>>>>
>>>>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>>>>>> <[email protected]> wrote:
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> How can I LIMIT a relation by percentage?
>>>>>>> What I need is to sort a relation by a numeric column and then take
>>>>>>> top 5% of tuples.
>>>>>>> As far as I understand I cannot use an expression in the LIMIT
>>>>>>> operator. Do I have to write my own UDF? What type of UDF should I
>>>>>>> use then?
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ruslan Al-Fakikh
>>>
>>> --
>>> Best Regards,
>>> Ruslan Al-Fakikh
>
> --
> Best Regards,
> Ruslan Al-Fakikh
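
For anyone reading the archive later, the "top 3 of 100 items is top 3 of (top 3 of first 20, ...)" point above can be checked with a small sketch. This is plain Python, not Pig, and it does not show how Pig implements TOP; the helper names `top_k` and `distributed_top_k` and the sample data are made up for illustration.

```python
import heapq


def top_k(items, k):
    """Return the k largest items, largest first."""
    return heapq.nlargest(k, items)


def distributed_top_k(chunks, k):
    """Top k computed in two stages, mimicking mappers and a reducer."""
    # "Mapper" stage: each chunk keeps only its own k largest items.
    partials = [top_k(chunk, k) for chunk in chunks]
    # "Reducer" stage: top k of the merged partial results.
    merged = [value for partial in partials for value in partial]
    return top_k(merged, k)


# 100 made-up impression counts: a fixed permutation of 1..100.
impressions = [(i * 37) % 101 for i in range(1, 101)]
# Five "mappers", 20 items each.
chunks = [impressions[i:i + 20] for i in range(0, len(impressions), 20)]

# With k known up front, the two-stage result equals the single-pass result.
assert distributed_top_k(chunks, 3) == top_k(impressions, 3)

# In the percentage case, k depends on the total count, which is only
# known after every chunk has been counted, so no chunk can trim early.
k = int(len(impressions) * (20 / 100.0))
top_20_percent = top_k(impressions, k)
```

The last two lines show the problem case: `k` cannot be computed until the full count is available, which is why computing the count in a separate step first, as suggested above, lets the trimming step know `k` in advance.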
