But we are on version 0.8 now and planning to move to 0.9, so we are far away from 0.10.
So, I guess my way is the only one for now :(

On Fri, Sep 9, 2011 at 8:19 PM, Dmitriy Ryaboy wrote:
> Just replace the call to TOP with a call to limit. In trunk, limit takes
> expressions as arguments (it only took constants before)
>
> On Sep 9, 2011, at 4:20 AM, Ruslan Al-Fakikh wrote:
>
>> Hello Dmitriy,
>>
>> I guess you mean this example:
>> a = LOAD 'a.txt';
>> b = GROUP a all;
>> c = FOREACH b GENERATE COUNT(a) AS sum;
>> d = ORDER a BY $0;
>> e = LIMIT d c.sum/100;
>>
>> But here they group all tuples.
>>
>> In my example:
>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>> B = GROUP A BY category;
>>
>> topResults = FOREACH B {
>>     count = COUNT(A);
>>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>>     GENERATE FLATTEN(result);
>> }
>>
>> I group by category. Actually what I need in the end is to take top
>> 20% visitors (visitors with the biggest numbers of impressions) per
>> category.
>> So, probably it can't be optimized, or am I missing something?
>>
>> Thanks in advance!
>>
>> On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy wrote:
>>> This isn't going to be very efficient -- Pig will figure out that it can do
>>> COUNT in a distributed fashion (count produced on each mapper, and summed at
>>> the reducer)
>>>
>>> Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of
>>> (top 3 of first 20, top 3 of next 20, etc)). But since in this case Pig
>>> won't know how many of the top items to keep on a mapper until it's done the
>>> count, it won't kick into this optimization. If you are dealing with large
>>> datasets, calculating the count in a separate group-all, as in the example
>>> in the jira I linked to, is going to be much better.
>>>
>>> D
>>>
>>> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh wrote:
>>>
>>>> Thank you guys!
>>>> It worked for me:
>>>>
>>>> This is to get top 20%:
>>>>
>>>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>>>> B = GROUP A BY category;
>>>>
>>>> topResults = FOREACH B {
>>>>     count = COUNT(A);
>>>>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>>>>     GENERATE FLATTEN(result);
>>>> }
>>>>
>>>> dump topResults;
>>>>
>>>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger wrote:
>>>>> Hi Dmitriy -- great info, thanks.
>>>>>
>>>>> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy wrote:
>>>>>> You could also do it with TOP as Norbert suggests, but that has a bit of
>>>>>> extra cost due to the sort TOP does.
>>>>>
>>>>> Just for my understanding, doesn't the ORDER BY in the PIG-1926
>>>>> example impose the same sort cost? Seems that you'd have to pay for a
>>>>> sort as long as the requirement is top N.
>>>>>
>>>>> Norbert
>>>>>
>>>>>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger wrote:
>>>>>>
>>>>>>> Hi Ruslan -- no need to write your own UDF. There is a built-in
>>>>>>> function TOP() which will extract for you the top N tuples of a
>>>>>>> relation, where N is a configurable parameter. Take a look at:
>>>>>>>
>>>>>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>>>>>
>>>>>>> Norbert
>>>>>>>
>>>>>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh wrote:
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> How can I LIMIT a relation by percentage?
>>>>>>>> What I need is to sort a relation by a numeric column and then take
>>>>>>>> top 5% of tuples.
>>>>>>>> As far as I understand I cannot use an expression in the LIMIT
>>>>>>>> operator. Do I have to write my own UDF? What type of UDF should I
>>>>>>>> use then?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards,
>>>>>>>> Ruslan Al-Fakikh
>>>>
>>>> --
>>>> Best Regards,
>>>> Ruslan Al-Fakikh
>>
>> --
>> Best Regards,
>> Ruslan Al-Fakikh

--
Best Regards,
Ruslan Al-Fakikh

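Dmitriy's remark that "top 3 of 100 items is top 3 of (top 3 of first 20, top 3 of next 20, etc)" can be checked outside Pig. The following is a minimal sketch in plain Python, not Pig's actual implementation; the function names `top_n` and `distributed_top_n`, the chunking scheme, and the sample data are all assumptions made for illustration.

```python
import heapq


def top_n(items, n):
    """Return the n largest items, largest first."""
    return heapq.nlargest(n, items)


def distributed_top_n(items, n, chunk_size):
    """Top n computed in two stages, mimicking mappers and a reducer."""
    partials = []
    # "Mapper" stage: each chunk keeps only its own local top n.
    for start in range(0, len(items), chunk_size):
        partials.extend(top_n(items[start:start + chunk_size], n))
    # "Reducer" stage: the global top n is the top n of the survivors.
    return top_n(partials, n)


# 100 distinct sample values in a scrambled order (hypothetical data).
values = [(i * 37) % 101 for i in range(100)]

# The two-stage result matches a plain global top 3.
print(distributed_top_n(values, 3, 20) == top_n(values, 3))
```

The two-stage version only works because n is fixed before any chunk is processed. In the per-category FOREACH quoted in the thread, n depends on COUNT(A), so a mapper cannot know how many items to keep until the count is finished, which is the reason Dmitriy gives for the optimization not kicking in.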