This isn't going to be very efficient.  Pig will figure out that it can do
COUNT in a distributed fashion (a partial count is produced on each mapper,
and the partial counts are summed at the reducer).

Normally, TOP can be distributed as well (the top 3 of 100 items is the top 3
of (top 3 of the first 20, top 3 of the next 20, etc.)).  But in this case
Pig won't know how many of the top items to keep on a mapper until it has
finished the count, so it won't kick into this optimization.  If you are
dealing with large datasets, calculating the count in a separate group-all,
as in the example in the jira I linked to, is going to be much better.
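
To make the "top 3 of top 3s" point concrete, here is a rough sketch in
Python (not Pig, and not a description of how Pig implements TOP
internally).  The function names and the chunk size standing in for one
mapper's input are made up for illustration.

```python
import heapq
import random


def top_n(items, n):
    """Return the n largest items, largest first."""
    return heapq.nlargest(n, items)


def distributed_top_n(items, n, chunk_size):
    """Top n of each chunk (the "mapper" step), then top n of the
    survivors (the "reducer" step)."""
    partials = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        partials.extend(top_n(chunk, n))  # each chunk keeps at most n items
    return top_n(partials, n)


random.seed(0)
data = list(range(100))
random.shuffle(data)

# Chunked and single-pass results agree, because an item in the global
# top n is necessarily in the top n of whichever chunk it falls into.
assert distributed_top_n(data, 3, 20) == top_n(data, 3)
```

Note that n has to be known before the per-chunk step can run.  When n is
derived from the total count, as in the script quoted below, that number
isn't available until every chunk has been seen, which is why the shortcut
doesn't apply there.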

D

On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <
[email protected]> wrote:

> Thank you guys! It worked for me:
>
> This is to get top 20%:
>
> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
> B = GROUP A BY category;
>
> topResults = FOREACH B {
>     count = COUNT(A);
>     result = TOP((int)(count * (20 / 100.0)), 2, A);
>     GENERATE FLATTEN(result);
> }
>
> dump topResults;
>
> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[email protected]> wrote:
> > Hi Dmitriy -- great info, thanks.
> >
> > On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >> You could also do it with TOP as Norbert suggests, but that has a bit of
> >> extra cost due to the sort TOP does.
> >
> > Just for my understanding, doesn't the ORDER BY in the PIG-1926
> > example impose the same sort cost?  Seems that you'd have to pay for a
> > sort as long as the requirement is top N.
> >
> > Norbert
> >
> >> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[email protected]> wrote:
> >>
> >>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
> >>> function TOP() which will extract for you the top N tuples of a
> >>> relation, where N is a configurable parameter.  Take a look at:
> >>>
> >>> http://pig.apache.org/docs/r0.9.0/func.html#topx
> >>>
> >>> Norbert
> >>>
> >>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
> >>> <[email protected]> wrote:
> >>> > Hey guys,
> >>> >
> >>> > How can I LIMIT a relation by percentage?
> >>> > What I need is to sort a relation by a numeric column and then take
> >>> > top 5% of tuples.
> >>> > As far as I understand I cannot use an expression in the LIMIT
> >>> > operator. Do I have to write my own UDF? What type of UDF should I
> >>> > use then?
> >>> >
> >>> > --
> >>> > Best Regards,
> >>> > Ruslan Al-Fakikh
> >>> >
> >>>
> >>
> >
>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
>
