Hello Dmitriy,
I guess you mean this example:
a = LOAD 'a.txt';
b = GROUP a all;
c = FOREACH b GENERATE COUNT(a) AS sum;
d = ORDER a BY $0;
e = LIMIT d c.sum/100;
But here they group all tuples.
In my example:
A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;
topResults = FOREACH B {
    count = COUNT(A);
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}
I group by category. What I actually need in the end is the top 20%
of visitors (those with the highest impression counts) within each
category.
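
To make the intended result concrete, here is a small sketch in Python (not
Pig) of the same per-category computation. The function name and the sample
rows are made up for illustration; only the 20% fraction and the
(category, visitor, impressions) layout come from the script above.

```python
from collections import defaultdict
import heapq


def top_fraction_per_category(rows, fraction=0.2):
    """Return {category: top rows by impressions} for (category, visitor, impressions) rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = {}
    for category, members in groups.items():
        # N depends on the size of this group, so it is only known once the
        # whole group has been counted. int() truncates, like the (int) cast.
        n = int(len(members) * fraction)
        result[category] = heapq.nlargest(n, members, key=lambda r: r[2])
    return result


# Hypothetical sample data: 10 "news" visitors and 2 "sports" visitors.
rows = [("news", "v%d" % i, i * 10) for i in range(1, 11)]
rows += [("sports", "a", 5), ("sports", "b", 7)]
top = top_fraction_per_category(rows)
```

In this sketch a category with fewer than 5 rows yields an empty list, because
the truncated N is 0. How the Pig script behaves in that case is not covered
here.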
So, probably it can't be optimized, or am I missing something?
Thanks in advance!
On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[email protected]> wrote:
> This isn't going to be very efficient -- Pig will figure out that it can do
> COUNT in a distributed fashion (count produced on each mapper, and summed at
> the reducer).
>
> Normally, TOP can be distributed as well (top 3 of 100 items is top 3 of
> (top 3 of first 20, top 3 of next 20, etc.)). But since in this case Pig
> won't know how many of the top items to keep on a mapper until it's done the
> count, it won't kick into this optimization. If you are dealing with large
> datasets, calculating the count in a separate group-all, as in the example
> in the jira I linked to, is going to be much better.
>
> D
>
> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <
> [email protected]> wrote:
>
>> Thank you guys! It worked for me:
>>
>> This is to get top 20%:
>>
>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>> B = GROUP A BY category;
>>
>> topResults = FOREACH B {
>> count = COUNT(A);
>> result = TOP((int)(count * (20 / 100.0)), 2, A);
>> GENERATE FLATTEN(result);
>> }
>>
>> dump topResults;
>>
>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[email protected]>
>> wrote:
>> > Hi Dmitriy -- great info, thanks.
>> >
>> > On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>> >> You could also do it with TOP as Norbert suggests, but that has a bit of
>> >> extra cost due to the sort TOP does.
>> >
>> > Just for my understanding, doesn't the ORDER BY in the PIG-1926
>> > example impose the same sort cost? Seems that you'd have to pay for a
>> > sort as long as the requirement is top N.
>> >
>> > Norbert
>> >
>> >> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[email protected]> wrote:
>> >>
>> >>> Hi Ruslan -- no need to write your own UDF. There is a built-in
>> >>> function TOP() which will extract for you the top N tuples of a
>> >>> relation, where N is a configurable parameter. Take a look at:
>> >>>
>> >>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>> >>>
>> >>> Norbert
>> >>>
>> >>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>> >>> <[email protected]> wrote:
>> >>> > Hey guys,
>> >>> >
>> >>> > How can I LIMIT a relation by percentage?
>> >>> > What I need is to sort a relation by a numeric column and then take
>> >>> > top 5% of tuples.
>> >>> > As far as I understand I cannot use an expression in the LIMIT
>> >>> > operator. Do I have to write my own UDF? What type of UDF should I
>> >>> > use then?
>> >>> >
>> >>> > --
>> >>> > Best Regards,
>> >>> > Ruslan Al-Fakikh
--
Best Regards,
Ruslan Al-Fakikh