Thank you guys! It worked for me:
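This gets the top 20% of tuples in each category, ranked by impressions:

A = LOAD 'input' AS (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;
topResults = FOREACH B {
    -- number of tuples in this category
    count = COUNT(A);
    -- keep 20% of them (truncated), ranked by column 2 (impressions, 0-based)
    result = TOP((int)(count * (20 / 100.0)), 2, A);
    GENERATE FLATTEN(result);
}
DUMP topResults;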
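
For the 5% case from the original question, a minimal sketch along the same lines follows. It has not been run here. It assumes the same input schema as the script above and that the built-in CEIL function is available in your Pig version; the alias names n and top5pct are made up for illustration. Rounding up instead of truncating means a category with fewer than 20 tuples still returns one tuple rather than none.

```pig
-- Sketch only: 'input' is a placeholder path, as in the script above.
A = LOAD 'input' AS (category: chararray, visitor: chararray, impressions: int);
B = GROUP A BY category;
top5pct = FOREACH B {
    -- Round up so that small categories keep at least one tuple.
    n = (int) CEIL(COUNT(A) * 0.05);
    -- Column index 2 is impressions (indexes start at 0).
    result = TOP(n, 2, A);
    GENERATE FLATTEN(result);
}
DUMP top5pct;
```

Whether to round up or truncate depends on how strictly "top 5%" should be read for small groups.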
On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[email protected]> wrote:
> Hi Dmitriy -- great info, thanks.
>
> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> You could also do it with TOP as Norbert suggests, but that has a bit of
>> extra cost due to the sort TOP does.
>
> Just for my understanding, doesn't the ORDER BY in the PIG-1926
> example impose the same sort cost? Seems that you'd have to pay for a
> sort as long as the requirement is top N.
>
> Norbert
>
>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger
>> <[email protected]> wrote:
>>
>>> Hi Ruslan -- no need to write your own UDF. There is a built-in
>>> function TOP() which will extract for you the top N tuples of a
>>> relation, where N is a configurable parameter. Take a look at:
>>>
>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>
>>> Norbert
>>>
>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh
>>> <[email protected]> wrote:
>>> > Hey guys,
>>> >
>>> > How can I LIMIT a relation by percentage?
>>> > What I need is to sort a relation by a numeric column and then take
>>> > top 5% of tuples.
>>> > As far as I understand I cannot use an expression in the LIMIT
>>> > operator. Do I have to write my own UDF? What type of UDF should I use
>>> > then?
>>> >
>>> > --
>>> > Best Regards,
>>> > Ruslan Al-Fakikh
>>> >
>>>
>>
>
--
Best Regards,
Ruslan Al-Fakikh