Re: How to LIMIT a relation by percentage

Dmitriy Ryaboy Fri, 09 Sep 2011 09:19:49 -0700

Just replace the call to TOP with a call to LIMIT. In trunk, LIMIT takes 
expressions as arguments (it previously took only constants).


On Sep 9, 2011, at 4:20 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hello Dmitriy,
> 
> I guess you mean this example:
> a = LOAD 'a.txt';
> b = GROUP a all;
> c = FOREACH b GENERATE COUNT(a) AS sum;
> d = ORDER a BY $0;
> e = LIMIT d c.sum/100;
> 
> But here they group all tuples.
> 
> In my example:
> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
> B = GROUP A BY category;
> 
> topResults = FOREACH B {
>   count = COUNT(A);
>   result = TOP((int)(count * (20 / 100.0)), 2, A);
>   GENERATE FLATTEN(result);
> }
> 
> I group by category. What I actually need in the end is to take the top
> 20% of visitors (those with the largest numbers of impressions) per
> category.
> So, probably it can't be optimized, or am I missing something?
> 
> Thanks in advance!
> 
> On Fri, Sep 9, 2011 at 4:43 AM, Dmitriy Ryaboy <[email protected]> wrote:
>> This isn't going to be very efficient. Pig will figure out that it can do
>> COUNT in a distributed fashion (a count produced on each mapper, summed at
>> the reducer).
>> 
>> Normally, TOP can be distributed as well (the top 3 of 100 items is the top
>> 3 of (top 3 of the first 20, top 3 of the next 20, etc.)). But since in this
>> case Pig won't know how many of the top items to keep on a mapper until it
>> has finished the count, it won't kick into this optimization. If you are
>> dealing with large datasets, calculating the count in a separate group-all,
>> as in the example in the JIRA I linked to, is going to be much better.
>> 
>> D
>> 
>> On Thu, Sep 8, 2011 at 12:46 PM, Ruslan Al-Fakikh <[email protected]> wrote:
>> 
>>> Thank you guys! It worked for me:
>>> 
>>> This is to get top 20%:
>>> 
>>> A = LOAD 'input' as (category: chararray, visitor: chararray, impressions: int);
>>> B = GROUP A BY category;
>>> 
>>> topResults = FOREACH B {
>>>   count = COUNT(A);
>>>   result = TOP((int)(count * (20 / 100.0)), 2, A);
>>>   GENERATE FLATTEN(result);
>>> }
>>> 
>>> dump topResults;
>>> 
>>> On Thu, Sep 8, 2011 at 9:03 PM, Norbert Burger <[email protected]> wrote:
>>>> Hi Dmitriy -- great info, thanks.
>>>> 
>>>> On Thu, Sep 8, 2011 at 12:19 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>>> You could also do it with TOP as Norbert suggests, but that has a bit of
>>>>> extra cost due to the sort TOP does.
>>>> 
>>>> Just for my understanding, doesn't the ORDER BY in the PIG-1926
>>>> example impose the same sort cost?  It seems that you'd have to pay for
>>>> a sort as long as the requirement is top N.
>>>> 
>>>> Norbert
>>>> 
>>>>> On Thu, Sep 8, 2011 at 6:42 AM, Norbert Burger <[email protected]> wrote:
>>>>> 
>>>>>> Hi Ruslan -- no need to write your own UDF.  There is a built-in
>>>>>> function TOP() which will extract for you the top N tuples of a
>>>>>> relation, where N is a configurable parameter.  Take a look at:
>>>>>> 
>>>>>> http://pig.apache.org/docs/r0.9.0/func.html#topx
>>>>>> 
>>>>>> Norbert
>>>>>> 
>>>>>> On Thu, Sep 8, 2011 at 9:13 AM, Ruslan Al-Fakikh <[email protected]> wrote:
>>>>>>> Hey guys,
>>>>>>> 
>>>>>>> How can I LIMIT a relation by percentage?
>>>>>>> What I need is to sort a relation by a numeric column and then take
>>>>>>> top 5% of tuples.
>>>>>>> As far as I understand I cannot use an expression in the LIMIT
>>>>>>> operator. Do I have to write my own UDF? What type of UDF should I
>>>>>>> use then?
>>>>>>> 
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ruslan Al-Fakikh
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Ruslan Al-Fakikh
>>> 
>> 
> 
> 
> 
> -- 
> Best Regards,
> Ruslan Al-Fakikh
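
A note on the TOP discussion quoted above: the claim that "top 3 of 100 items is top 3 of (top 3 of first 20, top 3 of next 20, etc)" can be checked with a small sketch. This is plain Python, not Pig, and the names (top_k, distributed_top_k, top_fraction) and the sample data are made up for illustration; it only mimics the mapper/reducer split and says nothing about how Pig implements TOP internally.

```python
import heapq


def top_k(items, k):
    """Return the k largest items, largest first."""
    return heapq.nlargest(k, items)


def distributed_top_k(items, k, chunk_size):
    """Take the top k of each chunk (the 'mapper' step),
    then the top k of the combined partial results (the 'reducer' step)."""
    partials = []
    for start in range(0, len(items), chunk_size):
        partials.extend(top_k(items[start:start + chunk_size], k))
    return top_k(partials, k)


def top_fraction(items, fraction):
    """Top N percent: k is derived from the total count,
    so it is only known after every item has been counted."""
    k = int(len(items) * fraction)
    return top_k(items, k)


# 100 made-up impression counts, split into chunks of 20.
impressions = [(i * 37) % 101 for i in range(100)]
assert distributed_top_k(impressions, 3, 20) == top_k(impressions, 3)
```

With a fixed k, each chunk can discard everything except its own top k. In top_fraction, k comes from the total count, so no chunk knows how much it may discard until the count is finished, which is the reason given in the thread for the optimization not applying.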
