TOP is faster than order + limit, but only if you call it in a way that
doesn't require the whole bag to be materialized on the reducer. This
script does require that: on the map side, TOP does not yet know the
size of the bag, so it doesn't know how many elements to keep.

Fiddle with your script until you see algebraic invocation happening
-- you probably need to move the filter above the group, for example.

Something like this is a start:

raw_data = load ... as (id:chararray, weight:float);

-- manually moved the filter above the group
raw_data = filter raw_data by id == '1';

group_id = group raw_data by id;

count_spec_id = foreach group_id generate COUNT(raw_data) as tot;

-- make sure TOP only needs scalars and the grouped bag
sample_id = foreach group_id {
 generate TOP( ((int)count_spec_id.tot)/2, 1,  raw_data);
}
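
One way to check whether the algebraic path is being taken (a rough
sketch, not verified against 0.8.0-cdh3u0) is to ask Pig for the plan
of the final alias and look for a combiner stage. The alias is the one
from the script above; the exact layout of the output varies between
versions.

```pig
-- print the logical, physical and MapReduce plans for sample_id
explain sample_id;

-- In the MapReduce plan, a "Combine Plan" section between the map and
-- reduce plans suggests the combiner (and so the algebraic form of the
-- UDF) is in use. If that section is missing, the whole bag is most
-- likely being shipped to the reducer.
```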


2011/12/4 唐亮:
> Thank you Thejas Nair !
>
> But I find the TOP operator works extremely slowly.
>
> And could you give me an example that uses variables in LIMIT?
>
> My pig's version is:
> $ pig -version
> Apache Pig version 0.8.0-cdh3u0 (rexported)
> compiled Mar 25 2011, 16:16:24
>
>
> 2011/12/3 Thejas Nair:
>
>> Is this what you want ? (using TOP and COUNT).
>>
>>
>> raw_data = load ... as (id:chararray, weight:float);
>> group_id = group raw_data by id;
>>
>> filter_spec_id = filter group_id by group == '1';
>> -- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
>>
>>
>> sample_id = foreach filter_spec_id {
>>  order_weight = order raw_data by weight desc;
>>  limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
>>  generate limit_id;
>> }
>>
>> ---------
>>
>> The use of variables will be supported for limit in 0.10. But it is
>> supported only for scalar [1] variables. See
>> https://issues.apache.org/jira/browse/PIG-1926
>>
>> [1] See 'Casting Relations to Scalars' in
>> http://pig.apache.org/docs/r0.9.1/basic.html
>>
>> It should be possible to add support for other variables in case of limit
>> in nested foreach statement.
>> But the way you used it can't be supported if there are multiple records
>> in count_spec_id, as the limit variable comes from a different relation,
>> and pig does not know which value from that relation should be used in the
>> limit.
>>
>> -Thejas
>>
>>
>>
>>
>>
>>
>> On 12/2/11 5:45 PM, 唐亮 wrote:
>>
>>> Hi,
>>>
>>> The Pig code is below:
>>>
>>> raw_data = load ... as (id:chararray, weight:float);
>>> group_id = group raw_data by id;
>>>
>>> filter_spec_id = filter group_id by group == '1';
>>> count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
>>>
>>> sample_id = foreach filter_spec_id {
>>>   order_weight = order raw_data by weight desc;
>>>   limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
>>>
>>>   generate limit_id;
>>> }
>>>
>>> The compiler complains that limit should be followed by <INTEGER>.
>>> So, how can I limit the relation with a variable?
>>>
>>>
>>
