That's what TOP does: it returns the top n records without doing a total sort.
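
To make "without doing a total sort" concrete, here is a minimal sketch of the general technique in Python, not Pig's actual TOP implementation: keep a bounded min-heap of the n heaviest records seen so far. The function name `top_n_by_weight` and the sample records are hypothetical, chosen only for illustration.

```python
import heapq

def top_n_by_weight(records, n):
    """Return the n heaviest (id, weight) records, heaviest first.

    Keeps at most n entries in a min-heap, so the full input is never sorted.
    """
    if n <= 0:
        return []
    heap = []  # min-heap of (weight, arrival_index, record)
    for index, record in enumerate(records):
        entry = (record[1], index, record)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New record outweighs the lightest one kept so far: swap it in.
            heapq.heapreplace(heap, entry)
    # Only the n survivors are sorted, not the whole input.
    return [record for _, _, record in sorted(heap, reverse=True)]

# Hypothetical sample data: (id, weight) pairs for a single id.
records = [("1", 0.5), ("1", 2.0), ("1", 1.25), ("1", 0.75)]
top_half = top_n_by_weight(records, len(records) // 2)
print(top_half)
```

Passing half the record count as n gives the "top half by weight" you asked for, at a cost of roughly O(m log n) for m records rather than a full O(m log m) sort.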

D

2011/12/4 唐亮 <[email protected]>:
> Thanks.
> I will try your code.
>
> But my requirement is:
> Sort the records by weight in descending order,
> and then select the top half of the records.
>
> How can I implement this requirement?
>
>
> On December 5, 2011 at 3:30 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> TOP is faster than order + limit if you call it in a way that doesn't
>> require the whole bag to be materialized on the reducer, which this
>> script does (on the map side, top does not yet know the size of the
>> bag, so it doesn't know how many elements to keep).
>>
>> Fiddle with your script until you see algebraic invocation happening
>> -- you probably need to move the filter above the group, for example.
>>
>> Something like this is a start:
>>
>> raw_data = load ... as (id:chararray, weight:float);
>>
>> -- manually moved the filter above the group
>> raw_data = filter raw_data by id == '1';
>>
>> group_id = group raw_data by id;
>>
>> count_spec_id = foreach group_id generate COUNT(raw_data) as tot;
>>
>> -- make sure TOP only needs scalars and the grouped bag
>> sample_id = foreach group_id {
>>  generate TOP( ((int)count_spec_id.tot)/2, 1,  raw_data);
>> }
>>
>>
>> 2011/12/4 唐亮 <[email protected]>:
>> > Thank you Thejas Nair !
>> >
>> > But I find the TOP operator works extremely slowly.
>> >
>> > And could you give me an example that uses variables in LIMIT?
>> >
>> > My Pig version is:
>> > $ pig -version
>> > Apache Pig version 0.8.0-cdh3u0 (rexported)
>> > compiled Mar 25 2011, 16:16:24
>> >
>> >
>> > 2011/12/3 Thejas Nair <[email protected]>
>> >
>> >> Is this what you want ? (using TOP and COUNT).
>> >>
>> >>
>> >> raw_data = load ... as (id:chararray, weight:float);
>> >> group_id = group raw_data by id;
>> >>
>> >> filter_spec_id = filter group_id by group == '1';
>> >> -- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
>> >>
>> >>
>> >> sample_id = foreach filter_spec_id {
>> >>  order_weight = order raw_data by weight desc;
>> >>  limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
>> >>  generate limit_id;
>> >> }
>> >>
>> >> ---------
>> >>
>> >> The use of variables will be supported for limit in 0.10. But it is
>> >> supported only for scalar[1] variables. See
>> >> https://issues.apache.org/jira/browse/PIG-1926
>> >>
>> >> [1] See 'Casting Relations to Scalars' in
>> >> http://pig.apache.org/docs/r0.9.1/basic.html
>> >>
>> >> It should be possible to add support for other variables in case of
>> >> limit in nested foreach statement.
>> >> But the way you used it can't be supported if there are multiple records
>> >> in count_spec_id, as the limit variable comes from a different relation,
>> >> and pig does not know which value from that relation should be used in
>> >> the limit.
>> >>
>> >> -Thejas
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 12/2/11 5:45 PM, 唐亮 wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> The Pig code is as follows:
>> >>>
>> >>> raw_data = load ... as (id:chararray, weight:float);
>> >>> group_id = group raw_data by id;
>> >>>
>> >>> filter_spec_id = filter group_id by group == '1';
>> >>> count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
>> >>>
>> >>> sample_id = foreach filter_spec_id {
>> >>>   order_weight = order raw_data by weight desc;
>> >>>   limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
>> >>>
>> >>>   generate limit_id;
>> >>> }
>> >>>
>> >>> The compiler complains that limit should be followed by <INTEGER>.
>> >>> So, how can I limit the relation with a variable?
>> >>>
>> >>>
>> >>
>>
