Is this what you want? (using TOP and SIZE)

raw_data = load ... as (id:chararray, weight:float);
group_id = group raw_data by id;

filter_spec_id = filter group_id by group == '1';
-- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;

sample_id = foreach filter_spec_id {
  order_weight = order raw_data by weight desc;
  limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
  generate limit_id;
}

---------

Using a variable with limit will be supported in 0.10, but only for scalar[1] variables. See https://issues.apache.org/jira/browse/PIG-1926

[1] see 'Casting Relations to Scalars' in http://pig.apache.org/docs/r0.9.1/basic.html
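
As a rough, untested sketch of what the scalar form could look like at the top level in 0.10 (not inside a nested foreach): the load path 'weights.txt' and the aliases spec_id and grp_all are made up for illustration, and it assumes count_spec_id ends up with exactly one record so that it can be used as a scalar.

```pig
-- Hypothetical input path; substitute your own load statement.
raw_data = load 'weights.txt' as (id:chararray, weight:float);

-- Keep only the id of interest, so the count below is a single record.
spec_id = filter raw_data by id == '1';

-- group ... all yields one group, so count_spec_id has exactly one row
-- and count_spec_id.tot can be used as a scalar.
grp_all = group spec_id all;
count_spec_id = foreach grp_all generate COUNT(spec_id) as tot;

-- Top-level order + limit, with the limit taken from the scalar (0.10 and later).
order_weight = order spec_id by weight desc;
limit_id = limit order_weight count_spec_id.tot/2;
```

This only covers a single id. For a per-group limit inside a nested foreach, the TOP approach at the start of this mail is the one that works on current releases.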

It should be possible to add support for other variables when limit is used in a nested foreach statement. But the way you used it can't be supported if count_spec_id has multiple records: the limit variable comes from a different relation, and Pig does not know which value from that relation to use in the limit.

-Thejas





On 12/2/11 5:45 PM, 唐亮 wrote:
Hi,

The Pig code is as follows:

raw_data = load ... as (id:chararray, weight:float);
group_id = group raw_data by id;

filter_spec_id = filter group_id by group == '1';
count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;

sample_id = foreach filter_spec_id {
   order_weight = order raw_data by weight desc;
   limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
   generate limit_id;
}

The compiler complains that limit should be followed by <INTEGER>.
So, how can I limit a relation with a variable?

