Is this what you want? (It uses TOP with SIZE; the separate COUNT is commented out.)
raw_data = load ... as (id:chararray, weight:float);
group_id = group raw_data by id;
filter_spec_id = filter group_id by group == '1';
-- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
sample_id = foreach filter_spec_id {
    order_weight = order raw_data by weight desc;
    -- TOP(n, column, bag): keep the n tuples with the largest values in column 1 (weight)
    limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
    generate limit_id;
}
---------
Using a variable as the limit will be supported in Pig 0.10, but only
for scalar [1] variables. See
https://issues.apache.org/jira/browse/PIG-1926
[1] See 'Casting Relations to Scalars' in
http://pig.apache.org/docs/r0.9.1/basic.html
It should be possible to add support for other variables when limit is
used inside a nested foreach statement.
But the way you used it cannot be supported if count_spec_id contains
multiple records: the limit variable comes from a different relation,
and Pig does not know which value from that relation to use as the
limit.
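For the single-id case, the scalar form could look roughly like the
sketch below. It assumes Pig 0.10 or later, keeps the elided load from
the original script as-is, and introduces the relation names spec_id
and spec_all purely for illustration. Because the filter already
selects one id, the limit is applied at the top level rather than
inside a nested foreach.

```pig
raw_data = load ... as (id:chararray, weight:float);
-- keep only the rows for the id of interest
spec_id = filter raw_data by id == '1';
-- 'group ... all' yields a single record, so the count can be used as a scalar
spec_all = group spec_id all;
count_spec_id = foreach spec_all generate COUNT(spec_id) as tot;
order_weight = order spec_id by weight desc;
-- scalar expression as the limit (assumes Pig 0.10+, see PIG-1926)
limit_id = limit order_weight count_spec_id.tot/2;
```

Since count_spec_id holds exactly one record here, Pig has no ambiguity
about which value to use for the limit.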
-Thejas
On 12/2/11 5:45 PM, 唐亮 wrote:
Hi,
The Pig code is below:
raw_data = load ... as (id:chararray, weight:float);
group_id = group raw_data by id;
filter_spec_id = filter group_id by group == '1';
count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
sample_id = foreach filter_spec_id {
order_weight = order raw_data by weight desc;
limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
generate limit_id;
}
The compiler complains that limit should be followed by <INTEGER>.
So, how can I limit a relation using a variable?