Thanks, I will try your code. But my requirement is: sort the records by weight in descending order, then select the top half of the records.
How can I implement this requirement?

On December 5, 2011 at 3:30 AM, Dmitriy Ryaboy <[email protected]> wrote:
> TOP is faster than order + limit if you call it in a way that doesn't
> require the whole bag to be materialized on the reducer, which this
> script does (on the map side, top does not yet know the size of the
> bag, so it doesn't know how many elements to keep).
>
> Fiddle with your script until you see algebraic invocation happening
> -- you probably need to move the filter above the group, for example.
>
> Something like this is a start:
>
> raw_data = load ... as (id:chararray, weight:float);
>
> -- manually moved the filter above the group
> raw_data = filter raw_data by id == '1';
>
> group_id = group raw_data by id;
>
> count_spec_id = foreach group_id generate COUNT(raw_data) as tot;
>
> -- make sure TOP only needs scalars and the grouped bag
> sample_id = foreach group_id {
>   generate TOP( ((int)count_spec_id.tot)/2, 1, raw_data);
> }
>
>
> 2011/12/4 唐亮 <[email protected]>:
> > Thank you Thejas Nair !
> >
> > But I find the TOP operator works extremely slowly.
> >
> > And could you give me an example that uses variables in LIMIT?
> >
> > My pig's version is:
> > $ pig -version
> > Apache Pig version 0.8.0-cdh3u0 (rexported)
> > compiled Mar 25 2011, 16:16:24
> >
> >
> > 2011/12/3 Thejas Nair <[email protected]>
> >
> >> Is this what you want ? (using TOP and COUNT).
> >>
> >>
> >> raw_data = load ... as (id:chararray, weight:float);
> >> group_id = group raw_data by id;
> >>
> >> filter_spec_id = filter group_id by group == '1';
> >> -- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> >>
> >>
> >> sample_id = foreach filter_spec_id {
> >>   order_weight = order raw_data by weight desc;
> >>   limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
> >>   generate limit_id;
> >> }
> >>
> >> ---------
> >>
> >> The use of variables will be supported for limit in 0.10. But it is
> >> supported only for scalar[1] variables.
> >> see - https://issues.apache.org/jira/browse/PIG-1926
> >>
> >> [1] see 'Casting Relations to Scalars' in
> >> http://pig.apache.org/docs/r0.9.1/basic.html
> >>
> >> It should be possible to add support for other variables in case of limit
> >> in nested foreach statement.
> >> But the way you used it can't be supported if there are multiple records
> >> in count_spec_id, as the limit variable comes from a different relation,
> >> and pig does not know which value from that relation should be used in the
> >> limit.
> >>
> >> -Thejas
> >>
> >>
> >> On 12/2/11 5:45 PM, 唐亮 wrote:
> >>
> >>> Hi,
> >>>
> >>> The Pig code is as follows:
> >>>
> >>> raw_data = load ... as (id:chararray, weight:float);
> >>> group_id = group raw_data by id;
> >>>
> >>> filter_spec_id = filter group_id by group == '1';
> >>> count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> >>>
> >>> sample_id = foreach filter_spec_id {
> >>>   order_weight = order raw_data by weight desc;
> >>>   limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
> >>>
> >>>   generate limit_id;
> >>> }
> >>>
> >>> The compiler complains that limit should be followed by <INTEGER>.
> >>> So, how can I limit the relation with a variable?
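
For the requirement stated at the top of this thread (sort by weight in descending order, then keep the top half), here is a minimal sketch. It assumes a Pig release in which LIMIT accepts a scalar expression, which Thejas says is planned for 0.10 (PIG-1926), so it is not expected to run on the 0.8.0-cdh3u0 build mentioned above. The input path 'weights.tsv' and the aliases spec, spec_all, spec_count, sorted_spec and top_half are hypothetical names chosen for illustration. The thread itself elides the load path.

```pig
-- 'weights.tsv' is a placeholder path; the thread only shows "load ...".
raw_data = load 'weights.tsv' as (id:chararray, weight:float);

-- Filter first, as Dmitriy suggests, so later steps see only id '1'.
spec = filter raw_data by id == '1';

-- Count the filtered records; "group ... all" yields a single row,
-- so spec_count can be used as a scalar below.
spec_all = group spec all;
spec_count = foreach spec_all generate COUNT(spec) as tot;

-- Sort by weight, largest first, then keep the first half.
-- Assumes LIMIT accepts a scalar expression (Pig 0.10 or later).
sorted_spec = order spec by weight desc;
top_half = limit sorted_spec spec_count.tot / 2;

dump top_half;
```

Because spec_count has exactly one row, the ambiguity Thejas describes (Pig not knowing which value of a multi-record relation to use) should not arise here. With an odd number of records the integer division rounds down, so the middle record is dropped.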
