What does the second parameter mean?
On December 6, 2011 at 6:27 AM, Dmitriy Ryaboy <[email protected]> wrote:

> TOP($number_of_elements_to_keep, $index_of_field_to_compare,
> $bag_of_tuples)
>
> 2011/12/5 唐亮 <[email protected]>:
> > Can you tell me what the parameters of the TOP operator mean?
> >
> > On December 5, 2011 at 3:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> That's what top does.. it returns max n (without doing a total sort).
> >>
> >> D
> >>
> >> 2011/12/4 唐亮 <[email protected]>:
> >> > Thanks.
> >> > I will try your code.
> >> >
> >> > But my requirement is:
> >> > Sort records by weight in descending order.
> >> > And then select the top half of the records.
> >> >
> >> > How can I implement the requirement?
> >> >
> >> >
> >> > On December 5, 2011 at 3:30 AM, Dmitriy Ryaboy <[email protected]> wrote:
> >> >
> >> >> TOP is faster than order + limit if you call it in a way that doesn't
> >> >> require the whole bag to be materialized on the reducer, which this
> >> >> script does (on the map side, top does not yet know the size of the
> >> >> bag, so it doesn't know how many elements to keep).
> >> >>
> >> >> Fiddle with your script until you see algebraic invocation happening
> >> >> -- you probably need to move the filter above the group, for example.
> >> >>
> >> >> Something like this is a start:
> >> >>
> >> >> raw_data = load ... as (id:chararray, weight:float);
> >> >>
> >> >> -- manually moved the filter above the group
> >> >> raw_data = filter raw_data by id == '1';
> >> >>
> >> >> group_id = group raw_data by id;
> >> >>
> >> >> count_spec_id = foreach group_id generate COUNT(raw_data) as tot;
> >> >>
> >> >> -- make sure TOP only needs scalars and the grouped bag
> >> >> sample_id = foreach group_id {
> >> >>     generate TOP( ((int)count_spec_id.tot)/2, 1, raw_data);
> >> >> }
> >> >>
> >> >>
> >> >> 2011/12/4 唐亮 <[email protected]>:
> >> >> > Thank you Thejas Nair !
> >> >> >
> >> >> > But I find the TOP operator works extremely slowly.
> >> >> >
> >> >> > And could you give me an example that uses variables in LIMIT?
> >> >> >
> >> >> > My pig's version is:
> >> >> > $ pig -version
> >> >> > Apache Pig version 0.8.0-cdh3u0 (rexported)
> >> >> > compiled Mar 25 2011, 16:16:24
> >> >> >
> >> >> >
> >> >> > 2011/12/3 Thejas Nair <[email protected]>
> >> >> >
> >> >> >> Is this what you want ? (using TOP and COUNT).
> >> >> >>
> >> >> >>
> >> >> >> raw_data = load ... as (id:chararray, weight:float);
> >> >> >> group_id = group raw_data by id;
> >> >> >>
> >> >> >> filter_spec_id = filter group_id by group == '1';
> >> >> >> -- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> >> >> >>
> >> >> >>
> >> >> >> sample_id = foreach filter_spec_id {
> >> >> >>     order_weight = order raw_data by weight desc;
> >> >> >>     limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
> >> >> >>     generate limit_id;
> >> >> >> }
> >> >> >>
> >> >> >> ---------
> >> >> >>
> >> >> >> The use of variables will be supported for limit in 0.10. But it is
> >> >> >> supported only for scalar[1] variables. See
> >> >> >> https://issues.apache.org/jira/browse/PIG-1926
> >> >> >>
> >> >> >> [1] see 'Casting Relations to Scalars' in
> >> >> >> http://pig.apache.org/docs/r0.9.1/basic.html
> >> >> >>
> >> >> >> It should be possible to add support for other variables in case of
> >> >> >> limit in nested foreach statement.
> >> >> >> But the way you used it can't be supported if there are multiple
> >> >> >> records in count_spec_id, as the limit variable comes from a different
> >> >> >> relation, and pig does not know which value from that relation should
> >> >> >> be used in the limit.
> >> >> >>
> >> >> >> -Thejas
> >> >> >>
> >> >> >>
> >> >> >> On 12/2/11 5:45 PM, 唐亮 wrote:
> >> >> >>
> >> >> >>> Hi,
> >> >> >>>
> >> >> >>> The pig code is as below:
> >> >> >>>
> >> >> >>> raw_data = load ... as (id:chararray, weight:float);
> >> >> >>> group_id = group raw_data by id;
> >> >> >>>
> >> >> >>> filter_spec_id = filter group_id by group == '1';
> >> >> >>> count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> >> >> >>>
> >> >> >>> sample_id = foreach filter_spec_id {
> >> >> >>>     order_weight = order raw_data by weight desc;
> >> >> >>>     limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
> >> >> >>>     generate limit_id;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The compiler complains that limit should be followed by <INTEGER>.
> >> >> >>> So, how can I limit the relation with a variable?
> >> >> >>>
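For reference, here is a minimal sketch of how the three TOP arguments quoted above line up with the (id, weight) schema used in this thread. The input path 'weights.tsv', the aliases grouped and top3, and the value 3 are made up for illustration. It assumes TOP counts fields from 0, so with this schema index 0 is id and index 1 is weight.

```pig
-- Hypothetical input path; the schema is the one used in the thread.
raw_data = LOAD 'weights.tsv' AS (id:chararray, weight:float);

grouped = GROUP raw_data BY id;

-- TOP(number_of_elements_to_keep, index_of_field_to_compare, bag_of_tuples)
-- Assumption: field indexes start at 0, so 1 selects weight (0 would be id).
-- For each id, keep the 3 tuples with the largest weight.
top3 = FOREACH grouped GENERATE group, TOP(3, 1, raw_data);

DUMP top3;
```

Changing the second argument to 0 would compare tuples by id instead, which is why the scripts in the thread pass 1 when they want the heaviest records.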
