a = load 'thing' as (x:int,y:int,z:int);
b = group a all;
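describe b; -- b has two fields: group, and a bag named a that holds every tuple
c = foreach b generate TOP(10,0,a);

The 0 is the index of the field to compare on, counting from 0 (in this case, x).
Note that the third argument is the bag inside b, which is named a after the
relation that was grouped, not b itself.

You could alternatively do

c = foreach b generate TOP(10,1,a);

to compare by y, and

c = foreach b generate TOP(10,2,a);

to compare by z.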

It's on my todo list to let you specify a column name instead of having to
give an index, but for now this is how it goes.
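
The same call also works per group rather than over the whole relation. A
minimal sketch, assuming the same 'thing' input as above; the aliases by_x,
top_z and x_key are made up for illustration, and the 3 is an arbitrary count:

```pig
-- Hypothetical input: file 'thing' with three int columns.
a = load 'thing' as (x:int, y:int, z:int);

-- One bag of tuples per distinct x; the bag is named a, after the grouped relation.
by_x = group a by x;

-- For each x, keep up to 3 tuples with the largest z (field index 2),
-- then FLATTEN the resulting bag back into individual rows.
top_z = foreach by_x generate group as x_key, FLATTEN(TOP(3, 2, a));

dump top_z;
```

FLATTEN is only there to turn the bag that TOP returns back into rows; leave it
out if you want to keep one bag per group.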

2011/12/5 唐亮 <[email protected]>

> What does the second parameter mean?
>
>
> On 2011-12-06 at 6:27 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > TOP ($number_of_elements_to_keep, $index_of_field_to_compare,
> > $bag_of_tuples)
> >
> > 2011/12/5 唐亮 <[email protected]>:
> > > Can you tell me the parameters' meanings of TOP operator?
> > >
> > > On 2011-12-05 at 3:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > >
> > >> That's what top does.. it returns max n (without doing a total sort).
> > >>
> > >> D
> > >>
> > >> 2011/12/4 唐亮 <[email protected]>:
> > >> > Thanks.
> > >> > I will try your codes.
> > >> >
> > >> > But my requirement is that:
> > >> > Sort records by weight in descending order.
> > >> > And then select the top half records.
> > >> >
> > >> > How can I implement the requirement?
> > >> >
> > >> >
> > >> > On 2011-12-05 at 3:30 AM, Dmitriy Ryaboy <[email protected]> wrote:
> > >> >
> > >> >> TOP is faster than order + limit if you call it in a way that doesn't
> > >> >> require the whole bag to be materialized on the reducer, which this
> > >> >> script does (on the map side, top does not yet know the size of the
> > >> >> bag, so it doesn't know how many elements to keep).
> > >> >>
> > >> >> Fiddle with your script until you see algebraic invocation happening
> > >> >> -- you probably need to move the filter above the group, for example.
> > >> >>
> > >> >> Something like this is a start:
> > >> >>
> > >> >> raw_data = load ... as (id:chararray, weight:float);
> > >> >>
> > >> >> -- manually moved the filter above the group
> > >> >> raw_data = filter raw_data by id == '1';
> > >> >>
> > >> >> group_id = group raw_data by id;
> > >> >>
> > >> >> count_spec_id = foreach group_id generate COUNT(raw_data) as tot;
> > >> >>
> > >> >> -- make sure TOP only needs scalars and the grouped bag
> > >> >> sample_id = foreach group_id {
> > >> >>  generate TOP( ((int)count_spec_id.tot)/2, 1,  raw_data);
> > >> >> }
> > >> >>
> > >> >>
> > >> >> 2011/12/4 唐亮 <[email protected]>:
> > >> >> > Thank you Thejas Nair !
> > >> >> >
> > >> >> > But I find the TOP operator works extremely slowly.
> > >> >> >
> > >> >> > And could you give me an example that uses variables in LIMIT?
> > >> >> >
> > >> >> > My pig's version is:
> > >> >> > $ pig -version
> > >> >> > Apache Pig version 0.8.0-cdh3u0 (rexported)
> > >> >> > compiled Mar 25 2011, 16:16:24
> > >> >> >
> > >> >> >
> > >> >> > 2011/12/3 Thejas Nair <[email protected]>
> > >> >> >
> > >> >> >> Is this what you want ? (using TOP and COUNT).
> > >> >> >>
> > >> >> >>
> > >> >> >> raw_data = load ... as (id:chararray, weight:float);
> > >> >> >> group_id = group raw_data by id;
> > >> >> >>
> > >> >> >> filter_spec_id = filter group_id by group == '1';
> > >> >> >> -- COMMENTED OUT - count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> > >> >> >>
> > >> >> >>
> > >> >> >> sample_id = foreach filter_spec_id {
> > >> >> >>  order_weight = order raw_data by weight desc;
> > >> >> >>  limit_id = TOP((int)SIZE(raw_data)/2, 1, order_weight);
> > >> >> >>  generate limit_id;
> > >> >> >> }
> > >> >> >>
> > >> >> >> ---------
> > >> >> >>
> > >> >> >> The use of variables will be supported for limit in 0.10. But it is
> > >> >> >> supported only for scalar[1] variables. See
> > >> >> >> https://issues.apache.org/jira/browse/PIG-1926
> > >> >> >>
> > >> >> >> [1] See 'Casting Relations to Scalars' in
> > >> >> >> http://pig.apache.org/docs/r0.9.1/basic.html
> > >> >> >>
> > >> >> >> It should be possible to add support for other variables in case of limit
> > >> >> >> in nested foreach statement.
> > >> >> >> But the way you used it can't be supported if there are multiple records
> > >> >> >> in count_spec_id, as the limit variable comes from a different relation,
> > >> >> >> and pig does not know which value from that relation should be used in the
> > >> >> >> limit.
> > >> >> >>
> > >> >> >> -Thejas
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On 12/2/11 5:45 PM, 唐亮 wrote:
> > >> >> >>
> > >> >> >>> Hi,
> > >> >> >>>
> > >> >> >>> The Pig code is as follows:
> > >> >> >>>
> > >> >> >>> raw_data = load ... as (id:chararray, weight:float);
> > >> >> >>> group_id = group raw_data by id;
> > >> >> >>>
> > >> >> >>> filter_spec_id = filter group_id by group == '1';
> > >> >> >>> count_spec_id = foreach filter_spec_id generate COUNT(raw_data) as tot;
> > >> >> >>>
> > >> >> >>> sample_id = foreach filter_spec_id {
> > >> >> >>>   order_weight = order raw_data by weight desc;
> > >> >> >>>   limit_id = limit order_weight (int)count_spec_id.tot/2; -- *It's the problem*
> > >> >> >>>
> > >> >> >>>   generate limit_id;
> > >> >> >>> }
> > >> >> >>>
> > >> >> >>> The compiler complains that limit should be followed by <INTEGER>.
> > >> >> >>> So, how can I limit the relation with a variable?
> > >> >> >>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >>
> >
>
