Re: Can I pass an entire relation to a Pig UDF?

Arun A K Tue, 26 Apr 2011 20:50:48 -0700

Thanks Jacob.

I wonder if it is possible to get the rank of each record, or say a row
number, using Pig. Or do I need an external driver, such as a shell script,
that augments the sorted output from Pig with a rank?


Thanks
Arun



On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <[email protected]> wrote:

> What you've indicated does require access to the whole relation at once
> or at least a way of incrementing a counter and assigning its value to
> each tuple. This kind of shared/synchronized state isn't possible with
> Pig at the moment as far as I know.
>
> --jacob
> @thedatachef
>
> On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
> > Thanks Jacob for the response.
> >
> > If I run the UDF on each tuple then how can I preserve the state of the
> > rank variable? I mean the UDF won't be able to save the rank value between
> > calls, right? Correct me if I am wrong in interpreting that the UDF would be
> > invoked for each tuple.
> >
> > What I am looking for in my output is an additional column indicating the
> > rank. Something like
> >
> > Hick    35      1
> > Jimmy   30      2
> > Jack    25      3
> > Tampa   22      4
> > Sam     20      5
> >
> > Thanks.
> >
> > Arun
> >
> >
> > On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[email protected]> wrote:
> >
> > > The question is, do you need the entire relation all at once to assign
> > > a rank? If so then map-reduce may not be the answer. If not, why not just
> > > run the UDF on each tuple of the relation, one at a time, with a
> > > projection?
> > >
> > > If you need some global information, such as the max and min score,
> > > then you might look at the MAX and MIN operations. They do require a GROUP
> > > ALL but are algebraic so it's not actually going to bring all the data
> > > to one machine as it otherwise would.
> > >
> > > --jacob
> > > @thedatachef
> > >
> > >
> > > On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> > > > Hi
> > > >
> > > > I have the following input relation:
> > > > Name    Score
> > > > Jack    25
> > > > Jimmy   30
> > > > Sam     20
> > > > Hick    35
> > > > Tampa   22
> > > >
> > > > My goal is to rank the tuples by score.
> > > >
> > > > Pig script:
> > > >
> > > > sample_data = LOAD 'sample.txt' USING PigStorage()
> > > >     AS (name:chararray, score:int);
> > > > sample_data_group = GROUP sample_data BY score;
> > > > sample_data_count = FOREACH sample_data_group GENERATE group AS score,
> > > >     COUNT(sample_data.name) AS countVal;
> > > > sample_data_order = ORDER sample_data_count BY score DESC;
> > > > sample_data_group_all = GROUP sample_data_order all;
> > > > sample_data_project = FOREACH sample_data_group_all GENERATE
> > > >     FLATTEN(myUDF.Rank(sample_data_order));
> > > > dump sample_data_project;
> > > >
> > > > Can someone please point me to a UDF example where a relation is read
> > > > in and iterated over all its tuples? I plan to iterate over the tuples and
> > > > assign a rank to each of them based on the score value.
> > > >
> > > > Is there any other way to generate rank?
> > > >
> > > > Thanks much.
> > > >
> > > > Arun
> > >
> > >
> > >
>
>
>
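
For readers following the thread, here is a rough sketch of the ranking logic that a UDF applied to the single bag produced by GROUP ... ALL might carry out. It is plain Python rather than a working Pig UDF: the bag is modeled as a list of (score, count) tuples, the name rank_bag is hypothetical, and wiring it into Pig is not shown. The point it illustrates is that once the whole bag arrives in one call, the rank counter is local to that call, so no state has to be preserved between invocations. The function sorts the bag itself, on the assumption that the order of tuples inside a grouped bag should not be relied on.

```python
def rank_bag(bag):
    """Return (score, count, rank) tuples, highest score first.

    `bag` stands in for the one bag a GROUP ... ALL would hand to a UDF:
    an iterable of (score, count) tuples in no particular order.
    """
    # Sort here instead of trusting the order in which tuples arrive.
    ordered = sorted(bag, key=lambda t: t[0], reverse=True)
    # The rank counter lives inside this single call, so nothing needs
    # to be remembered from one invocation to the next.
    return [(score, count, rank)
            for rank, (score, count) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Scores from the sample input in the thread, one name per score.
    sample = [(25, 1), (30, 1), (20, 1), (35, 1), (22, 1)]
    for row in rank_bag(sample):
        print(row)
```

As Jacob's replies suggest, this approach means the entire relation ends up in one place, so it is probably only reasonable when the grouped data is small enough to be handled by a single task.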
