Thanks Jacob. I wonder whether it is possible to get the rank of each record (or, say, its row number) using Pig. Or do I need an external driver, such as a shell script, that augments the sorted output from Pig with a rank?
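For reference, the external-driver option could look roughly like the sketch below. It is written in Python rather than shell, and the function name add_rank and the tab separator are assumptions made for illustration, not anything Pig provides. It assumes the lines are already sorted by score (for example, the output of an ORDER ... BY score DESC) and simply appends a 1-based row number as the rank column, so tied scores still get distinct ranks.

```python
def add_rank(lines, sep="\t"):
    """Yield each already-sorted line with a 1-based rank appended as a new column.

    `lines` is any iterable of text lines (a list, an open file, sys.stdin).
    The separator is an assumption; PigStorage() writes tab-delimited text by default.
    """
    rank = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines so they do not consume a rank
        rank += 1
        yield f"{line}{sep}{rank}"


# Sample rows from this thread, already sorted by score in descending order.
sorted_rows = [
    "Hick\t35\n",
    "Jimmy\t30\n",
    "Jack\t25\n",
    "Tampa\t22\n",
    "Sam\t20\n",
]

for ranked in add_rank(sorted_rows):
    print(ranked)
```

Passing sys.stdin instead of the sample list would let the same function sit at the end of a pipeline that streams the sorted output. The counter lives in a single process, which is exactly the shared state that is hard to get inside the Pig job itself.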
Thanks
Arun

On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <[email protected]> wrote:

> What you've indicated does require access to the whole relation at once
> or at least a way of incrementing a counter and assigning its value to
> each tuple. This kind of shared/synchronized state isn't possible with
> Pig at the moment as far as I know.
>
> --jacob
> @thedatachef
>
> On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
> > Thanks Jacob for the response.
> >
> > If I run the UDF on each tuple then how can I preserve the state of the
> > rank variable. I mean the UDF won't be able to save the rank value
> > between calls, right? Correct me if I am wrong in interpreting that the
> > UDF would be invoked for each tuple.
> >
> > What I am looking in my output is an additional column indicating the
> > rank. Something like
> >
> > Hick 35 1
> > Jimmy 30 2
> > Jack 25 3
> > Tampa 22 4
> > Sam 20 5
> >
> > Thanks.
> >
> > Arun
> >
> > On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[email protected]> wrote:
> >
> > > The question is, do you need the entire relation all at once to assign
> > > a rank? If so then map-reduce may not be the answer. If not, why not
> > > just run the UDF on each tuple of the relation, one at a time, with a
> > > projection?
> > >
> > > If you need some global information, such as the max and min score,
> > > then you might look at the MAX and MIN operations. They do require a
> > > GROUP ALL but are algebraic so it's not actually going to bring all
> > > the data to one machine as it otherwise would.
> > >
> > > --jacob
> > > @thedatachef
> > >
> > > On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> > > > Hi
> > > >
> > > > I have the following input relation:
> > > > Name Score
> > > > Jack 25
> > > > Jimmy 30
> > > > Sam 20
> > > > Hick 35
> > > > Tampa 22
> > > >
> > > > My goal is to rank the tuples by score.
> > > >
> > > > Pig script:
> > > >
> > > > sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
> > > > sample_data_group = GROUP sample_data BY score;
> > > > sample_data_count = FOREACH sample_data_group GENERATE group AS score, COUNT(sample_data.name) AS countVal;
> > > > sample_data_order = ORDER sample_data_count BY score DESC;
> > > > sample_data_group_all = GROUP sample_data_order all;
> > > > sample_data_project = FOREACH sample_data_group_all GENERATE FLATTEN(myUDF.Rank(sample_data_order));
> > > > dump sample_data_project;
> > > >
> > > > Can someone please point me to a UDF example where a relation is read
> > > > in and iterated over all its tuples? I plan to iterate over the tuples
> > > > and assign a rank to each of them based on the score value.
> > > >
> > > > Is there any other way to generate rank?
> > > >
> > > > Thanks much.
> > > >
> > > > Arun
