The question is, do you need the entire relation all at once to assign a rank? If so, then map-reduce may not be the answer. If not, why not just run the UDF on each tuple of the relation, one at a time, with a projection (a FOREACH ... GENERATE)?
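
For the per-tuple case, a rough sketch of what that projection could look like is below. Note that myudfs.jar and myUDF.RankOne are placeholder names made up for illustration (not an existing jar or function); the assumption is a UDF that can compute its result from a single tuple's score.

```pig
-- Hypothetical: myudfs.jar and myUDF.RankOne are placeholders for your own per-tuple UDF.
REGISTER myudfs.jar;

sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);

-- The UDF is called once per tuple; nothing is grouped, so no single
-- machine ever has to hold the whole relation.
sample_ranked = FOREACH sample_data GENERATE name, score, myUDF.RankOne(score) AS score_rank;

DUMP sample_ranked;
```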
If you need some global information, such as the max and min score, then you might look at the MAX and MIN operations. They do require a GROUP ALL but are algebraic, so it's not actually going to bring all the data to one machine as it otherwise would.

--jacob
@thedatachef

On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
> Hi
>
> I have the following input relation:
> Name Score
> Jack 25
> Jimmy 30
> Sam 20
> Hick 35
> Tampa 22
>
> My goal is to rank the tuples by score.
>
> Pig script:
>
> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray,
> score:int);
> sample_data_group = GROUP sample_data BY score;
> sample_data_count = FOREACH sample_data_group GENERATE group AS score,
> COUNT(sample_data.name) AS countVal;
> sample_data_order = ORDER sample_data_count BY score DESC;
> sample_data_group_all = GROUP sample_data_order all;
> sample_data_project = FOREACH sample_data_group_all GENERATE
> FLATTEN(myUDF.Rank(sample_data_order));
> dump sample_data_project;
>
> Can someone please point me to a UDF example where a relation is read in and
> iterated over all its tuples? I plan to iterate over the tuples and assign a
> rank to each of them based on the score value.
>
> Is there any other way to generate rank?
>
> Thanks much.
>
> Arun
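
For the global-information case mentioned in the reply (max and min score), a minimal sketch using the built-in MAX and MIN over a GROUP ALL might look like the following. It reuses the LOAD statement from the quoted script; the alias names all_scores and score_range are just illustrative.

```pig
sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);

-- GROUP ALL puts every tuple into a single group, but because MAX and MIN
-- are algebraic, partial results can be computed map-side and combined.
all_scores = GROUP sample_data ALL;

score_range = FOREACH all_scores GENERATE
    MAX(sample_data.score) AS max_score,
    MIN(sample_data.score) AS min_score;

DUMP score_range;
```

With the five sample rows from the quoted message, this should produce a single tuple holding 35 and 20.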
