If the whole set is not that big, sorting in the shell might be the easiest option. I've done that with result sets of millions of records.
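
A minimal sketch of what that could look like, assuming the Pig output has already been copied to the local machine as tab-separated name/score lines (tab is PigStorage's default delimiter). The function name rank_by_score is made up for illustration; sort and awk are the standard tools. Ties get consecutive row numbers here rather than a shared rank.

```shell
# rank_by_score: hypothetical helper, not part of Pig.
# Reads tab-separated "name<TAB>score" lines on stdin, sorts them by
# score in descending numeric order, and appends the 1-based row
# number as a third tab-separated column.
rank_by_score() {
  sort -t "$(printf '\t')" -k2,2nr | awk -v OFS='\t' '{ print $0, NR }'
}

# Example run on the sample data from the thread.
printf 'Jack\t25\nJimmy\t30\nSam\t20\nHick\t35\nTampa\t22\n' | rank_by_score
```

On the sample data this should give Hick 35 1, Jimmy 30 2, Jack 25 3, Tampa 22 4, Sam 20 5, which is the output asked for below. Everything runs on a single machine, so it only makes sense while the sorted result fits comfortably there.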

On Apr 26, 2011, at 8:49 PM, Arun A K <[email protected]> wrote:

> Thanks Jacob.
>
> I wonder if it is possible to get the rank of each record or say row number
> using Pig. Or do I need to have an external driver like a shell script which
> augments the sorted output from Pig with a rank?
>
> Thanks
> Arun
>
> On Tue, Apr 26, 2011 at 7:54 PM, Jacob Perkins <[email protected]> wrote:
>
>> What you've indicated does require access to the whole relation at once
>> or at least a way of incrementing a counter and assigning its value to
>> each tuple. This kind of shared/synchronized state isn't possible with
>> Pig at the moment as far as I know.
>>
>> --jacob
>> @thedatachef
>>
>> On Tue, 2011-04-26 at 19:43 -0700, Arun A K wrote:
>>> Thanks Jacob for the response.
>>>
>>> If I run the UDF on each tuple then how can I preserve the state of the
>>> rank variable. I mean the UDF won't be able to save the rank value between
>>> calls, right? Correct me if I am wrong in interpreting that the UDF would
>>> be invoked for each tuple.
>>>
>>> What I am looking in my output is an additional column indicating the
>>> rank. Something like
>>>
>>> Hick 35 1
>>> Jimmy 30 2
>>> Jack 25 3
>>> Tampa 22 4
>>> Sam 20 5
>>>
>>> Thanks.
>>>
>>> Arun
>>>
>>> On Tue, Apr 26, 2011 at 7:18 PM, Jacob Perkins <[email protected]> wrote:
>>>
>>>> The question is, do you need the entire relation all at once to assign a
>>>> rank? If so then map-reduce may not be the answer. If not, why not just
>>>> run the UDF on each tuple of the relation, one at a time, with a
>>>> projection?
>>>>
>>>> If you need some global information, such as the max and min score, then
>>>> you might look at the MAX and MIN operations. They do require a GROUP
>>>> ALL but are algebraic so it's not actually going to bring all the data
>>>> to one machine as it otherwise would.
>>>>
>>>> --jacob
>>>> @thedatachef
>>>>
>>>> On Tue, 2011-04-26 at 19:07 -0700, Arun A K wrote:
>>>>> Hi
>>>>>
>>>>> I have the following input relation:
>>>>> Name Score
>>>>> Jack 25
>>>>> Jimmy 30
>>>>> Sam 20
>>>>> Hick 35
>>>>> Tampa 22
>>>>>
>>>>> My goal is to rank the tuples by score.
>>>>>
>>>>> Pig script:
>>>>>
>>>>> sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
>>>>> sample_data_group = GROUP sample_data BY score;
>>>>> sample_data_count = FOREACH sample_data_group GENERATE group AS score, COUNT(sample_data.name) AS countVal;
>>>>> sample_data_order = ORDER sample_data_count BY score DESC;
>>>>> sample_data_group_all = GROUP sample_data_order all;
>>>>> sample_data_project = FOREACH sample_data_group_all GENERATE FLATTEN(myUDF.Rank(sample_data_order));
>>>>> dump sample_data_project;
>>>>>
>>>>> Can someone please point me to a UDF example where a relation is read in
>>>>> and iterated over all its tuples? I plan to iterate over the tuples and
>>>>> assign a rank to each of them based on the score value.
>>>>>
>>>>> Is there any other way to generate rank?
>>>>>
>>>>> Thanks much.
>>>>>
>>>>> Arun
