Hi Grig,

Thanks for the code. Works like a charm!
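
One caveat on the quoted script, for the archive: MAX is applied to id2 and name independently, so the two values in a result row are not guaranteed to come from the same input record, and the rows for 10 and 40 differ from the first-record output asked for earlier in the thread. A minimal sketch that keeps one whole record per id1 instead, assuming the same TEST relation as in the quoted script and a Pig version that accepts LIMIT inside a nested FOREACH (the aliases ONE, ONE_ROW and R are only illustrative):

```pig
-- Assumes TEST is loaded exactly as in the quoted script below.
G = GROUP TEST BY id1;

ONE_ROW = FOREACH G {
    -- Keep a single tuple from each group's bag. Which tuple survives is
    -- not guaranteed; add a nested ORDER before the LIMIT if that matters.
    ONE = LIMIT TEST 1;
    GENERATE FLATTEN(ONE);
};

-- Drop the trailing cnt column so only id1, id2 and name remain.
R = FOREACH ONE_ROW GENERATE $0, $1, $2;
DUMP R;
```

Because the three fields come from a single tuple, id2 and name always belong to the same input record, which the MAX-based version cannot promise.
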
On Fri, Jan 27, 2012 at 6:23 AM, Grig Gheorghiu <[email protected]> wrote:
> If you just want to obtain a sample value for column #1 and column #2
> associated with each unique value for column #0, this worked for me on
> your dataset:
>
> TEST = LOAD 'test.txt.gz' USING PigStorage(',') as (
>     id1: chararray,
>     id2: chararray,
>     name: chararray,
>     cnt: chararray);
>
> G = GROUP TEST BY id1;
> R = FOREACH G GENERATE FLATTEN(group), MAX(TEST.id2), MAX(TEST.name);
> DUMP R;
>
> I got
>
> (10,438389232,NAME 1)
> (20,397383737,NAME 2)
> (30,439378283,NAME 3)
> (40,992173732,NAME 4)
>
> You can replace MAX with some other aggregate function that works on
> strings or numbers.
>
> Grig
>
> On Tue, Jan 24, 2012 at 11:55 PM, Michael Lok <[email protected]> wrote:
>> Hi Prashant,
>>
>> You're partially correct. Based on the distinct values of the 1st
>> column, I also need to grab the other columns for each distinct
>> record. Referring back to my original data set; the output should be:
>>
>> 10,234324234,NAME 1
>> 20,397383737,NAME 2
>> 30,439378283,NAME 3
>> 40,439837434,NAME 4
>>
>> Thanks.
>>
>> On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>>> Hi Michael,
>>>
>>> If I understand correctly you are trying to get the distinct 1st column
>>> elements from the dataset? Something like this:
>>>
>>> grunt> A = load 'aaa' using PigStorage(',');
>>> grunt> B = foreach A GENERATE $0;
>>> grunt> C = DISTINCT B;
>>> grunt> DUMP C;
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I've got a dataset as below:
>>>>
>>>> 10,234324234,NAME 1,3
>>>> 10,346464646,NAME 1,3
>>>> 10,438389232,NAME 1,3
>>>> 20,397383737,NAME 2,4
>>>> 20,383783234,NAME 2,4
>>>> 20,387382828,NAME 2,4
>>>> 20,309323333,NAME 2,4
>>>> 30,439378283,NAME 3,2
>>>> 30,010191923,NAME 3,2
>>>> 40,439837434,NAME 4,4
>>>> 40,383723443,NAME 4,4
>>>> 40,100182321,NAME 4,4
>>>> 40,992173732,NAME 4,4
>>>>
>>>> I'd like to just print out the distinct records by column 1. Here's
>>>> what I have:
>>>>
>>>> A = group FULL by $0;
>>>>
>>>> B = foreach FULL {
>>>>     C0 = FULL.$0;
>>>>     UC0 = DISTINCT C0;
>>>>     generate group, COUNT(UC0);
>>>> };
>>>>
>>>> The script above prints out only the first column and count (not
>>>> really required). But I need to print out just a single tuple for
>>>> each of the distinct row.
>>>>
>>>> Is this possible?
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>> Thanks!
>>>>
