Hi Prashant,

You're partially correct. In addition to the distinct values of the 1st column, I also need to grab the other columns, one record per distinct value. Referring back to my original data set, the output should be:

10,234324234,NAME 1
20,397383737,NAME 2
30,439378283,NAME 3
40,439837434,NAME 4

Thanks.

On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi <[email protected]> wrote:

> Hi Michael,
>
> If I understand correctly you are trying to get the distinct 1st column
> elements from the dataset? Something like this:
>
> grunt> A = load 'aaa' using PigStorage(',');
> grunt> B = foreach A GENERATE $0;
> grunt> C = DISTINCT B;
> grunt> DUMP C;
>
> Thanks,
> Prashant
>
> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've got a dataset as below:
>>
>> 10,234324234,NAME 1,3
>> 10,346464646,NAME 1,3
>> 10,438389232,NAME 1,3
>> 20,397383737,NAME 2,4
>> 20,383783234,NAME 2,4
>> 20,387382828,NAME 2,4
>> 20,309323333,NAME 2,4
>> 30,439378283,NAME 3,2
>> 30,010191923,NAME 3,2
>> 40,439837434,NAME 4,4
>> 40,383723443,NAME 4,4
>> 40,100182321,NAME 4,4
>> 40,992173732,NAME 4,4
>>
>> I'd like to just print out the distinct records by column 1. Here's
>> what I have:
>>
>> A = group FULL by $0;
>>
>> B = foreach FULL {
>> C0 = FULL.$0;
>> UC0 = DISTINCT C0;
>> generate group, COUNT(UC0);
>> };
>>
>> The script above prints out only the first column and count (not
>> really required). But I need to print out just a single tuple for
>> each of the distinct row.
>>
>> Is this possible?
>>
>> Any help is greatly appreciated.
>>
>>
>> Thanks!
>>
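
For reference, one possible way to get a single tuple per distinct value of column 1 is a nested LIMIT inside a FOREACH over the grouped relation. This is only a sketch, not a confirmed answer from the thread. It assumes the data is in a file called 'aaa' (the name used in Prashant's example), that FULL is the loaded relation, and that only the first three columns are wanted. The relation names G, R and one are made up for illustration. Note that LIMIT without an ORDER BY does not guarantee which tuple of each group is kept, so the rows returned may differ from the ones listed above.

```pig
-- Load the comma-separated data (file name 'aaa' assumed from the earlier reply)
FULL = LOAD 'aaa' USING PigStorage(',');

-- Group all records by the first column
G = GROUP FULL BY $0;

-- For each group, keep one arbitrary tuple and project its first three columns
R = FOREACH G {
    one = LIMIT FULL 1;
    GENERATE FLATTEN(one.($0, $1, $2));
};

DUMP R;
```

If a specific record per group is needed (for example the one with the smallest second column), an ORDER BY inside the nested block before the LIMIT would make the choice deterministic.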
