Re: Distinct by column, generate tuple

Michael Lok Tue, 24 Jan 2012 23:57:02 -0800

Hi Prashant,

You're partially correct.  For each distinct value of the 1st
column, I also need to grab the other columns of one matching
record.  Referring back to my original data set, the output should be:


10,234324234,NAME 1
20,397383737,NAME 2
30,439378283,NAME 3
40,439837434,NAME 4
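
For reference, here is a minimal, untested sketch of one way that shape of
output might be produced: group on the 1st column and keep a single tuple
per group with a nested LIMIT.  The path 'aaa' is taken from the quoted
reply below; the aliases (A, B, C, D, top1) are only illustrative, and
which tuple LIMIT keeps within a group is not guaranteed unless the bag is
ordered first.

```pig
-- Load the comma-separated file; 'aaa' is the path used in the quoted reply.
A = LOAD 'aaa' USING PigStorage(',');

-- Keep only the first three columns, since the desired output drops the 4th.
B = FOREACH A GENERATE $0, $1, $2;

-- Collect all rows that share the same value in the 1st column.
C = GROUP B BY $0;

-- For each group, keep one (arbitrary) row and flatten it back into a tuple.
D = FOREACH C {
    top1 = LIMIT B 1;
    GENERATE FLATTEN(top1);
};

DUMP D;
```

If a specific row per group matters, a nested ORDER on the bag before the
LIMIT would make the choice deterministic.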

Thanks.

On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
<[email protected]> wrote:
> Hi Michael,
>
> If I understand correctly, you are trying to get the distinct 1st column
> elements from the dataset? Something like this:
>
> grunt> A = load 'aaa' using PigStorage(',');
> grunt> B = foreach A GENERATE $0;
> grunt> C = DISTINCT B;
> grunt> DUMP C;
>
> Thanks,
> Prashant
>
> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've got a dataset as below:
>>
>> 10,234324234,NAME 1,3
>> 10,346464646,NAME 1,3
>> 10,438389232,NAME 1,3
>> 20,397383737,NAME 2,4
>> 20,383783234,NAME 2,4
>> 20,387382828,NAME 2,4
>> 20,309323333,NAME 2,4
>> 30,439378283,NAME 3,2
>> 30,010191923,NAME 3,2
>> 40,439837434,NAME 4,4
>> 40,383723443,NAME 4,4
>> 40,100182321,NAME 4,4
>> 40,992173732,NAME 4,4
>>
>> I'd like to just print out the distinct records by column 1.  Here's
>> what I have:
>>
>> A = group FULL by $0;
>>
>> B = foreach A {
>>        C0 = FULL.$0;
>>        UC0 = DISTINCT C0;
>>        generate group, COUNT(UC0);
>> };
>>
>> The script above prints out only the first column and a count (not
>> really required).  But I need to print out just a single tuple for
>> each distinct value of column 1.
>>
>> Is this possible?
>>
>> Any help is greatly appreciated.
>>
>>
>> Thanks!
>>
