Re: Distinct by column, generate tuple

Grig Gheorghiu Thu, 26 Jan 2012 14:24:11 -0800

If you just want to obtain a sample value for column #1 and column #2
associated with each unique value for column #0, this worked for me on
your dataset:


TEST = LOAD 'test.txt.gz' USING PigStorage(',') as (
        id1: chararray,
        id2: chararray,
        name: chararray,
        cnt: chararray);

G = GROUP TEST BY id1;
R = FOREACH G GENERATE FLATTEN(group), MAX(TEST.id2), MAX(TEST.name);
DUMP R;

I got

(10,438389232,NAME 1)
(20,397383737,NAME 2)
(30,439378283,NAME 3)
(40,992173732,NAME 4)


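You can replace MAX with some other aggregate function that works on
strings or numbers. Note that each aggregate is computed independently,
so the id2 and name values returned for a group are not guaranteed to
come from the same input row.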
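
If the values need to come from a single input row, a nested FOREACH
with LIMIT is one option. This is only a sketch that I have not run
against your data. The aliases "sorted" and "one_row" are names I made
up, and ordering by id2 is my assumption: it picks the row with the
smallest id2 in each group, not the first row in the file, because Pig
does not guarantee the order of tuples inside a bag.

```pig
-- Same LOAD as above; every field is read as chararray.
TEST = LOAD 'test.txt.gz' USING PigStorage(',') AS (
        id1: chararray,
        id2: chararray,
        name: chararray,
        cnt: chararray);

G = GROUP TEST BY id1;

-- For each group, sort its rows and keep exactly one of them,
-- so id2 and name are taken from the same input row.
R = FOREACH G {
        sorted = ORDER TEST BY id2;
        one_row = LIMIT sorted 1;
        GENERATE FLATTEN(one_row.(id1, id2, name));
};
DUMP R;
```

Dropping the ORDER step and applying LIMIT directly to TEST should also
return one row per group, but which row you get is then arbitrary.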

Grig


On Tue, Jan 24, 2012 at 11:55 PM, Michael Lok wrote:
> Hi Prashant,
>
> You're partially correct.  Based on the distinct values of the 1st
> column, I also need to grab the other columns for each distinct
> record.  Referring back to my original data set; the output should be:
>
> 10,234324234,NAME 1
> 20,397383737,NAME 2
> 30,439378283,NAME 3
> 40,439837434,NAME 4
>
> Thanks.
>
> On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi wrote:
>> Hi Michael,
>>
>> If I understand correctly you are trying to get the distinct 1st column
>> elements from the dataset? Something like this:
>>
>> grunt> A = load 'aaa' using PigStorage(',');
>> grunt> B = foreach A GENERATE $0;
>> grunt> C = DISTINCT B;
>> grunt> DUMP C;
>>
>> Thanks,
>> Prashant
>>
>> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok wrote:
>>
>>> Hi folks,
>>>
>>> I've got a dataset as below:
>>>
>>> 10,234324234,NAME 1,3
>>> 10,346464646,NAME 1,3
>>> 10,438389232,NAME 1,3
>>> 20,397383737,NAME 2,4
>>> 20,383783234,NAME 2,4
>>> 20,387382828,NAME 2,4
>>> 20,309323333,NAME 2,4
>>> 30,439378283,NAME 3,2
>>> 30,010191923,NAME 3,2
>>> 40,439837434,NAME 4,4
>>> 40,383723443,NAME 4,4
>>> 40,100182321,NAME 4,4
>>> 40,992173732,NAME 4,4
>>>
>>> I'd like to just print out the distinct records by column 1.  Here's
>>> what I have:
>>>
>>> A = group FULL by $0;
>>>
>>> B = foreach A {
>>>        C0 = FULL.$0;
>>>        UC0 = DISTINCT C0;
>>>        generate group, COUNT(UC0);
>>> };
>>>
>>> The script above prints out only the first column and count (not
>>> really required).  But I need to print out just a single tuple for
>>> each of the distinct rows.
>>>
>>> Is this possible?
>>>
>>> Any help is greatly appreciated.
>>>
>>>
>>> Thanks!
>>>
