Re: Distinct by column, generate tuple

Michael Lok Thu, 26 Jan 2012 19:32:23 -0800

Hi Grig,

Thanks for the code.  Works like a charm!


On Fri, Jan 27, 2012 at 6:23 AM, Grig Gheorghiu
<[email protected]> wrote:
> If you just want to obtain a sample value for column #1 and column #2
> associated with each unique value for column #0, this worked for me on
> your dataset:
>
> TEST = LOAD 'test.txt.gz' USING PigStorage(',') as (
>        id1: chararray,
>        id2: chararray,
>        name: chararray,
>        cnt: chararray);
>
> G = GROUP TEST BY id1;
> R = FOREACH G GENERATE FLATTEN(group), MAX(TEST.id2), MAX(TEST.name);
> DUMP R;
>
> I got
>
> (10,438389232,NAME 1)
> (20,397383737,NAME 2)
> (30,439378283,NAME 3)
> (40,992173732,NAME 4)
>
>
> You can replace MAX with some other aggregate function that works on
> strings or numbers.
>
> Grig
>
>
> On Tue, Jan 24, 2012 at 11:55 PM, Michael Lok <[email protected]> wrote:
>> Hi Prashant,
>>
>> You're partially correct.  Based on the distinct values of the 1st
>> column, I also need to grab the other columns for each distinct
>> record.  Referring back to my original data set; the output should be:
>>
>> 10,234324234,NAME 1
>> 20,397383737,NAME 2
>> 30,439378283,NAME 3
>> 40,439837434,NAME 4
>>
>> Thanks.
>>
>> On Wed, Jan 25, 2012 at 3:48 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>>> Hi Michael,
>>>
>>> If I understand correctly you are trying to get the distinct 1st column
>>> elements from the dataset? Something like this:
>>>
>>> grunt> A = load 'aaa' using PigStorage(',');
>>> grunt> B = foreach A GENERATE $0;
>>> grunt> C = DISTINCT B;
>>> grunt> DUMP C;
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Jan 24, 2012 at 11:19 PM, Michael Lok <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I've got a dataset as below:
>>>>
>>>> 10,234324234,NAME 1,3
>>>> 10,346464646,NAME 1,3
>>>> 10,438389232,NAME 1,3
>>>> 20,397383737,NAME 2,4
>>>> 20,383783234,NAME 2,4
>>>> 20,387382828,NAME 2,4
>>>> 20,309323333,NAME 2,4
>>>> 30,439378283,NAME 3,2
>>>> 30,010191923,NAME 3,2
>>>> 40,439837434,NAME 4,4
>>>> 40,383723443,NAME 4,4
>>>> 40,100182321,NAME 4,4
>>>> 40,992173732,NAME 4,4
>>>>
>>>> I'd like to just print out the distinct records by column 1.  Here's
>>>> what I have:
>>>>
>>>> A = group FULL by $0;
>>>>
>>>> B = foreach FULL {
>>>>        C0 = FULL.$0;
>>>>        UC0 = DISTINCT C0;
>>>>        generate group, COUNT(UC0);
>>>> };
>>>>
>>>> The script above prints out only the first column and count (not
>>>> really required).  But I need to print out just a single tuple for
>>>> each of the distinct row.
>>>>
>>>> Is this possible?
>>>>
>>>> Any help is greatly appreciated.
>>>>
>>>>
>>>> Thanks!
>>>>

Re: Distinct by column, generate tuple

Reply via email to