Dear Jacob,

Many thanks! It worked perfectly.

Regards,
Abhishek

-----Original Message-----
From: Jacob Perkins [mailto:[email protected]] 
Sent: 24 August 2013 23:49
To: [email protected]
Subject: Re: Dedupe Logic

Abhishek,

You should be able to do this by grouping by the three columns and then 
ordering by the fourth in a nested foreach.

eg:

data = load 'some_url' as (f11, f12, f13, f14);

deduped = foreach (group data by (f11,f12,f13)) {
            ordered = order data by f14 asc;
            one_rec = limit ordered 1;
            generate
              flatten(one_rec) as (f11, f12, f13, f14);
          };
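
For reference, here is a minimal sketch of the same approach written out as a complete script. The input and output paths, the comma delimiter, and the column types are assumptions for illustration only. Declaring f14 as a number is deliberate: an untyped field is treated as a bytearray, which may not sort in numeric order.

```pig
-- Hypothetical path, delimiter, and types; only the four key columns are shown.
data = load 'input_path' using PigStorage(',')
       as (f11:chararray, f12:chararray, f13:chararray, f14:long);

-- One group per distinct (f11, f12, f13) combination.
grouped = group data by (f11, f12, f13);

deduped = foreach grouped {
            ordered = order data by f14 asc;  -- smallest f14 first
            one_rec = limit ordered 1;        -- keep only that record
            generate flatten(one_rec);        -- emit it as a plain tuple
          };

-- Hypothetical output location.
store deduped into 'output_path' using PigStorage(',');
```

Leaving the "as" clause off the flatten should carry every loaded column through, which may be more convenient when the relation has many more than four fields.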


--jacob
@thedatachef


On Sat, 2013-08-24 at 18:03 +0000, Ambastha, Abhishek wrote:
> Hi,
> 
> How can I sort and dedupe on multiple columns?
> 
> I have a 5 GB file with 70 columns. I want to sort on four columns f11, f12, 
> f13 and f14. Then I want to dedupe on three columns f11, f12 and f13 so that 
> the minimum value of f14 is retained (that is pick up the first record after 
> sort). Please suggest how to do this.
> 
> Also, can this be done using rank function?
> 
> Regards,
> Abhishek

