We use com.facebook.hive.udf.UDFNumberRows to do a ranking by time in some of 
our queries. You could do that, and then do another select where the row 
number/rank is 1 to get all the "unique" rows.

There are probably a bunch of other ways to do this, but this is the one that 
first came to mind for me….

Enjoy!
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.ga...@zyquest.com

On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:

> Hi,
> 
> We have a huge table which may have duplicate records.A record is considered 
> duplicate based on 4 fields ( fld1 thru fld4) . We need to identify the 
> duplicate records and possibly mark the duplicates(except the first record 
> based on created time for a record).
> 
> Is this something that could be done by hive or we need to write custom M/R 
> for this.Could a inner join or a select with group by be used to find the 
> duplicates ? How do I mark the duplicate records as there is no update.
> 
> Whats the best way to do this using Hive ? Looking forward to hear the 
> suggestions.
> 
> Thanks

Reply via email to