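We use com.facebook.hive.udf.UDFNumberRows to rank rows by time in some of our queries. You could do the same here: number the rows within each group of the four key fields, ordered by created time, and then do another select that keeps only the rows whose row number/rank is 1 to get all the "unique" rows. Anything with a rank greater than 1 is a duplicate.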
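To make the rank-then-filter idea concrete, here is a minimal sketch in plain Python rather than Hive. It only illustrates the logic; it does not show how UDFNumberRows is called, and the sample rows, the column layout, and the helper name number_rows are all made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (fld1, fld2, fld3, fld4, created_time, payload)
rows = [
    ("a", "b", "c", "d", 20120817, "second copy"),
    ("a", "b", "c", "d", 20120815, "first copy"),
    ("x", "y", "z", "w", 20120816, "only copy"),
    ("a", "b", "c", "d", 20120819, "third copy"),
]

key = itemgetter(0, 1, 2, 3)  # the four fields that define a duplicate
created = itemgetter(4)       # the created time used for ranking


def number_rows(rows):
    """Return (row, rank) pairs; rank 1 is the earliest row for each key."""
    # Sort by key, then by created time, so each group is contiguous
    # and its earliest row comes first.
    ordered = sorted(rows, key=lambda r: (key(r), created(r)))
    ranked = []
    for _, group in groupby(ordered, key=key):
        for rank, row in enumerate(group, start=1):
            ranked.append((row, rank))
    return ranked


ranked = number_rows(rows)

# The "second select": keep rank 1 as the unique rows,
# and treat everything else as a marked duplicate.
unique_rows = [row for row, rank in ranked if rank == 1]
duplicates = [row for row, rank in ranked if rank > 1]
```

Since there is no update in Hive, the rank (or a flag derived from it) would have to be written out to a new table or partition rather than set on the existing rows.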
There are probably a bunch of other ways to do this, but this is the one that first came to mind for me. Enjoy!

Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.ga...@zyquest.com

On Aug 17, 2012, at 9:49 AM, Himanish Kushary wrote:

> Hi,
>
> We have a huge table which may have duplicate records. A record is considered
> a duplicate based on 4 fields (fld1 thru fld4). We need to identify the
> duplicate records and possibly mark the duplicates (except the first record
> based on created time for a record).
>
> Is this something that could be done by Hive, or do we need to write custom M/R
> for this? Could an inner join or a select with group by be used to find the
> duplicates? How do I mark the duplicate records, as there is no update?
>
> What's the best way to do this using Hive? Looking forward to hearing the
> suggestions.
>
> Thanks