Then maybe he could place each event in the same row key, with a column qualifier holding the event's timestamp saved as a long. In a RegionObserver's preCompact hook, he could then filter out, for each row, all columns but the first?
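The filtering rule such a preCompact observer would apply can be sketched in plain Python (this is just the keep-the-first-column logic over (rowkey, qualifier) pairs, not the actual coprocessor API; names are made up for illustration):

```python
def keep_first_event(cells):
    """cells: iterable of (rowkey, qualifier_ts) pairs, where the
    qualifier is the event timestamp stored as a long. Returns only
    the smallest (earliest) qualifier per rowkey, so the first event
    wins and later duplicates are dropped, mimicking what the
    compaction-time filter would emit."""
    first = {}
    for row, ts in cells:
        if row not in first or ts < first[row]:
            first[row] = ts
    return sorted(first.items())
```

Since column qualifiers sort lexicographically/numerically within a row, "the first column" of each row is exactly the earliest timestamp, which is why the original event survives and the duplicates are discarded.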
On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set as 1 and a duplicate key is added, the last added will
> win, removing the old. Is this what you want, Rahul? I think from his
> explanation he needs the reverse way.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [[email protected]]
> Sent: Friday, February 15, 2013 3:56 AM
> To: [email protected]; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can define max versions of 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
> > Hi,
> > We have events which are delivered into our HDFS cluster which may be
> > duplicated. Each event has a UUID, and we were hoping to leverage HBase to
> > dedupe them. We run a MapReduce job which performs a lookup for each
> > UUID on HBase and then emits the event only if the UUID was absent, and
> > would also insert into the HBase table (this is simplistic; I am leaving out
> > details that make this more resilient to failures). My concern is that doing
> > a Read+Write for every event in MR would be slow (we expect around 1
> > billion events every hour). Does anyone use HBase for a similar use case, or
> > is there a different approach to achieving the same end result? Any
> > information or comments would be great.
> >
> > Thanks,
> > ~Rahul.
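For reference, the max-versions approach Asaf describes corresponds to a column family created roughly like this in the HBase shell (table and family names are placeholders):

```
create 'events', {NAME => 'd', VERSIONS => 1}
```

With VERSIONS => 1 and the UUID as the row key, re-inserting a duplicate overwrites the earlier cell and major compaction discards it, which, as Anoop points out, keeps the *last* write rather than the first.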
