Or maybe go with a large value for max versions and put the duplicate entries. Then, in the compaction, you would need a wrapper around InternalScanner whose next() method returns only the 1st KV for each row, removing the others. The same kind of logic would also be needed on the scan path. This should be good enough IMO, especially when there won't be many duplicate events for the same rowkey. That is why I asked some questions before.
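The wrapper idea can be illustrated outside of HBase. Assuming cells arrive in HBase sort order (rowkey ascending, then qualifier ascending, so with the event timestamp as qualifier the earliest duplicate sorts first within a row), a next()-style filter that keeps only the 1st KV per row looks roughly like this. This is a plain-Python sketch with hypothetical names, not the actual InternalScanner API:

```python
def first_kv_per_row(cells):
    """Yield only the first cell of each row from a stream of cells
    already in HBase sort order: (rowkey, qualifier) ascending.
    Later cells with the same rowkey (the duplicates) are dropped,
    mimicking what the wrapping scanner's next() would return."""
    prev_row = object()  # sentinel that matches no real rowkey
    for rowkey, qualifier, value in cells:
        if rowkey != prev_row:
            prev_row = rowkey
            yield (rowkey, qualifier, value)
        # else: a duplicate version of the same row -- skip it
```

The same generator could back both the compaction path and the user scan path, which is why Anoop notes the logic is needed in both places.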
I think this solution can be checked.

-Anoop-
________________________________________
From: Asaf Mesika [[email protected]]
Sent: Friday, February 15, 2013 3:06 PM
To: [email protected]
Cc: Rahul Ravindran
Subject: Re: Using HBase for Deduping

Then maybe he can place an event in the same rowkey but with a column
qualifier which is the timestamp of the event saved as a long. Upon
preCompact in a region observer he can filter out, for any row, all
columns but the first?

On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set as 1 and a duplicate key is added, the last
> added will win, removing the old. Is this what you want, Rahul? I think
> from his explanation he needs the reverse.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [[email protected]]
> Sent: Friday, February 15, 2013 3:56 AM
> To: [email protected]; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as
> the unique row key. You can define max versions of 1 on the column
> family, thus letting HBase get rid of the duplicates for you during
> major compaction.
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
> > Hi,
> > We have events which are delivered into our HDFS cluster which may be
> > duplicated. Each event has a UUID, and we were hoping to leverage
> > HBase to dedupe them. We run a MapReduce job which performs a lookup
> > for each UUID in HBase and then emits the event only if the UUID was
> > absent, also inserting it into the HBase table. (This is simplistic;
> > I am leaving out details that make this more resilient to failures.)
> > My concern is that doing a Read+Write for every event in MR would be
> > slow (we expect around 1 billion events every hour). Does anyone use
> > HBase for a similar use case, or is there a different approach to
> > achieving the same end result? Any information or comments would be
> > great.
> >
> > Thanks,
> > ~Rahul.
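For reference, the Read+Write per event that Rahul describes boils down to a check-then-insert on the UUID. A minimal sketch of that control flow, with an in-memory set standing in for the HBase table and a hypothetical helper name:

```python
def dedupe(events, seen):
    """Emit an event only when its UUID has not been recorded yet,
    recording it at the same time. This mirrors the per-event lookup
    (Get) plus insert (Put) in Rahul's MapReduce job; `seen` is an
    in-memory stand-in for the HBase table keyed by UUID."""
    emitted = []
    for uuid, payload in events:
        if uuid not in seen:   # the lookup: UUID absent?
            seen.add(uuid)     # the insert: record the UUID
            emitted.append((uuid, payload))
    return emitted
```

In real HBase this check-then-write would need an atomic operation such as checkAndPut (or the dedupe-at-compaction ideas discussed above) to stay correct under concurrent writers; the in-memory version only shows the shape of the per-event work whose cost Rahul is worried about at 1 billion events per hour.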
