Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?

Viral

From: Rahul Ravindran
Sent: 2/14/2013 11:41 AM
To: [email protected]
Subject: Using HBase for Deduping

Hi,

We have events delivered into our HDFS cluster which may be duplicated. Each event has a UUID, and we were hoping to leverage HBase to dedupe them. We run a MapReduce job which performs a lookup for each UUID in HBase and emits the event only if the UUID was absent, also inserting the UUID into the HBase table. (This is simplistic; I am omitting details that would make this more resilient to failures.)

My concern is that doing a read+write for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone use HBase for a similar use case, or is there a different approach to achieving the same end result? Any information or comments would be great.

Thanks,
~Rahul.
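One note on the read+write concern: the lookup-then-insert can be collapsed into a single atomic check-and-insert per UUID, so there is no separate Get before the Put. Below is a minimal sketch of that pattern using an in-memory ConcurrentHashMap as a stand-in for the HBase table; in the real MapReduce job, the putIfAbsent call would be replaced by HBase's HTable.checkAndPut with a null expected value, which writes the row only if it does not already exist. The class and method names here are illustrative, not from any existing codebase.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DedupeSketch {
    // Stand-in for the HBase table keyed by event UUID. In the real job
    // this would be an HTable, and the putIfAbsent below would become
    // table.checkAndPut(rowKey, family, qualifier, null, put), which
    // succeeds only when the cell is currently absent.
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if the event should be emitted (UUID not seen before). */
    public boolean firstSighting(String uuid) {
        // Atomic check-and-insert: one round trip instead of a read
        // followed by a write, and safe under concurrent mappers.
        return seen.putIfAbsent(uuid, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupeSketch d = new DedupeSketch();
        System.out.println(d.firstSighting("uuid-1")); // true: first time seen
        System.out.println(d.firstSighting("uuid-1")); // false: duplicate
        System.out.println(d.firstSighting("uuid-2")); // true: new UUID
    }
}
```

This still costs one HBase RPC per event, so at ~1 billion events/hour batching the checks per mapper would matter; the sketch only shows the atomicity idea.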
