Then maybe he could place each event in the same row key, with a column qualifier holding the event's timestamp saved as a long. In a RegionObserver's preCompact hook, he could then filter out, for each row, all columns but the first?
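The filtering rule such a preCompact observer would apply can be sketched in plain Python (this is just the keep-the-first-column logic over (rowkey, qualifier) pairs, not the actual coprocessor API; names are made up for illustration):

```python
def keep_first_event(cells):
    """cells: iterable of (rowkey, qualifier_ts) pairs, where the
    qualifier is the event timestamp stored as a long. Returns only
    the smallest (earliest) qualifier per rowkey, so the first event
    wins and later duplicates are dropped, mimicking what the
    compaction-time filter would emit."""
    first = {}
    for row, ts in cells:
        if row not in first or ts < first[row]:
            first[row] = ts
    return sorted(first.items())
```

Since column qualifiers sort lexicographically/numerically within a row, "the first column" of each row is exactly the earliest timestamp, which is why the original event survives and the duplicates are discarded.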
On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set as 1 and a duplicate key is added, the last added will
> win, removing the old. Is this what you want, Rahul? I think from his
> explanation he needs the reverse way.
>
> -Anoop-
> ________________________________________
> From: Asaf Mesika [[email protected]]
> Sent: Friday, February 15, 2013 3:56 AM
> To: [email protected]; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
>
> You can load the events into an HBase table which has the event id as the
> unique row key. You can define max versions of 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
>
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>
> > Hi,
> > We have events which are delivered into our HDFS cluster which may be
> > duplicated. Each event has a UUID, and we were hoping to leverage HBase to
> > dedupe them. We run a MapReduce job which performs a lookup for each
> > UUID on HBase and then emits the event only if the UUID was absent, and
> > would also insert into the HBase table (this is simplistic; I am leaving out
> > details that make this more resilient to failures). My concern is that doing
> > a Read+Write for every event in MR would be slow (we expect around 1
> > billion events every hour). Does anyone use HBase for a similar use case, or
> > is there a different approach to achieving the same end result? Any
> > information or comments would be great.
> >
> > Thanks,
> > ~Rahul.
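For reference, the max-versions approach Asaf describes corresponds to a column family created roughly like this in the HBase shell (table and family names are placeholders):

```
create 'events', {NAME => 'd', VERSIONS => 1}
```

With VERSIONS => 1 and the UUID as the row key, re-inserting a duplicate overwrites the earlier cell and major compaction discards it, which, as Anoop points out, keeps the *last* write rather than the first.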
