Michael, this means a read for every write?
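For concreteness, here is a minimal sketch of the checkAndPut() approach Michael suggests below, written against the 0.94-era client API. The "events" table name and the "d"/"seen" family and qualifier are made up for illustration. Passing null as the expected value applies the Put only if the cell does not yet exist; the check and the put are atomic, but the region server does perform a read to evaluate the check, which is the read-per-write cost in question.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UuidDeduper {
  private static final byte[] CF  = Bytes.toBytes("d");    // hypothetical family
  private static final byte[] COL = Bytes.toBytes("seen"); // hypothetical qualifier

  private final HTable table;

  public UuidDeduper() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    table = new HTable(conf, "events"); // hypothetical table, row key = event UUID
  }

  /** Returns true exactly once per UUID: the first writer wins. */
  public boolean firstTimeSeen(byte[] uuid) throws IOException {
    Put put = new Put(uuid);
    put.add(CF, COL, Bytes.toBytes(System.currentTimeMillis()));
    // A null expected value means "apply the Put only if the cell is absent".
    // The check and the put happen atomically on the region server.
    return table.checkAndPut(uuid, CF, COL, null, put);
  }
}

A caller would then push the event only when firstTimeSeen() returns true.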
On Friday, February 15, 2013, Michael Segel wrote:

> What constitutes a duplicate?
>
> An oversimplification is to do an HTable.checkAndPut() where you do the
> put if the column doesn't exist.
> Then, if the row is inserted (TRUE return value), you push the event.
>
> That will do what you want.
>
> At least at first blush.
>
>
> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>
> > Given the size of the data (> 1B rows) and the frequency of job run (once
> > per hour), I don't think the optimal solution is to look up HBase for
> > every single event. You will benefit more by loading the HBase table
> > directly in your MR job.
> >
> > In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
> >
> > Also, once you have done the de-dupe, are you going to use the data again
> > in some other way, i.e. online serving of traffic or some other analysis?
> > Or is this just to compute some unique #'s?
> >
> > It will be more helpful if you describe your final use case for the
> > computed data too. Given the amount of back and forth, we can take it off
> > list too and summarize the conversation for the list.
> >
> > On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
> >
> >> We can't rely on the assumption that event dupes will not dupe outside an
> >> hour boundary. So, your take is that doing a lookup per event within the
> >> MR job is going to be bad?
> >>
> >>
> >> ________________________________
> >> From: Viral Bajaria <[email protected]>
> >> To: Rahul Ravindran <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Thursday, February 14, 2013 12:48 PM
> >> Subject: Re: Using HBase for Deduping
> >>
> >> You could go with a 2-pronged approach here, i.e. some MR and some HBase
> >> lookups. I don't think this is the best solution either, given the # of
> >> events you will get.
> >>
> >> FWIW, the solution below again relies on the assumption that if an event
> >> is duped in the same hour it won't have a dupe outside of that hour
> >> boundary. If it can, then you are better off running an MR job with the
> >> current hour + another 3 hours of data, or an MR job with the current
> >> hour + the HBase table as input to the job too (i.e. no HBase lookups,
> >> just read the HFile directly)?
> >>
> >> - Run an MR job which de-dupes events for the current hour, i.e. only
> >> runs on 1 hour worth of data.
> >> - Mark records which you were not able to de-dupe in the current run.
> >> - For the records that you were not able to de-dupe, check against HBase
> >> whether you saw that event in the past. If you did, you can drop the
> >> current event or update the event to the new value (based on your
> >> business logic).
> >> - Save all the de-duped events (via HBase bulk upload).
> >>
> >> Sorry if I just rambled along, but without knowing the whole problem it's
> >> very tough to come up with a probable solution. So correct my assumptions
> >> and we can drill down more.
> >>
> >> Thanks,
> >> Viral
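A rough sketch of the two-pass shape Viral outlines above, under the same assumptions: a hypothetical "events" table keyed by UUID, and a mapper (not shown) that emits (uuid, serialized event) pairs from the hour's input files. The shuffle collapses within-hour dupes for free; only the surviving distinct UUIDs pay for an HBase existence check, and the kept output would be bulk-loaded back into the table as the final step.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (uuid, event). The shuffle groups every copy of a
// UUID into one reduce() call, so within-hour dupes collapse here
// without touching HBase at all.
public class DedupReducer extends Reducer<Text, Text, Text, Text> {

  private HTable table;

  @Override
  protected void setup(Context ctx) throws IOException {
    table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
                       "events"); // hypothetical table, row key = UUID
  }

  @Override
  protected void reduce(Text uuid, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    // One existence check per distinct UUID, not per raw event.
    Get get = new Get(Bytes.toBytes(uuid.toString()));
    if (!table.exists(get)) {
      // Never seen before: keep one representative copy. New UUIDs
      // would then be written back to HBase, ideally via bulk upload.
      ctx.write(uuid, events.iterator().next());
    }
    // else: drop it, or update the stored event per your business logic.
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    table.close();
  }
}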
> >>
> >> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
> >> wrote:
> >>
> >>> Most will be in the same hour. Some will be across 3-6 hours.
> >>>
> >>> Sent from my phone. Excuse the terseness.
> >>>
> >>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
> >>> wrote:
> >>>
> >>>> Are all these dupe events expected to be within the same hour, or can
> >>>> they happen over multiple hours?
> >>>>
> >>>> Viral
> >>>>
> >>>> From: Rahul Ravindran
> >>>> Sent: 2/14/2013 11:41 AM
> >>>> To: [email protected]
> >>>> Subject: Using HBase for Deduping
> >>>>
> >>>> Hi,
> >>>> We have events which are delivered into our HDFS cluster which may
> >>>> be duplicated. Each event has a UUID and we were hoping to leverage
>
> Michael Segel | (m) 312.755.9623
>
> Segel and Associates
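Finally, on Viral's alternative of feeding the HBase table itself into the job instead of doing per-event lookups: there was no stock InputFormat for reading HFiles directly at the time, but the supported equivalent is a full-table scan through TableInputFormat. A sketch, again with the hypothetical "events" table, that dumps every already-seen UUID so it can be joined (e.g. via MultipleInputs) against the new hour's events:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeenUuidDump {

  // Emits every row key (UUID) already stored in the table.
  static class SeenUuidMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(row.copyBytes()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "dump-seen-uuids");
    job.setJarByClass(SeenUuidDump.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // bigger batches per RPC for a full scan
    scan.setCacheBlocks(false);  // don't churn the block cache with scan data

    TableMapReduceUtil.initTableMapperJob(
        "events",                // hypothetical table, row key = UUID
        scan, SeenUuidMapper.class, Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);    // map-only dump of the keys
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}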
