Interesting. Surround it with a try/catch?
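Something along these lines, perhaps. This is just a minimal, untested sketch against the plain HTable client API; the table name ("event_dedup"), column family, and qualifier are placeholders, not anything from your schema:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {

    // Placeholder names -- substitute your real family/qualifier.
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("seen");

    /**
     * Returns true if this UUID has not been seen before (and records it),
     * false if the cell already existed, i.e. a duplicate.
     */
    public static boolean firstTimeSeen(HTable table, String uuid) {
        byte[] row = Bytes.toBytes(uuid);
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(System.currentTimeMillis()));
        try {
            // With a null expected value, the check is for the *absence* of
            // the column, so the put is applied only on first sight. The
            // whole check-then-put is atomic on the region server.
            return table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
        } catch (IOException e) {
            // Connection/region trouble rather than "row missing" -- decide
            // whether to retry or fail the event.
            throw new RuntimeException("checkAndPut failed for uuid " + uuid, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "event_dedup"); // placeholder table
        if (firstTimeSeen(table, "123e4567-e89b-12d3-a456-426614174000")) {
            System.out.println("new event -- push it downstream");
        } else {
            System.out.println("duplicate -- drop it");
        }
        table.close();
    }
}

Passing null as the expected value should make the check one for non-existence, so the put only goes through the first time a UUID shows up. If it still throws on a missing row, the catch at least tells you which UUID tripped it.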
But it sounds like you're on the right path. Happy Coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[email protected]> wrote:

> I had tried checkAndPut yesterday with a null passed as the value, and it had
> thrown an exception when the row did not exist. Perhaps I was doing
> something wrong. Will try that again, since, yes, I would prefer a
> checkAndPut().
>
>
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected]
> Cc: Rahul Ravindran <[email protected]>
> Sent: Friday, February 15, 2013 4:36 AM
> Subject: Re: Using HBase for Deduping
>
>
> On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[email protected]> wrote:
>
>> Michael, this means a read for every write?
>>
> Yes and no.
>
> At the macro level, a read for every write would mean that your client would
> read a record from HBase, and then based on some logic it would either write
> a record, or not.
>
> So you have a lot of overhead in the initial get() and then the put().
>
> At this macro level, with a check-and-put you have less overhead because of a
> single message to HBase.
>
> Internal to HBase, you would still have to check the value in the row, if it
> exists, and then perform an insert or not.
>
> With respect to your billion events an hour...
>
> Dividing by 3600 to get the number of events in a second, you would have less
> than 300,000 events a second.
>
> What exactly are you doing, and how large are those events?
>
> Since you are processing these events in a batch job, timing doesn't appear
> to be that important, and of course there is also async HBase, which may
> improve some of the performance.
>
> YMMV, but this is a good example of the checkAndPut().
>
>
>
>> On Friday, February 15, 2013, Michael Segel wrote:
>>
>>> What constitutes a duplicate?
>>>
>>> An oversimplification is to do an HTable.checkAndPut() where you do the
>>> put if the column doesn't exist.
>>> Then if the row is inserted (TRUE return value), you push the event.
>>>
>>> That will do what you want.
>>>
>>> At least at first blush.
>>>
>>>
>>>
>>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]>
>>> wrote:
>>>
>>>> Given the size of the data (> 1B rows) and the frequency of the job run
>>>> (once per hour), I don't think your optimal solution is to look up HBase
>>>> for every single event. You will benefit more by loading the HBase table
>>>> directly in your MR job.
>>>>
>>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>>
>>>> Also, once you have done the unique, are you going to use the data again
>>>> in some other way, i.e. online serving of traffic or some other analysis?
>>>> Or is this just to compute some unique #'s?
>>>>
>>>> It will be more helpful if you describe your final use case of the
>>>> computed data too. Given the amount of back and forth, we can take it off
>>>> list too and summarize the conversation for the list.
>>>>
>>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]>
>>>> wrote:
>>>>
>>>>> We can't rely on the assumption that event dupes will not dupe outside an
>>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>>> MR job is going to be bad?
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Viral Bajaria <[email protected]>
>>>>> To: Rahul Ravindran <[email protected]>
>>>>> Cc: "[email protected]" <[email protected]>
>>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>>> Subject: Re: Using HBase for Deduping
>>>>>
>>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>>>> lookups. I don't think this is the best solution either, given the # of
>>>>> events you will get.
>>>>>
>>>>> FWIW, the solution below again relies on the assumption that if an event
>>>>> is duped in the same hour it won't have a dupe outside of that hour
>>>>> boundary. If it can, then you are better off running an MR job with the
>>>>> current hour + another 3 hours of data, or an MR job with the current
>>>>> hour + the HBase table as input to the job too (i.e. no HBase lookups,
>>>>> just read the HFiles directly).
>>>>>
>>>>> - Run an MR job which de-dupes events for the current hour, i.e. only
>>>>> runs on 1 hour's worth of data.
>>>>> - Mark records which you were not able to de-dupe in the current run.
>>>>> - For the records that you were not able to de-dupe, check against HBase
>>>>> whether you saw that event in the past. If you did, you can drop the
>>>>> current event or update the event to the new value (based on your
>>>>> business logic).
>>>>> - Save all the de-duped events (via HBase bulk upload).
>>>>>
>>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>>> very tough to come up with a probable solution. So correct my assumptions
>>>>> and we can drill down more.
>>>>>
>>>>> Thanks,
>>>>> Viral
>>>>>
>>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>>>
>>>>>> Sent from my phone. Excuse the terseness.
>>>>>>
>>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>>>> they happen over multiple hours?
>>>>>>>
>>>>>>> Viral
>>>>>>>
>>>>>>> From: Rahul Ravindran
>>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>>> To: [email protected]
>>>>>>> Subject: Using HBase for Deduping
>>>>>>>
>>>>>>> Hi,
>>>>>>> We have events which are delivered into our HDFS cluster which may
>>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>
>>> Michael Segel | (m) 312.755.9623
>>>
>>> Segel and Associates
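For reference, here is a rough sketch of what the reduce side of Viral's two-pronged approach above might look like. It assumes the map phase keys each event by its UUID (so in-hour duplicates collapse into a single reduce call) and reuses the same checkAndPut() trick for the cross-hour check; the table, family, and qualifier names are again placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupReducer extends Reducer<Text, Text, Text, Text> {

    // Placeholder names -- substitute your real table/family/qualifier.
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("seen");

    private HTable dedupTable;

    @Override
    protected void setup(Context context) throws IOException {
        dedupTable = new HTable(context.getConfiguration(), "event_dedup");
    }

    @Override
    protected void reduce(Text uuid, Iterable<Text> events, Context context)
            throws IOException, InterruptedException {
        // All copies of a UUID from this hour arrive in one reduce call, so
        // keeping the first value collapses the in-hour duplicates. (Copy
        // it: Hadoop reuses the value object across iterations.)
        Text first = new Text(events.iterator().next());

        byte[] row = Bytes.toBytes(uuid.toString());
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, new byte[0]);

        // Atomic "insert if absent" against the dedup table: true means the
        // UUID was never recorded in an earlier hour, so the event survives.
        if (dedupTable.checkAndPut(row, FAMILY, QUALIFIER, null, put)) {
            context.write(uuid, first);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        dedupTable.close();
    }
}

Whether the per-UUID checkAndPut() here beats handing the HBase table to the job as a second input (reading the HFiles directly, as suggested above) depends on how many records survive the in-hour pass; at these volumes it is worth benchmarking both.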
