I could surround it with a Try..Catch, but that would mean that each time I insert a UUID for the first time (99% of the time), I would do a checkAndPut(), catch the resulting exception, and then perform a Put; that is 2 operations per reduce invocation, which is what I was looking to avoid.
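For readers following along, here is a minimal, locally runnable sketch of the semantics being discussed. The real 0.94-era call is roughly `table.checkAndPut(row, family, qualifier, null /* expect cell absent */, put)`, which atomically inserts only if the cell does not exist and returns true on success. Since that requires a running cluster, a ConcurrentHashMap stands in for the HBase table below; the class and method names are illustrative, not from the thread.

```java
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for an HBase table keyed by event UUID. putIfAbsent mirrors the
// atomic "insert only if the row/cell does not exist yet" behavior of
// HTable.checkAndPut(row, family, qualifier, null, put).
public class DedupSketch {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if this UUID was not seen before (i.e. push the event). */
    public boolean checkAndPut(String uuid) {
        // putIfAbsent returns null only when no prior mapping existed.
        return seen.putIfAbsent(uuid, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupSketch table = new DedupSketch();
        System.out.println(table.checkAndPut("uuid-1")); // true: first sighting
        System.out.println(table.checkAndPut("uuid-1")); // false: duplicate
        System.out.println(table.checkAndPut("uuid-2")); // true
    }
}
```

Note this is one round trip per event, with no separate get() beforehand, which is the single-message advantage discussed downthread.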
________________________________
From: Michael Segel <[email protected]>
To: [email protected]; Rahul Ravindran <[email protected]>
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping

Interesting. Surround with a Try Catch?

But it sounds like you're on the right path.

Happy Coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[email protected]> wrote:

I had tried checkAndPut yesterday with a null passed as the value, and it had thrown an exception when the row did not exist. Perhaps I was doing something wrong. Will try that again, since, yes, I would prefer a checkAndPut().
>
>
>________________________________
>From: Michael Segel <[email protected]>
>To: [email protected]
>Cc: Rahul Ravindran <[email protected]>
>Sent: Friday, February 15, 2013 4:36 AM
>Subject: Re: Using HBase for Deduping
>
>
>On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[email protected]> wrote:
>
>>Michael, this means a read for every write?
>
>Yes and no.
>
>At the macro level, a read for every write would mean that your client would read a record from HBase, and then based on some logic it would either write a record, or not.
>
>So you have a lot of overhead in the initial get() and then the put().
>
>At this macro level, with a checkAndPut you have less overhead because of a single message to HBase.
>
>Internal to HBase, you would still have to check the value in the row, if it exists, and then perform an insert or not.
>
>With respect to your billion events an hour...
>
>Dividing by 3600 to get the number of events in a second, you would have less than 300,000 events a second.
>
>What exactly are you doing and how large are those events?
>
>Since you are processing these events in a batch job, timing doesn't appear to be that important, and of course there is also async hbase, which may improve some of the performance.
>
>YMMV, but this is a good example of the checkAndPut().
>
>
>On Friday, February 15, 2013, Michael Segel wrote:
>>
>>What constitutes a duplicate?
>>
>>An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist. Then, if the row is inserted (TRUE return value), you push the event.
>>
>>That will do what you want.
>>
>>At least at first blush.
>>
>>
>>On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>>
>>>Given the size of the data (> 1B rows) and the frequency of the job run (once per hour), I don't think your most optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job.
>>>
>>>In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>
>>>Also, once you have done the de-dupe, are you going to use the data again in some other way, i.e. online serving of traffic or some other analysis? Or is this just to compute some unique #'s?
>>>
>>>It will be more helpful if you describe your final use case for the computed data too. Given the amount of back and forth, we can take it off-list too and summarize the conversation for the list.
>>>
>>>On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
>>>>
>>>>We can't rely on the assumption that event dupes will not dupe outside an hour boundary. So, your take is that doing a lookup per event within the MR job is going to be bad?
>>>>
>>>>
>>>>>________________________________
>>>>>From: Viral Bajaria <[email protected]>
>>>>>To: Rahul Ravindran <[email protected]>
>>>>>Cc: "[email protected]" <[email protected]>
>>>>>Sent: Thursday, February 14, 2013 12:48 PM
>>>>>Subject: Re: Using HBase for Deduping
>>>>>
>>>>>You could do with a 2-pronged approach here, i.e. some MR and some HBase lookups.
I don't think this is the best solution either, given the # of events you will get.
>>>>>
>>>>>FWIW, the solution below again relies on the assumption that if an event is duped in the same hour, it won't have a dupe outside of that hour boundary. If it can, then you are better off running an MR job with the current hour + another 3 hours of data, or an MR job with the current hour + the HBase table as input to the job too (i.e. no HBase lookups, just read the HFiles directly).
>>>>>
>>>>>- Run an MR job which de-dupes events for the current hour, i.e. only runs on 1 hour worth of data.
>>>>>- Mark records which you were not able to de-dupe in the current run.
>>>>>- For the records that you were not able to de-dupe, check against HBase whether you saw that event in the past. If you did, you can drop the current event or update the event to the new value (based on your business logic).
>>>>>- Save all the de-duped events (via HBase bulk upload).
>>>>>
>>>>>Sorry if I just rambled along, but without knowing the whole problem it's very tough to come up with a probable solution. So correct my assumptions and we could drill down more.
>>>>>
>>>>>Thanks,
>>>>>Viral
>>>>>
>>>>>On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]> wrote:
>>>>>>
>>>>>>Most will be in the same hour. Some will be across 3-6 hours.
>>>>>>
>>>>>>Sent from my phone. Excuse the terseness.
>>>>>>
>>>>>>On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]> wrote:
>>>>>>>
>>>>>>>Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?
>>>>>>>
>>>>>>>Viral
>>>>>>>From: Rahul Ravindran
>>>>>>>Sent: 2/14/2013 11:41 AM
>>>>>>>To: [email protected]
>>>>>>>Subject: Using HBase for Deduping
>>>>>>>Hi,
>>>>>>> We have events which are delivered into our HDFS cluster which may be duplicated.
Each event has a UUID and we were hoping to leverage
>>>
>>>
Michael Segel | (m) 312.755.9623
Segel and Associates
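The two-pronged approach Viral outlines (de-dupe within the hourly batch first, then check only the survivors against history, then save the kept events) can be sketched as a small runnable program. HashSets stand in for the hourly MR pass and the HBase table; the event names and the `dedupe` helper are illustrative, not from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two-pass dedup pipeline: an in-batch pass (the hourly MR
// job) followed by per-UUID checks against the historical store (the HBase
// lookups), ending with the events that would be bulk-uploaded.
public class TwoPassDedup {
    /** Returns the events from this hour's batch that are globally new. */
    static List<String> dedupe(List<String> hourBatch, Set<String> historicalUuids) {
        // Pass 1: de-dupe within the current hour (what the MR job does).
        LinkedHashSet<String> inHour = new LinkedHashSet<>(hourBatch);

        // Pass 2: check only the in-hour survivors against history; these are
        // the records the hourly pass could not prove unique on its own.
        List<String> kept = new ArrayList<>();
        for (String uuid : inHour) {
            if (historicalUuids.add(uuid)) { // add() returns false if seen before
                kept.add(uuid);
            }
        }
        return kept; // in the real pipeline, saved via HBase bulk upload
    }

    public static void main(String[] args) {
        Set<String> history = new HashSet<>(Arrays.asList("e1")); // seen hours ago
        List<String> batch = Arrays.asList("e1", "e2", "e2", "e3");
        System.out.println(dedupe(batch, history)); // [e2, e3]
    }
}
```

The point of the split is cost: the in-batch pass removes most duplicates cheaply, so the expensive per-UUID history checks only run for records that survive it.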
