Re: Using HBase for Deduping

Michael Segel Thu, 14 Feb 2013 17:44:05 -0800

Well, 
Maybe it's a lack of sleep, but this is what I found...
 checkAndPut

public boolean checkAndPut(byte[] row,
                           byte[] family,
                           byte[] qualifier,
                           byte[] value,
                           Put put)
                    throws IOException
Atomically checks if a row/family/qualifier value matches the expected value.
If it does, it adds the put. If the passed value is null, the check is for the
lack of column (i.e., non-existence).


Specified by:
checkAndPut in interface HTableInterface
Parameters:
row - to check
family - column family to check
qualifier - column qualifier to check
value - the expected value
put - data to put if check succeeds
Returns:
true if the new put was executed, false otherwise
Throws:
IOException - e
Maybe I'm reading it wrong? 
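
To make the null-check semantics concrete, here is a minimal sketch. It is an
in-memory stand-in, not the HBase client API: the class DedupeSketch and its
methods checkAbsentAndPut and dedupe are hypothetical names, and a
ConcurrentHashMap plays the role of the table. Against a real table, going by
the Javadoc quoted above, the equivalent call would presumably be
checkAndPut(row, family, qualifier, null, put).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory model of "put only if the column does not exist yet".
// Hypothetical illustration only; this does not talk to HBase.
class DedupeSketch {
    // Key stands in for row/family/qualifier (here: the event UUID);
    // value stands in for the cell contents.
    private final ConcurrentMap<String, String> cells = new ConcurrentHashMap<>();

    // Models checkAndPut with a null expected value: the put is applied
    // only when no cell exists. Returns true if the put was executed.
    boolean checkAbsentAndPut(String uuid, String value) {
        // putIfAbsent is atomic and returns null only if nothing was stored before.
        return cells.putIfAbsent(uuid, value) == null;
    }

    // Emits each UUID the first time it is seen; later copies are dropped.
    List<String> dedupe(List<String> uuids) {
        List<String> emitted = new ArrayList<>();
        for (String uuid : uuids) {
            if (checkAbsentAndPut(uuid, "seen")) {
                emitted.add(uuid); // first occurrence: push the event
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        DedupeSketch table = new DedupeSketch();
        System.out.println(table.dedupe(Arrays.asList("a", "b", "a", "c", "b")));
    }
}
```

The point of the sketch is only that the existence check and the write happen
as one atomic step, so a true return value can double as the "not a duplicate"
signal. Whether that is fast enough at a billion events an hour is a separate
question.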

But hey! What do I know? It's Valentine's Day and I'm spending my evening
answering questions sitting in my man cave instead of spending it with my wife.
It's no wonder I live in the perpetual dog house! :-P


On Feb 14, 2013, at 7:35 PM, Rahul Ravindran <[email protected]> wrote:

> checkAndPut() does not work when the row does not exist, or am I missing
> something?
> 
> Sent from my phone. Excuse the terseness.
> 
> On Feb 14, 2013, at 5:33 PM, Michael Segel <[email protected]> wrote:
> 
>> What constitutes a duplicate? 
>> 
>> An oversimplification is to do an HTable.checkAndPut() where you do the put
>> if the column doesn't exist.
>> Then, if the row is inserted (return value TRUE), you push the event.
>> 
>> That will do what you want.
>> 
>> At least at first blush. 
>> 
>> 
>> 
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>> 
>>> Given the size of the data (> 1B rows) and the frequency of the job run (once
>>> per hour), I don't think your optimal solution is to look up HBase for
>>> every single event. You will benefit more by loading the HBase table
>>> directly in your MR job.
>>> 
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>> 
>>> Also, once you have de-duped the events, are you going to use the data again
>>> in some other way, i.e. online serving of traffic or some other analysis? Or
>>> is this just to compute some unique counts?
>>> 
>>> It will be more helpful if you describe your final use case of the computed
>>> data too. Given the amount of back and forth, we can take it off list too
>>> and summarize the conversation for the list.
>>> 
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
>>> 
>>>> We can't rely on the assumption that event dupes will not occur outside an
>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>> MR job is going to be bad?
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Viral Bajaria <[email protected]>
>>>> To: Rahul Ravindran <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>> 
>>>> You could go with a two-pronged approach here, i.e. some MR and some HBase
>>>> lookups. I don't think this is the best solution either, given the # of
>>>> events you will get.
>>>> 
>>>> FWIW, the solution below again relies on the assumption that if an event is
>>>> duped in the same hour, it won't have a dupe outside of that hour boundary.
>>>> If it can, then you are better off running an MR job with the
>>>> current hour + another 3 hours of data, or an MR job with the current hour +
>>>> the HBase table as input to the job too (i.e. no HBase lookups, just read
>>>> the HFile directly).
>>>> 
>>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs on
>>>> 1 hour's worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run
>>>> - For the records that you were not able to de-dupe, check against HBase
>>>> whether you saw that event in the past. If you did, you can drop the
>>>> current event or update the event to the new value (based on your business
>>>> logic)
>>>> - Save all the de-duped events (via HBase bulk upload)
>>>> 
>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>> very tough to come up with a probable solution. So correct my assumptions
>>>> and we could drill down more.
>>>> 
>>>> Thanks,
>>>> Viral
>>>> 
>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
>>>> wrote:
>>>> 
>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>> 
>>>>> Sent from my phone. Excuse the terseness.
>>>>> 
>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>>> they happen over multiple hours?
>>>>>> 
>>>>>> Viral
>>>>>> From: Rahul Ravindran
>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>> To: [email protected]
>>>>>> Subject: Using HBase for Deduping
>>>>>> Hi,
>>>>>> We have events which are delivered into our HDFS cluster which may
>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>>>> HBase to dedupe them. We run a MapReduce job which would perform a
>>>>>> lookup for each UUID on HBase and then emit the event only if the UUID
>>>>>> was absent and would also insert into the HBase table (this is
>>>>>> simplistic; I am leaving out details that make this more resilient to
>>>>>> failures). My concern is that doing a Read+Write for every event in MR
>>>>>> would be slow (we expect around 1 billion events every hour). Does
>>>>>> anyone use HBase for a similar use case, or is there a different
>>>>>> approach to achieving the same end result? Any information or comments
>>>>>> would be great.
>>>>>> 
>>>>>> Thanks,
>>>>>> ~Rahul.
>> 
>> Michael Segel  | (m) 312.755.9623
>> 
>> Segel and Associates
>> 
>> 
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates
