Well,
Maybe it's a lack of sleep, but this is what I found...
checkAndPut
    public boolean checkAndPut(byte[] row,
                               byte[] family,
                               byte[] qualifier,
                               byte[] value,
                               Put put)
                        throws IOException
Atomically checks if a row/family/qualifier value matches the expected value.
If it does, it adds the put. If the passed value is null, the check is for the
lack of column (i.e., non-existence).
Specified by:
    checkAndPut in interface HTableInterface
Parameters:
    row - to check
    family - column family to check
    qualifier - column qualifier to check
    value - the expected value
    put - data to put if check succeeds
Returns:
    true if the new put was executed, false otherwise
Throws:
    IOException - e
Maybe I'm reading it wrong?
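
To illustrate the null-value case from the Javadoc above, here is a minimal sketch of the "put only if absent" idea applied to deduping. It is an in-memory analogy, not HBase client code: InMemoryColumn, DedupeSketch and dedupe are hypothetical names, String keys stand in for byte[] row keys, and ConcurrentHashMap stands in for the region server's atomic check. Against a real table the corresponding call would presumably be checkAndPut(row, family, qualifier, null, put), per the signature quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class DedupeSketch {

    /** Hypothetical in-memory stand-in for a single HBase column (not the HBase API). */
    static final class InMemoryColumn {
        private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

        /**
         * Mirrors the documented contract: a null expected value means
         * "apply the put only if the cell does not exist yet".
         * Returns true if the new value was written, false otherwise.
         */
        boolean checkAndPut(String row, String expected, String newValue) {
            if (expected == null) {
                // Atomic insert-if-absent; a null return means nothing was there before.
                return cells.putIfAbsent(row, newValue) == null;
            }
            // Atomic compare-and-replace against the expected value.
            return cells.replace(row, expected, newValue);
        }
    }

    /** Returns only the UUIDs that had not been recorded before, in input order. */
    static List<String> dedupe(InMemoryColumn seen, List<String> uuids) {
        List<String> firstSeen = new ArrayList<>();
        for (String uuid : uuids) {
            // true means the row was inserted, so this is the first sighting: push the event.
            if (seen.checkAndPut(uuid, null, "1")) {
                firstSeen.add(uuid);
            }
        }
        return firstSeen;
    }

    public static void main(String[] args) {
        InMemoryColumn seen = new InMemoryColumn();
        System.out.println(dedupe(seen, Arrays.asList("a", "b", "a", "c", "b"))); // [a, b, c]
        System.out.println(dedupe(seen, Arrays.asList("c", "d")));                // [d]
    }
}
```

The point of the sketch is that the check and the write happen as one atomic step, so a separate read followed by a write is not needed, and a UUID seen in an earlier batch is still rejected in a later one.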
But hey! What do I know? It's Valentine's Day and I'm spending my evening
answering questions sitting in my man cave instead of spending it with my wife.
It's no wonder I live in the perpetual dog house! :-P
On Feb 14, 2013, at 7:35 PM, Rahul Ravindran <[email protected]> wrote:
> checkAndPut() does not work when the row does not exist, or am I missing
> something?
>
> Sent from my phone. Excuse the terseness.
>
> On Feb 14, 2013, at 5:33 PM, Michael Segel <[email protected]> wrote:
>
>> What constitutes a duplicate?
>>
>> An oversimplification is to do an HTable.checkAndPut() where you do the put
>> if the column doesn't exist.
>> Then, if the row is inserted (return value TRUE), you push the event.
>>
>> That will do what you want.
>>
>> At least at first blush.
>>
>>
>>
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>>
>>> Given the size of the data (> 1B rows) and the frequency of job runs (once
>>> per hour), I don't think the optimal solution is to look up HBase for
>>> every single event. You will benefit more by loading the HBase table
>>> directly in your MR job.
>>>
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>
>>> Also, once you have deduped, are you going to use the data again in
>>> some other way, e.g. online serving of traffic or some other analysis? Or
>>> is this just to compute some unique counts?
>>>
>>> It will be more helpful if you describe your final use case of the computed
>>> data too. Given the amount of back and forth, we can take it off list too
>>> and summarize the conversation for the list.
>>>
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
>>>
>>>> We can't rely on the assumption that event dupes will not occur outside an
>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>> MR job is going to be bad?
>>>>
>>>>
>>>> ________________________________
>>>> From: Viral Bajaria <[email protected]>
>>>> To: Rahul Ravindran <[email protected]>
>>>> Cc: "[email protected]" <[email protected]>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>>
>>>> You could go with a two-pronged approach here, i.e. some MR and some HBase
>>>> lookups. I don't think this is the best solution either, given the number
>>>> of events you will get.
>>>>
>>>> FWIW, the solution below again relies on the assumption that if an event is
>>>> duped in the same hour, it won't have a dupe outside of that hour boundary.
>>>> If it can, then you are better off running an MR job with the
>>>> current hour + another 3 hours of data, or an MR job with the current hour +
>>>> the HBase table as input to the job too (i.e. no HBase lookups, just read
>>>> the HFile directly).
>>>>
>>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs
>>>> on 1 hour's worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run
>>>> - For the records that you were not able to de-dupe, check against HBase
>>>> whether you saw that event in the past. If you did, you can drop the
>>>> current event or update the event to the new value (based on your business
>>>> logic)
>>>> - Save all the de-duped events (via HBase bulk upload)
>>>>
>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>> very tough to come up with a probable solution. So correct my assumptions
>>>> and we could drill down more.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
>>>> wrote:
>>>>
>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>>
>>>>> Sent from my phone. Excuse the terseness.
>>>>>
>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>>> they happen over multiple hours?
>>>>>>
>>>>>> Viral
>>>>>> From: Rahul Ravindran
>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>> To: [email protected]
>>>>>> Subject: Using HBase for Deduping
>>>>>> Hi,
>>>>>> We have events which are delivered into our HDFS cluster which may
>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>>>> HBase to dedupe them. We run a MapReduce job which would perform a
>>>>>> lookup for each UUID on HBase, emit the event only if the UUID
>>>>>> was absent, and also insert it into the HBase table. (This is
>>>>>> simplistic; I am leaving out details that make this more resilient to
>>>>>> failures.) My concern is that doing a Read+Write for every event in MR
>>>>>> would be slow (we expect around 1 billion events every hour). Does
>>>>>> anyone use HBase for a similar use case, or is there a different
>>>>>> approach to achieving the same end result? Any information or comments
>>>>>> would be great.
>>>>>>
>>>>>> Thanks,
>>>>>> ~Rahul.
Michael Segel | (m) 312.755.9623
Segel and Associates