I could surround it with a Try..Catch, but that would mean that each time I insert a UUID for the first time (99% of the time), I would do a checkAndPut(), catch the resulting exception, and then perform a Put; that is 2 operations per reduce invocation, which is what I was looking to avoid.
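For readers following along, here is a minimal, locally runnable sketch of the semantics being discussed. The real 0.94-era call is roughly `table.checkAndPut(row, family, qualifier, null /* expect cell absent */, put)`, which atomically inserts only if the cell does not exist and returns true on success. Since that requires a running cluster, a ConcurrentHashMap stands in for the HBase table below; the class and method names are illustrative, not from the thread.

```java
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for an HBase table keyed by event UUID. putIfAbsent mirrors the
// atomic "insert only if the row/cell does not exist yet" behavior of
// HTable.checkAndPut(row, family, qualifier, null, put).
public class DedupSketch {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true if this UUID was not seen before (i.e. push the event). */
    public boolean checkAndPut(String uuid) {
        // putIfAbsent returns null only when no prior mapping existed.
        return seen.putIfAbsent(uuid, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupSketch table = new DedupSketch();
        System.out.println(table.checkAndPut("uuid-1")); // true: first sighting
        System.out.println(table.checkAndPut("uuid-1")); // false: duplicate
        System.out.println(table.checkAndPut("uuid-2")); // true
    }
}
```

Note this is one round trip per event, with no separate get() beforehand, which is the single-message advantage discussed downthread.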
________________________________
From: Michael Segel <[email protected]>
To: [email protected]; Rahul Ravindran <[email protected]>
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping

Interesting. Surround with a Try Catch?

But it sounds like you're on the right path.

Happy Coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[email protected]> wrote:

I had tried checkAndPut yesterday with a null passed as the value, and it had thrown an exception when the row did not exist. Perhaps I was doing something wrong. Will try that again, since, yes, I would prefer a checkAndPut().
>
>
>________________________________
>From: Michael Segel <[email protected]>
>To: [email protected]
>Cc: Rahul Ravindran <[email protected]>
>Sent: Friday, February 15, 2013 4:36 AM
>Subject: Re: Using HBase for Deduping
>
>
>On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[email protected]> wrote:
>
>>Michael, this means a read for every write?
>
>Yes and no.
>
>At the macro level, a read for every write would mean that your client would read a record from HBase, and then based on some logic it would either write a record, or not.
>
>So you have a lot of overhead in the initial get() and then the put().
>
>At this macro level, with a checkAndPut you have less overhead because of a single message to HBase.
>
>Internal to HBase, you would still have to check the value in the row, if it exists, and then perform an insert or not.
>
>With respect to your billion events an hour...
>
>Dividing by 3600 to get the number of events in a second, you would have less than 300,000 events a second.
>
>What exactly are you doing and how large are those events?
>
>Since you are processing these events in a batch job, timing doesn't appear to be that important, and of course there is also async hbase, which may improve some of the performance.
>
>YMMV, but this is a good example of the checkAndPut().
>
>
>On Friday, February 15, 2013, Michael Segel wrote:
>>
>>What constitutes a duplicate?
>>
>>An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist. Then, if the row is inserted (TRUE return value), you push the event.
>>
>>That will do what you want.
>>
>>At least at first blush.
>>
>>
>>On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>>
>>>Given the size of the data (> 1B rows) and the frequency of the job run (once per hour), I don't think your most optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job.
>>>
>>>In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>
>>>Also, once you have done the de-dupe, are you going to use the data again in some other way, i.e. online serving of traffic or some other analysis? Or is this just to compute some unique #'s?
>>>
>>>It will be more helpful if you describe your final use case for the computed data too. Given the amount of back and forth, we can take it off-list too and summarize the conversation for the list.
>>>
>>>On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
>>>>
>>>>We can't rely on the assumption that event dupes will not dupe outside an hour boundary. So, your take is that doing a lookup per event within the MR job is going to be bad?
>>>>
>>>>
>>>>>________________________________
>>>>>From: Viral Bajaria <[email protected]>
>>>>>To: Rahul Ravindran <[email protected]>
>>>>>Cc: "[email protected]" <[email protected]>
>>>>>Sent: Thursday, February 14, 2013 12:48 PM
>>>>>Subject: Re: Using HBase for Deduping
>>>>>
>>>>>You could do with a 2-pronged approach here, i.e. some MR and some HBase lookups.
I don't think this is the best solution either, given the # of events you will get.
>>>>>
>>>>>FWIW, the solution below again relies on the assumption that if an event is duped in the same hour, it won't have a dupe outside of that hour boundary. If it can, then you are better off running an MR job with the current hour + another 3 hours of data, or an MR job with the current hour + the HBase table as input to the job too (i.e. no HBase lookups, just read the HFiles directly).
>>>>>
>>>>>- Run an MR job which de-dupes events for the current hour, i.e. only runs on 1 hour worth of data.
>>>>>- Mark records which you were not able to de-dupe in the current run.
>>>>>- For the records that you were not able to de-dupe, check against HBase whether you saw that event in the past. If you did, you can drop the current event or update the event to the new value (based on your business logic).
>>>>>- Save all the de-duped events (via HBase bulk upload).
>>>>>
>>>>>Sorry if I just rambled along, but without knowing the whole problem it's very tough to come up with a probable solution. So correct my assumptions and we could drill down more.
>>>>>
>>>>>Thanks,
>>>>>Viral
>>>>>
>>>>>On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]> wrote:
>>>>>>
>>>>>>Most will be in the same hour. Some will be across 3-6 hours.
>>>>>>
>>>>>>Sent from my phone. Excuse the terseness.
>>>>>>
>>>>>>On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]> wrote:
>>>>>>>
>>>>>>>Are all these dupe events expected to be within the same hour, or can they happen over multiple hours?
>>>>>>>
>>>>>>>Viral
>>>>>>>From: Rahul Ravindran
>>>>>>>Sent: 2/14/2013 11:41 AM
>>>>>>>To: [email protected]
>>>>>>>Subject: Using HBase for Deduping
>>>>>>>Hi,
>>>>>>> We have events which are delivered into our HDFS cluster which may be duplicated.
Each event has a UUID and we were hoping to leverage
>>>
>>>
Michael Segel | (m) 312.755.9623
Segel and Associates
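The two-pronged approach Viral outlines (de-dupe within the hourly batch first, then check only the survivors against history, then save the kept events) can be sketched as a small runnable program. HashSets stand in for the hourly MR pass and the HBase table; the event names and the `dedupe` helper are illustrative, not from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two-pass dedup pipeline: an in-batch pass (the hourly MR
// job) followed by per-UUID checks against the historical store (the HBase
// lookups), ending with the events that would be bulk-uploaded.
public class TwoPassDedup {
    /** Returns the events from this hour's batch that are globally new. */
    static List<String> dedupe(List<String> hourBatch, Set<String> historicalUuids) {
        // Pass 1: de-dupe within the current hour (what the MR job does).
        LinkedHashSet<String> inHour = new LinkedHashSet<>(hourBatch);

        // Pass 2: check only the in-hour survivors against history; these are
        // the records the hourly pass could not prove unique on its own.
        List<String> kept = new ArrayList<>();
        for (String uuid : inHour) {
            if (historicalUuids.add(uuid)) { // add() returns false if seen before
                kept.add(uuid);
            }
        }
        return kept; // in the real pipeline, saved via HBase bulk upload
    }

    public static void main(String[] args) {
        Set<String> history = new HashSet<>(Arrays.asList("e1")); // seen hours ago
        List<String> batch = Arrays.asList("e1", "e2", "e2", "e3");
        System.out.println(dedupe(batch, history)); // [e2, e3]
    }
}
```

The point of the split is cost: the in-batch pass removes most duplicates cheaply, so the expensive per-UUID history checks only run for records that survive it.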
