Interesting. Surround it with a try/catch?
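Something along these lines, perhaps. This is just a minimal, untested sketch against the plain HTable client API; the table name ("event_dedup"), column family, and qualifier are placeholders, not anything from your schema:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {

    // Placeholder names -- substitute your real family/qualifier.
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("seen");

    /**
     * Returns true if this UUID has not been seen before (and records it),
     * false if the cell already existed, i.e. a duplicate.
     */
    public static boolean firstTimeSeen(HTable table, String uuid) {
        byte[] row = Bytes.toBytes(uuid);
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(System.currentTimeMillis()));
        try {
            // With a null expected value, the check is for the *absence* of
            // the column, so the put is applied only on first sight. The
            // whole check-then-put is atomic on the region server.
            return table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
        } catch (IOException e) {
            // Connection/region trouble rather than "row missing" -- decide
            // whether to retry or fail the event.
            throw new RuntimeException("checkAndPut failed for uuid " + uuid, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "event_dedup"); // placeholder table
        if (firstTimeSeen(table, "123e4567-e89b-12d3-a456-426614174000")) {
            System.out.println("new event -- push it downstream");
        } else {
            System.out.println("duplicate -- drop it");
        }
        table.close();
    }
}

Passing null as the expected value should make the check one for non-existence, so the put only goes through the first time a UUID shows up. If it still throws on a missing row, the catch at least tells you which UUID tripped it.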
But it sounds like you're on the right path. Happy Coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[email protected]> wrote:

> I had tried checkAndPut yesterday with a null passed as the value, and it had
> thrown an exception when the row did not exist. Perhaps I was doing
> something wrong. Will try that again, since, yes, I would prefer a
> checkAndPut().
>
>
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected]
> Cc: Rahul Ravindran <[email protected]>
> Sent: Friday, February 15, 2013 4:36 AM
> Subject: Re: Using HBase for Deduping
>
>
> On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[email protected]> wrote:
>
>> Michael, this means a read for every write?
>>
> Yes and no.
>
> At the macro level, a read for every write would mean that your client would
> read a record from HBase, and then based on some logic it would either write
> a record, or not.
>
> So you have a lot of overhead in the initial get() and then the put().
>
> At this macro level, with a check-and-put you have less overhead because of a
> single message to HBase.
>
> Internal to HBase, you would still have to check the value in the row, if it
> exists, and then perform an insert or not.
>
> With respect to your billion events an hour...
>
> Dividing by 3600 to get the number of events in a second, you would have less
> than 300,000 events a second.
>
> What exactly are you doing, and how large are those events?
>
> Since you are processing these events in a batch job, timing doesn't appear
> to be that important, and of course there is also async HBase, which may
> improve some of the performance.
>
> YMMV, but this is a good example of the checkAndPut().
>
>
>
>> On Friday, February 15, 2013, Michael Segel wrote:
>>
>>> What constitutes a duplicate?
>>>
>>> An oversimplification is to do an HTable.checkAndPut() where you do the
>>> put if the column doesn't exist.
>>> Then if the row is inserted (TRUE return value), you push the event.
>>>
>>> That will do what you want.
>>>
>>> At least at first blush.
>>>
>>>
>>>
>>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]>
>>> wrote:
>>>
>>>> Given the size of the data (> 1B rows) and the frequency of the job run
>>>> (once per hour), I don't think your optimal solution is to look up HBase
>>>> for every single event. You will benefit more by loading the HBase table
>>>> directly in your MR job.
>>>>
>>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>>
>>>> Also, once you have done the unique, are you going to use the data again
>>>> in some other way, i.e. online serving of traffic or some other analysis?
>>>> Or is this just to compute some unique #'s?
>>>>
>>>> It will be more helpful if you describe your final use case of the
>>>> computed data too. Given the amount of back and forth, we can take it off
>>>> list too and summarize the conversation for the list.
>>>>
>>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]>
>>>> wrote:
>>>>
>>>>> We can't rely on the assumption that event dupes will not dupe outside an
>>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>>> MR job is going to be bad?
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Viral Bajaria <[email protected]>
>>>>> To: Rahul Ravindran <[email protected]>
>>>>> Cc: "[email protected]" <[email protected]>
>>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>>> Subject: Re: Using HBase for Deduping
>>>>>
>>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>>>> lookups. I don't think this is the best solution either, given the # of
>>>>> events you will get.
>>>>>
>>>>> FWIW, the solution below again relies on the assumption that if an event
>>>>> is duped in the same hour it won't have a dupe outside of that hour
>>>>> boundary. If it can, then you are better off running an MR job with the
>>>>> current hour + another 3 hours of data, or an MR job with the current
>>>>> hour + the HBase table as input to the job too (i.e. no HBase lookups,
>>>>> just read the HFiles directly).
>>>>>
>>>>> - Run an MR job which de-dupes events for the current hour, i.e. only
>>>>> runs on 1 hour's worth of data.
>>>>> - Mark records which you were not able to de-dupe in the current run.
>>>>> - For the records that you were not able to de-dupe, check against HBase
>>>>> whether you saw that event in the past. If you did, you can drop the
>>>>> current event or update the event to the new value (based on your
>>>>> business logic).
>>>>> - Save all the de-duped events (via HBase bulk upload).
>>>>>
>>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>>> very tough to come up with a probable solution. So correct my assumptions
>>>>> and we can drill down more.
>>>>>
>>>>> Thanks,
>>>>> Viral
>>>>>
>>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>>>
>>>>>> Sent from my phone. Excuse the terseness.
>>>>>>
>>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>>>> they happen over multiple hours?
>>>>>>>
>>>>>>> Viral
>>>>>>>
>>>>>>> From: Rahul Ravindran
>>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>>> To: [email protected]
>>>>>>> Subject: Using HBase for Deduping
>>>>>>>
>>>>>>> Hi,
>>>>>>> We have events which are delivered into our HDFS cluster which may
>>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>
>>> Michael Segel | (m) 312.755.9623
>>>
>>> Segel and Associates
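For reference, here is a rough sketch of what the reduce side of Viral's two-pronged approach above might look like. It assumes the map phase keys each event by its UUID (so in-hour duplicates collapse into a single reduce call) and reuses the same checkAndPut() trick for the cross-hour check; the table, family, and qualifier names are again placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupReducer extends Reducer<Text, Text, Text, Text> {

    // Placeholder names -- substitute your real table/family/qualifier.
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("seen");

    private HTable dedupTable;

    @Override
    protected void setup(Context context) throws IOException {
        dedupTable = new HTable(context.getConfiguration(), "event_dedup");
    }

    @Override
    protected void reduce(Text uuid, Iterable<Text> events, Context context)
            throws IOException, InterruptedException {
        // All copies of a UUID from this hour arrive in one reduce call, so
        // keeping the first value collapses the in-hour duplicates. (Copy
        // it: Hadoop reuses the value object across iterations.)
        Text first = new Text(events.iterator().next());

        byte[] row = Bytes.toBytes(uuid.toString());
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, new byte[0]);

        // Atomic "insert if absent" against the dedup table: true means the
        // UUID was never recorded in an earlier hour, so the event survives.
        if (dedupTable.checkAndPut(row, FAMILY, QUALIFIER, null, put)) {
            context.write(uuid, first);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        dedupTable.close();
    }
}

Whether the per-UUID checkAndPut() here beats handing the HBase table to the job as a second input (reading the HFiles directly, as suggested above) depends on how many records survive the in-hour pass; at these volumes it is worth benchmarking both.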
