Michael, this means a read for every write?
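For concreteness, here is a minimal sketch of the checkAndPut() approach Michael suggests below, written against the 0.94-era client API. The "events" table name and the "d"/"seen" family and qualifier are made up for illustration. Passing null as the expected value applies the Put only if the cell does not yet exist; the check and the put are atomic, but the region server does perform a read to evaluate the check, which is the read-per-write cost in question.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UuidDeduper {
  private static final byte[] CF  = Bytes.toBytes("d");    // hypothetical family
  private static final byte[] COL = Bytes.toBytes("seen"); // hypothetical qualifier

  private final HTable table;

  public UuidDeduper() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    table = new HTable(conf, "events"); // hypothetical table, row key = event UUID
  }

  /** Returns true exactly once per UUID: the first writer wins. */
  public boolean firstTimeSeen(byte[] uuid) throws IOException {
    Put put = new Put(uuid);
    put.add(CF, COL, Bytes.toBytes(System.currentTimeMillis()));
    // A null expected value means "apply the Put only if the cell is absent".
    // The check and the put happen atomically on the region server.
    return table.checkAndPut(uuid, CF, COL, null, put);
  }
}

A caller would then push the event only when firstTimeSeen() returns true.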
On Friday, February 15, 2013, Michael Segel wrote:

> What constitutes a duplicate?
>
> An oversimplification is to do an HTable.checkAndPut() where you do the
> put if the column doesn't exist.
> Then, if the row is inserted (TRUE return value), you push the event.
>
> That will do what you want.
>
> At least at first blush.
>
>
> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[email protected]> wrote:
>
> > Given the size of the data (> 1B rows) and the frequency of job run (once
> > per hour), I don't think the optimal solution is to look up HBase for
> > every single event. You will benefit more by loading the HBase table
> > directly in your MR job.
> >
> > In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
> >
> > Also, once you have done the de-dupe, are you going to use the data again
> > in some other way, i.e. online serving of traffic or some other analysis?
> > Or is this just to compute some unique #'s?
> >
> > It will be more helpful if you describe your final use case for the
> > computed data too. Given the amount of back and forth, we can take it off
> > list too and summarize the conversation for the list.
> >
> > On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[email protected]> wrote:
> >
> >> We can't rely on the assumption that event dupes will not dupe outside an
> >> hour boundary. So, your take is that doing a lookup per event within the
> >> MR job is going to be bad?
> >>
> >>
> >> ________________________________
> >> From: Viral Bajaria <[email protected]>
> >> To: Rahul Ravindran <[email protected]>
> >> Cc: "[email protected]" <[email protected]>
> >> Sent: Thursday, February 14, 2013 12:48 PM
> >> Subject: Re: Using HBase for Deduping
> >>
> >> You could go with a 2-pronged approach here, i.e. some MR and some HBase
> >> lookups. I don't think this is the best solution either, given the # of
> >> events you will get.
> >>
> >> FWIW, the solution below again relies on the assumption that if an event
> >> is duped in the same hour it won't have a dupe outside of that hour
> >> boundary. If it can, then you are better off running an MR job with the
> >> current hour + another 3 hours of data, or an MR job with the current
> >> hour + the HBase table as input to the job too (i.e. no HBase lookups,
> >> just read the HFile directly)?
> >>
> >> - Run an MR job which de-dupes events for the current hour, i.e. only
> >> runs on 1 hour worth of data.
> >> - Mark records which you were not able to de-dupe in the current run.
> >> - For the records that you were not able to de-dupe, check against HBase
> >> whether you saw that event in the past. If you did, you can drop the
> >> current event or update the event to the new value (based on your
> >> business logic).
> >> - Save all the de-duped events (via HBase bulk upload).
> >>
> >> Sorry if I just rambled along, but without knowing the whole problem it's
> >> very tough to come up with a probable solution. So correct my assumptions
> >> and we can drill down more.
> >>
> >> Thanks,
> >> Viral
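A rough sketch of the two-pass shape Viral outlines above, under the same assumptions: a hypothetical "events" table keyed by UUID, and a mapper (not shown) that emits (uuid, serialized event) pairs from the hour's input files. The shuffle collapses within-hour dupes for free; only the surviving distinct UUIDs pay for an HBase existence check, and the kept output would be bulk-loaded back into the table as the final step.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (uuid, event). The shuffle groups every copy of a
// UUID into one reduce() call, so within-hour dupes collapse here
// without touching HBase at all.
public class DedupReducer extends Reducer<Text, Text, Text, Text> {

  private HTable table;

  @Override
  protected void setup(Context ctx) throws IOException {
    table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
                       "events"); // hypothetical table, row key = UUID
  }

  @Override
  protected void reduce(Text uuid, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    // One existence check per distinct UUID, not per raw event.
    Get get = new Get(Bytes.toBytes(uuid.toString()));
    if (!table.exists(get)) {
      // Never seen before: keep one representative copy. New UUIDs
      // would then be written back to HBase, ideally via bulk upload.
      ctx.write(uuid, events.iterator().next());
    }
    // else: drop it, or update the stored event per your business logic.
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    table.close();
  }
}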
> >>
> >> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[email protected]>
> >> wrote:
> >>
> >>> Most will be in the same hour. Some will be across 3-6 hours.
> >>>
> >>> Sent from my phone. Excuse the terseness.
> >>>
> >>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[email protected]>
> >>> wrote:
> >>>
> >>>> Are all these dupe events expected to be within the same hour, or can
> >>>> they happen over multiple hours?
> >>>>
> >>>> Viral
> >>>>
> >>>> From: Rahul Ravindran
> >>>> Sent: 2/14/2013 11:41 AM
> >>>> To: [email protected]
> >>>> Subject: Using HBase for Deduping
> >>>>
> >>>> Hi,
> >>>> We have events which are delivered into our HDFS cluster which may
> >>>> be duplicated. Each event has a UUID and we were hoping to leverage
>
> Michael Segel | (m) 312.755.9623
>
> Segel and Associates
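Finally, on Viral's alternative of feeding the HBase table itself into the job instead of doing per-event lookups: there was no stock InputFormat for reading HFiles directly at the time, but the supported equivalent is a full-table scan through TableInputFormat. A sketch, again with the hypothetical "events" table, that dumps every already-seen UUID so it can be joined (e.g. via MultipleInputs) against the new hour's events:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeenUuidDump {

  // Emits every row key (UUID) already stored in the table.
  static class SeenUuidMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(row.copyBytes()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "dump-seen-uuids");
    job.setJarByClass(SeenUuidDump.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // bigger batches per RPC for a full scan
    scan.setCacheBlocks(false);  // don't churn the block cache with scan data

    TableMapReduceUtil.initTableMapperJob(
        "events",                // hypothetical table, row key = UUID
        scan, SeenUuidMapper.class, Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);    // map-only dump of the keys
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}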
