We could put a hook out of that iterator up into RegionObserver (HBASE-2001), for example.
Currently the observer only gets notified that a compaction has happened.

   - Andy

> From: Jonathan Gray <[email protected]>
> Subject: RE: Custom compaction
> To: "[email protected]" <[email protected]>
> Date: Thursday, May 27, 2010, 6:21 AM
>
> And of course, HBase is open source so you can hack it up to do what
> you want :)
>
> The compaction API basically has an iterator of KeyValues as input and
> then returns KeyValues as well.
>
> > -----Original Message-----
> > From: Friso van Vollenhoven [mailto:[email protected]]
> > Sent: Thursday, May 27, 2010 1:34 AM
> > To: [email protected]
> > Subject: Re: Custom compaction
> >
> > Hi,
> >
> > Actually, for us it would be nice to be able to hook into the
> > compaction, too.
> >
> > We store records that are basically events that occur at certain
> > times. We store the record itself as qualifier and a timeline as
> > column value (so multiple records+timelines per row key is possible).
> > So when a new record comes in, we do a get for the timeline, merge
> > the new timestamp with the existing timeline in memory and do a put
> > to update the column value with the new timeline.
> >
> > In our first version, we just wrote the individual timestamps as
> > values and used versioning to keep all timestamps in the value. Then
> > we combined all the timelines and individual timestamp into a single
> > timeline in memory on each read. We ran a MR job periodically to do
> > the timeline combining in the table and delete the obsolete
> > timestamps in order to keep read performance OK (because otherwise
> > the read operation would involve a lot of additional work to create a
> > timeline and lots of versions would be created). In the end, the
> > deletes in the MR job were a bottleneck (as I understand, but I was
> > not on the project at that moment).
> >
> > Now, if we could hook into the compactions, then we could just always
> > insert individual timestamps as new versions and do the combining of
> > versions into a single timeline during compaction (as compaction
> > needs to go through the complete table anyway). This would also
> > improve our insertion performance (no more gets in there, just puts
> > like in the first version), which is nice. We collect internet
> > routing information, which is collected at 80 million records per day
> > with updates coming in in batches every 5 minutes
> > (http://ris.ripe.net). We'd like to try to be efficient before just
> > throwing more machines at the problem.
> >
> > Will there be anything like this on the roadmap?
> >
> >
> > Cheers,
> > Friso
> >
> >
> > On May 27, 2010, at 1:01 AM, Jean-Daniel Cryans wrote:
> >
> > > Invisible. What's your need?
> > >
> > > J-D
> > >
> > > On Wed, May 26, 2010 at 3:56 PM, Vidhyashankar Venkataraman
> > > <[email protected]> wrote:
> > >> Is there a way to customize the compaction function (like a hook
> > >> provided by the API) or is it invisible to the user?
> > >>
> > >> Thank you
> > >> Vidhya
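
For illustration only, here is a rough sketch of the idea discussed in this thread: a wrapper over the compaction's sorted input iterator that folds the individual timestamp versions of each (row, qualifier) into a single timeline. It is plain Java and does not use any HBase API. The Cell and Timeline classes and the mergeTimelines method are hypothetical stand-ins invented for this sketch, and it assumes the input arrives sorted by row and qualifier, as a compaction scan would deliver it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

public class TimelineCompactionSketch {

    /** Hypothetical stand-in for a KeyValue: one event timestamp per cell. */
    public static final class Cell {
        public final String row;
        public final String qualifier;
        public final long timestamp;

        public Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Hypothetical compacted output: one merged timeline per (row, qualifier). */
    public static final class Timeline {
        public final String row;
        public final String qualifier;
        public final List<Long> timestamps;

        public Timeline(String row, String qualifier, List<Long> timestamps) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamps = timestamps;
        }
    }

    /**
     * Wraps an iterator of cells sorted by (row, qualifier) and lazily merges
     * each run of cells sharing the same row and qualifier into one Timeline
     * with ascending, de-duplicated timestamps.
     */
    public static Iterator<Timeline> mergeTimelines(final Iterator<Cell> cells) {
        return new Iterator<Timeline>() {
            // First cell of the next group, or null when the input is exhausted.
            private Cell pending = cells.hasNext() ? cells.next() : null;

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Timeline next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                Cell first = pending;
                pending = null;
                TreeSet<Long> merged = new TreeSet<Long>();
                merged.add(first.timestamp);
                // Consume cells until the row or qualifier changes.
                while (cells.hasNext()) {
                    Cell c = cells.next();
                    if (c.row.equals(first.row) && c.qualifier.equals(first.qualifier)) {
                        merged.add(c.timestamp);
                    } else {
                        pending = c; // belongs to the next group
                        break;
                    }
                }
                return new Timeline(first.row, first.qualifier, new ArrayList<Long>(merged));
            }
        };
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
                new Cell("r1", "recA", 300L),
                new Cell("r1", "recA", 100L),
                new Cell("r1", "recB", 200L),
                new Cell("r2", "recA", 50L));
        Iterator<Timeline> it = mergeTimelines(input.iterator());
        while (it.hasNext()) {
            Timeline t = it.next();
            System.out.println(t.row + "/" + t.qualifier + " -> " + t.timestamps);
        }
    }
}
```

With something along these lines, writes could stay as plain puts of individual timestamps, and the merge cost would be paid during compaction, which has to read the data anyway. How such a wrapper would actually be wired into HBase (for example through a RegionObserver hook as suggested above) is left open here.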
