This is not currently on any roadmap as far as I know, but I do think it's 
interesting nonetheless.

Compactions can be a good time to piggyback some additional work on your data, 
since we're already doing the work of reading and writing several HFiles.

One concern is compaction performance.  In HBase's architecture, overall 
performance can be significantly impacted by slow-running compactions.

Another concern is that minor compactions do not always include all of the 
files of a region.  That may limit what you can effectively do during a 
compaction, since you may not be seeing all of the data.  Major compactions, 
however, always compact every file in a region, so they don't have this 
limitation.
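
If whatever you hook in really needs to see all of the data, you can also force 
a major compaction from the client side.  A rough sketch (the table name is 
just a placeholder):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ForceMajorCompaction {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath.
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    // Asks the region servers to major compact every region of the table,
    // so a compaction-time hook would see all store files, not a subset.
    admin.majorCompact("my_table");
  }
}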

Friso, for your specific use case, are you trying to evict older versions of 
data?  I had a little bit of trouble understanding your schema.  Or are you 
periodically taking a bunch of versions of a column and combining them into a 
single version/value?  How many of these versions are you adding for each 
column?  Is it really the case that read performance is unacceptable when the 
data is spread across multiple versions?  One of the benefits of HBase is that 
these versions are stored sequentially on disk, so reading multiple versions 
(within reason) should not be significantly slower than reading one.
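
For what it's worth, reading a bunch of versions in one round trip is just a 
Get with setMaxVersions.  A rough sketch with the standard client API (the 
table, family, and qualifier names here are made up):

import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiVersionRead {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");

    Get get = new Get(Bytes.toBytes("some-row"));
    get.addColumn(Bytes.toBytes("t"), Bytes.toBytes("record-1"));
    // Ask for up to 100 versions of the column instead of only the latest.
    get.setMaxVersions(100);

    Result result = table.get(get);
    // Versions come back newest first; on disk they sit next to each other
    // in the HFile, so this is close to a single sequential read.
    List<KeyValue> versions =
        result.getColumn(Bytes.toBytes("t"), Bytes.toBytes("record-1"));
    for (KeyValue kv : versions) {
      System.out.println(kv.getTimestamp() + " -> "
          + Bytes.toStringBinary(kv.getValue()));
    }
  }
}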

In any case, this is an interesting direction and I think it's worth exploring. 
 As for how this would work, I'm not so sure yet.  Perhaps it could build on 
Andrew's work with Coprocessors, RegionObservers, etc.
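
Just to make that concrete, below is a very rough sketch of what a compaction 
hook could look like.  The coprocessor work hasn't shipped yet, so the 
interface names and the preCompact signature here are my assumptions about how 
it might end up looking, and TimelineCodec is a made-up placeholder for Friso's 
merge logic:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;

public class TimelineCompactionObserver extends BaseRegionObserver {

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
      Store store, final InternalScanner scanner) {
    // Wrap the scanner feeding the compaction so KeyValues can be rewritten
    // as they stream from the old HFiles into the new one.
    return new InternalScanner() {
      public boolean next(List<KeyValue> results) throws IOException {
        List<KeyValue> raw = new ArrayList<KeyValue>();
        boolean more = scanner.next(raw);
        // Hypothetical: collapse the many timestamp versions of a column
        // into one KeyValue carrying the merged timeline.
        results.addAll(TimelineCodec.mergeVersions(raw));
        return more;
      }

      public boolean next(List<KeyValue> results, int limit) throws IOException {
        return next(results);
      }

      public void close() throws IOException {
        scanner.close();
      }
    };
  }

  // Stand-in for the application-specific merge; this version just passes
  // the KeyValues through unchanged, where a real one would combine the
  // versions of each column into a single timeline value.
  static class TimelineCodec {
    static List<KeyValue> mergeVersions(List<KeyValue> kvs) {
      return kvs;
    }
  }
}

The appealing part is that the merge would happen while the compaction is 
already streaming every KeyValue from the old files into the new one, so 
there's no extra read pass over the data.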

JG

> -----Original Message-----
> From: Friso van Vollenhoven [mailto:[email protected]]
> Sent: Thursday, May 27, 2010 1:34 AM
> To: [email protected]
> Subject: Re: Custom compaction
> 
> Hi,
> 
> Actually, for us it would be nice to be able to hook into the
> compaction, too.
> 
> We store records that are basically events that occur at certain times.
> We store the record itself as the qualifier and a timeline as the column
> value (so multiple records+timelines per row key are possible). So when a
> new record comes in, we do a get for the timeline, merge the new timestamp
> with the existing timeline in memory and do a put to update the column
> value with the new timeline.
> 
> In our first version, we just wrote the individual timestamps as values
> and used versioning to keep all of the timestamps in the column. Then we
> combined all the timelines and individual timestamps into a single
> timeline in memory on each read. We ran an MR job periodically to do the
> timeline combining in the table and delete the obsolete timestamps in
> order to keep read performance OK (because otherwise the read operation
> would involve a lot of additional work to create a timeline and lots of
> versions would accumulate). In the end, the deletes in the MR job were
> a bottleneck (as I understand it, but I was not on the project at that
> time).
> 
> Now, if we could hook into the compactions, then we could just always
> insert individual timestamps as new versions and do the combining of
> versions into a single timeline during compaction (as compaction needs
> to go through the complete table anyway). This would also improve our
> insertion performance (no more gets in there, just puts like in the
> first version), which is nice. We collect internet routing information,
> at 80 million records per day, with updates coming in batches every
> 5 minutes (http://ris.ripe.net). We'd like to try to be efficient before
> just throwing more machines at the problem.
> 
> Will there be anything like this on the roadmap?
> 
> 
> Cheers,
> Friso
> 
> 
> 
> On May 27, 2010, at 1:01 AM, Jean-Daniel Cryans wrote:
> 
> > Invisible. What's your need?
> >
> > J-D
> >
> > On Wed, May 26, 2010 at 3:56 PM, Vidhyashankar Venkataraman
> > <[email protected]> wrote:
> >> Is there a way to customize the compaction function (like a hook
> >> provided by the API) or is it invisible to the user?
> >>
> >> Thank you
> >> Vidhya
> >>
