On 12/14/10 12:57 AM, Jonathan Gray wrote:
Hey Adam,

Do you need to scan all of the entries in order to know which ones you need to 
change the expiration of?  Or do you have that information as an input?

I don't have to scan everything, but I also can't pinpoint all the entries in advance. To avoid scanning the entire table, I'm thinking of adding a custom filter to the Scan object.

As for why you can't insert an older version, it is because HBase sorts all 
columns in descending version order regardless of insertion order.  In order to 
make the latest timestamp of a column older than an existing version, you would 
need to do an explicit delete of the existing version:

    Delete.deleteColumn(byte [] family, byte [] qualifier, long timestamp)
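
To make the ordering point concrete, here is a minimal sketch in plain Java. It is a toy in-memory model of a single column, not the HBase client API: the class name `VersionOrderSketch` and its methods are made up for illustration, and removing an entry from the map only approximates an explicit delete (real HBase writes a delete marker rather than removing the version in place).

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy in-memory model of one column's versions. NOT the HBase client API;
// all names here are hypothetical and exist only for illustration.
final class VersionOrderSketch {

    // Versions keyed by timestamp and kept newest-first, mirroring the
    // descending version order described above.
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    // Store a value under an explicit timestamp.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Simplified stand-in for deleting one exact version of the column.
    void deleteVersion(long timestamp) {
        versions.remove(timestamp);
    }

    // What a single-version read would see: the highest timestamp wins,
    // regardless of the order in which the values were inserted.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        VersionOrderSketch cell = new VersionOrderSketch();
        cell.put(2000L, "expires-late");
        cell.put(1000L, "expires-early"); // inserted later, but older timestamp

        System.out.println(cell.latest()); // prints "expires-late"

        cell.deleteVersion(2000L);         // remove the newer version explicitly
        System.out.println(cell.latest()); // prints "expires-early"
    }
}
```

In this model the second put is invisible to a latest-version read until the newer version is removed, which is the behaviour the explicit delete works around.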

I figured I might need to do something of this sort. Is there a way to subclass TableMapper such that it can output both a Delete and a Put, or does this have to be a multi-pass process?

An alternative approach would be to allow storing multiple versions of your 
columns.  At read time, you would get all versions and could resolve which to 
use based on some piece of metadata you could store (with the real timestamp so 
you know which is latest).

If you're going to need fine-grained control and flexibility on TTL policies, 
you might just set HBase to the maximum possible and rely on application logic 
/ metadata stored in HBase.
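
The read-time resolution described above could look roughly like the following sketch. It is plain Java over an in-memory list, not HBase code. The `Version` class and its `cellTimestamp` and `writtenAt` fields are assumptions made for illustration: `cellTimestamp` stands for the timestamp the store sorts and expires by, and `writtenAt` for the stored metadata holding the real write time.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model of resolving versions at read time. NOT the HBase client API;
// every name below is hypothetical.
final class ReadTimeResolveSketch {

    // One stored version of a column.
    static final class Version {
        final long cellTimestamp; // timestamp the store sorts and expires by
        final long writtenAt;     // metadata: when the value was really written
        final String value;

        Version(long cellTimestamp, long writtenAt, String value) {
            this.cellTimestamp = cellTimestamp;
            this.writtenAt = writtenAt;
            this.value = value;
        }
    }

    // Given every version returned by a read, pick the one that was written
    // most recently according to the stored metadata.
    static Optional<Version> resolve(List<Version> allVersions) {
        return allVersions.stream()
                .max(Comparator.comparingLong((Version v) -> v.writtenAt));
    }

    public static void main(String[] args) {
        List<Version> versions = Arrays.asList(
                new Version(5000L, 100L, "original"),
                new Version(3000L, 200L, "rewritten-with-earlier-expiry"));

        // The second entry has the older cell timestamp but the newer
        // real write time, so it is the one the application should use.
        System.out.println(resolve(versions).get().value);
        // prints "rewritten-with-earlier-expiry"
    }
}
```

The trade-off is that every read has to fetch all versions and apply this logic in the application, in exchange for not depending on the store's own version ordering.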

We actually considered doing it this way and running an intermittent MR job to delete all expired entries, but that job turned out to be pretty long-running and expensive given the size of the tables we're dealing with. From talking to one of the Cloudera engineers, it sounded like there were plans for a bulk deletion tool, but for the time being we decided to go with this method, as HBase seems to handle it well enough.

I'm not sure what exactly the load patterns or requirements are for your 
application so not sure what the best approach might be.  I commend you for a 
creative use of TTLs and versioning :)

In general, we just want the data in HBase to be time-limited (e.g., staying around for only a month). This particular bit I'm working on is a relatively rare use case and would apply to only a tiny subset of a table.

Thanks
- Adam
