Applying a filter can prevent unneeded rows from being sent to the client (or MR 
task), but HBase will still be doing a full table scan.  This can certainly help, 
but in the end you still need to read all of the data, which is expensive.
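To make the distinction concrete, here is a plain-Java sketch (not the HBase API; the class and method names are made up for illustration) of scan-with-filter semantics: every row is still visited server-side, and the filter only controls which rows cross the wire.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not HBase code: shows why a server-side filter
// reduces rows *shipped* but not rows *read* by the scan.
public class FilteredScanSketch {
    // Visits every row (the "full table scan"); the filter only decides
    // which rows are returned to the caller.
    public static List<String> scan(List<String> table,
                                    Predicate<String> filter,
                                    int[] rowsVisited) {
        List<String> results = new ArrayList<>();
        for (String row : table) {
            rowsVisited[0]++;          // server still reads this row
            if (filter.test(row)) {
                results.add(row);      // only matches cross the wire
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("row-a", "row-b", "match-c", "row-d");
        int[] visited = new int[1];
        List<String> sent = scan(table, r -> r.startsWith("match"), visited);
        System.out.println("visited=" + visited[0] + " sent=" + sent.size());
    }
}
```

The I/O cost tracks `visited`, not `sent`, which is why a filter alone doesn't make the job scale.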

I'm not sure what kind of heuristics you're using to determine how to modify 
an entry's expiration, but it would be far more scalable if you didn't 
require full table scans.

> -----Original Message-----
> From: Adam Phelps [mailto:[email protected]]
> Sent: Tuesday, December 14, 2010 4:48 PM
> To: [email protected]
> Subject: Re: Modifying existing table entries
> 
> On 12/14/10 12:57 AM, Jonathan Gray wrote:
> > Hey Adam,
> >
> > Do you need to scan all of the entries in order to know which ones you
> need to change the expiration of?  Or do you have that information as an
> input?
> 
> I don't have to scan everything, but I also can't pinpoint all the entries in
> advance.  My thought to avoid scanning the entire table is to add a custom
> filter to the scan object.
> 
> > As for why you can't insert an older version, it is because HBase sorts all
> columns in descending version order regardless of insertion order.  In order
> to make the latest timestamp of a column older than an existing version, you
> would need to do an explicit delete of the existing version:
> >
> >     Delete.deleteColumn(byte[] family, byte[] qualifier, long timestamp)
> 
> I figured I might need to do something of this sort.  Is there a way to 
> subclass
> TableMapper such that it can output both a Delete and a Put, or does this
> have to be a multi-pass process?
> 
> > An alternative approach would be to allow storing multiple versions of your
> columns.  At read time, you would get all versions and could resolve which to
> use based on some piece of metadata you could store (with the real
> timestamp so you know which is latest).
> >
> > If you're going to need fine-grained control and flexibility on TTL 
> > policies,
> you might just set HBase to the maximum possible and rely on application
> logic / metadata stored in HBase.
> 
> We actually considered doing it this way and running an intermittent MR job
> to delete all expired entries, but that job turned out to be pretty long-
> running and expensive given the size of the tables we're dealing with.
> Talking to one of the Cloudera engineers it sounded like there were plans for
> a bulk deletion tool, but for the time being we decided to go with this
> method as HBase seems to handle it well enough.
> 
> > I'm not sure what exactly the load patterns or requirements are for
> > your application so not sure what the best approach might be.  I
> > commend you for a creative use of TTLs and versioning :)
> 
> In general we just want the data in HBase to be time-limited (i.e. only
> staying around for a month).  This particular bit I'm working on is a
> relatively rare use case, and would only apply to a tiny subset of a table.
> 
> Thanks
> - Adam
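
The descending-version ordering described in the quoted thread can be sketched with a reverse-ordered map (plain Java, not the HBase API): inserting a cell version with an older timestamp never changes what a plain read returns, which is why an explicit delete of the newer version is needed.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch, not HBase code: versions of a single cell, keyed by
// timestamp and sorted descending, as HBase stores them.
public class VersionOrderSketch {
    // Returns what a plain Get would see: the highest timestamp wins.
    public static String latest(NavigableMap<Long, String> versions) {
        return versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());
        versions.put(200L, "current");
        versions.put(100L, "older");           // inserting an older version...
        System.out.println(latest(versions));  // ...still reads "current"

        // Only an explicit delete of the newer version exposes the older one,
        // mirroring Delete.deleteColumn(family, qualifier, timestamp):
        versions.remove(200L);
        System.out.println(latest(versions));  // now "older"
    }
}
```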
