On Wed, Mar 9, 2011 at 8:04 PM, Otis Gospodnetic
<[email protected]> wrote:
> Hi,
>
> For some reason there are suddenly lots of questions about purging old data.
> I'm looking at the same thing and was wondering:
>
> * In my case, the same table is shared by multiple users, each of which may 
> have
> a different data retention policy.  Thus, I think I need to look at each and
> every row and check if it's considered "expired" and thus ready for deletion.
> Ideally, I'd associate a TTL when I Put a row and HBase would automagically
> remove it when its time is up, but I don't think TTLs per row are doable, and
> neither is automagical expiration, right?
>

TTLs are per column family, though the TTL you talk of above seems
different from the CF TTL.  You want a row-based TTL?
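For reference, the CF-level TTL is set in the schema via the HBase
shell; a minimal sketch ('mytable' and 'cf' are placeholder names, TTL
is in seconds, and on current releases you have to disable the table
first):

```shell
# In the HBase shell: give the 'cf' family a 7-day TTL.
disable 'mytable'
alter 'mytable', {NAME => 'cf', TTL => 604800}
enable 'mytable'
```

Cells older than the TTL are then dropped on major compaction with no
scan job needed, but that is one policy for the whole family, not per row.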


> * Is the only option to have a column with the expiration timestamp, and have 
> a
> nightly MR job that does a full table scan and purges all expired rows?

I don't know any other way.

> Wouldn't that be *super* costly because *all* data would have to be read from
> disk just for this one thing?


Yeah, it'd be costly unless you added this 'meta' info into a separate CF.
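To sketch the check such a nightly job would apply: assuming the
expiration timestamp is kept in its own small 'meta' CF as above, each
row's purge decision is just a compare against "now".  This is plain
Java with no HBase wiring (the map stands in for rowkey ->
expiration-column values; all names are hypothetical); the real job
would scan only the meta CF and emit a Delete per expired rowkey:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PurgeSketch {
    // Given rowkey -> expiration timestamp (epoch millis), return the
    // keys whose time is up.  In the real MR job each mapper would read
    // the expiration column from its Result and emit a Delete instead.
    static List<String> expiredRows(Map<String, Long> expirations, long now) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : expirations.entrySet()) {
            if (e.getValue() <= now) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        Map<String, Long> rows = new LinkedHashMap<>();
        rows.put("user1-row1", 1000L);  // expired at t=2000
        rows.put("user2-row1", 5000L);  // still live at t=2000
        System.out.println(expiredRows(rows, 2000L)); // [user1-row1]
    }
}
```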

> And this would evict all good stuff from the OS
> cache (and maybe block cache and memstore?)  Is there a better way?
>

Not from blockcache or memstore.  Scans usually bypass the blockcache,
IIRC (if I have it wrong here, I know there is a flag you can set on a
Scan to say whether to go via the blockcache or not).
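The flag in question is Scan.setCacheBlocks; a sketch of how the purge
scan might configure it (client API circa 0.90; the 'meta'/'expires'
column names are placeholders):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Configure a full-table scan that skips the block cache so the
// nightly purge doesn't evict hot data, and reads only the small
// expiration column rather than all the user data.
Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("expires"));
```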


> * Are there specific recommendations for how to define tables to be able  to
> efficiently remove batches of rows on a regular basis?
>

We're used to TTL or max-versions settings on the column family
schema.  You want something more exotic, Otis.

Why remove the data at all?  Or why not just let HBase do its TTL
cleanup?  Is it space you are worried about?

St.Ack
