Hi,

> > * In my case, the same table is shared by multiple users, each of which may have
> > a different data retention policy. Thus, I think I need to look at each and
> > every row and check if it's considered "expired" and thus ready for deletion.
> > Ideally, I'd associate a TTL when I Put a row and HBase would automagically
> > remove it when its time is up, but I don't think TTLs per row are doable, and
> > neither is automagical expiration, right?
> >
>
> TTLs are per column family though the TTL you talk of above seems
> different than the CF TTL. You want a row-based TTL?

Right. Imagine I have 3 users that each have different TTL for their data.
Then I think I need something like this (key, data, expiration date):

user1_key1    data    2011-04-01
user2_key1    data    2022-01-01
user3_key1    data    -1 (say that -1 means "never expire - keep")

> > * Is the only option to have a column with the expiration timestamp, and have a
> > nightly MR job that does a full table scan and purges all expired rows?
>
> I don't know any other way.
>
> > Wouldn't that be *super* costly because *all* data would have to be read from
> > disk just for this one thing?
>
> Yeah, it'd be costly unless you added this 'meta' info into a separate CF.

Right, separate CF.

> > And this would evict all good stuff from the OS
> > cache (and maybe block cache and memstore?) Is there a better way?
> >
>
> Not from blockcache or memstore. Scans usually by-pass blockcache
> IIRC (If I have it wrong here, I know there is a flag to set on Scans
> to say whether to go via blockcache or not).

OK, good :) And I suppose the OS dirtying is minimized by having a CF with
*just* this date/timestamp Column in it?

> > * Are there specific recommendations for how to define tables to be able to
> > efficiently remove batches of rows on a regular basis?
> >
>
> We're used to TTLs or max versions setting on the column family
> schema. You want something more exotic Otis.
>
> Why remove the data at all? Or why not just let hbase do its TTL
> cleanup. Is it space you are worried about?

Yeah, it's about the space and cost associated with it. I'd love to let HBase do its TTL, but it looks like a single TTL is for all rows in a given CF, which means that I'd have to have my different users use different CFs instead of having their data in the same CF. Is that what you mean?

Thanks,
Otis
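
P.S. To make the purge idea concrete, here is a minimal sketch in plain Python of the per-row check the nightly job would do. It is not HBase code: the names `is_expired` and `keys_to_purge` are made up for illustration, the `rows` list stands in for whatever a scan over an expiration-only CF would return, and treating a row as expired on its expiration date itself is my assumption.

```python
from datetime import date

# Sentinel from the example in this thread: -1 means "never expire - keep".
NEVER_EXPIRES = "-1"


def is_expired(expiration: str, today: date) -> bool:
    """Return True if the stored expiration value says the row is due for deletion."""
    if expiration == NEVER_EXPIRES:
        return False
    # Assumption: a row counts as expired on its expiration date, not the day after.
    return date.fromisoformat(expiration) <= today


def keys_to_purge(rows, today):
    """Collect the keys of expired rows.

    rows: iterable of (row_key, expiration) pairs, i.e. only the small
    expiration column, never the data itself.
    """
    return [key for key, expiration in rows if is_expired(expiration, today)]


# The three example rows from above, reduced to (key, expiration date).
rows = [
    ("user1_key1", "2011-04-01"),
    ("user2_key1", "2022-01-01"),
    ("user3_key1", "-1"),
]

print(keys_to_purge(rows, date(2011, 6, 1)))
```

Run with a "today" of 2011-06-01, only user1_key1 comes back as purgeable. The real job would then issue Deletes for the returned keys.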
