The TTL is per column family, but I think you could still manipulate it further. I have no idea if this will work in practice, but I've had success using versions/timestamps for other reasons in the past and this idea just came to me. YMMV.
Determine the maximum amount of time you'll ever want to keep data around. You mentioned 30 days, so let's use that. The timestamps of cell versions are generated automatically by HBase as System.currentTimeMillis(), but you can easily set them to something else instead. If you know at insertion time how long a piece of data should stick around, set the timestamp of the Put, with org.apache.hadoop.hbase.client.Put.add(byte[] family, byte[] qualifier, long ts, byte[] value), to System.currentTimeMillis() - 30 days + <real TTL>. You now have per-cell TTLs, so to speak.

Like I said, I'd test that this actually works, and maybe someone else can chime in as to whether this sort of version "abuse" would be frowned upon, but I think it may get the job done :).
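Something like the following rough, untested sketch (the table name, family, qualifier, and row key are just placeholders I made up, and it assumes the column family's TTL is set to the 30-day maximum):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PerCellTtlPut {

  // Maximum lifetime you'd ever need; the column family's TTL is set to this.
  private static final long MAX_TTL_MS = 30L * 24 * 60 * 60 * 1000; // 30 days

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // placeholder table name
    try {
      // Suppose this particular cell only needs to live for 3 days.
      long desiredTtlMs = 3L * 24 * 60 * 60 * 1000;

      // Back-date the cell so that (timestamp + 30-day family TTL)
      // lands desiredTtlMs from now.
      long ts = System.currentTimeMillis() - MAX_TTL_MS + desiredTtlMs;

      Put put = new Put(Bytes.toBytes("rowkey-123"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), ts, Bytes.toBytes("value"));
      table.put(put);
    } finally {
      table.close();
    }
  }
}

One caveat: the timestamp now encodes the expiry rather than the real insertion time, so this only works if nothing else relies on the cell timestamps, and as with normal TTL the expired cells are only physically removed when a compaction runs.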
- Bryan

On Wed, Dec 21, 2011 at 1:03 PM, Richard Lawrence <[email protected]> wrote:

> Hi
>
> I was wondering if I could seek some advice about data management in
> HBase? I plan to use HBase to store data that has a variable-length
> lifespan; the vast majority will be short, but occasionally the data
> lifetime will be significantly longer (3 days versus 3 months). Once the
> lifespan is over I need the data to be deleted at some point in the near
> future (within a few days is fine). I don't think I can use the standard
> TTL for this because that's fixed at the column family level. Therefore,
> my plan was to run a script every few days that looks through external
> information for what needs to be kept and then updates HBase in some way
> so that it can understand. With the data in HBase I can then use the
> standard TTL mechanism to clean up.
>
> The two ways I can think of to let HBase know are:
>
> 1. Add a co-processor that updates the timestamp on each read and then
> have my process simply read the data. I shied away from this because the
> documentation indicated the co-processor can't take row locks. Does that
> imply that it shouldn't modify the underlying data? For my use case the
> timestamp doesn't have to be perfect; the keys are created such that the
> underlying data is fixed at creation time.
> 2. Add an extra column to each row that's a cache flag and rewrite that
> at various intervals so that the timestamp updates and prevents the TTL
> from deleting it.
>
> Are there other best-practice alternatives?
>
> Thanks
>
> Richard