Hi all, at my company we are currently working hard on deploying our cluster with HBase.
We are talking about ~20 nodes holding a fairly large amount of data (~1 TB per day). Since there is so much data, we need a retention mechanism, i.e. a way to remove old data. The problem is that I can't (and don't want to) use TTL, for two reasons:
1. Different customers have different retention policies.
2. A policy might change over time.

Of course, I could run a nightly (weekly?) MR job over all the data that removes the old rows, but that has a few problems:
1. It scans a huge amount of data only to remove a small portion of it.
2. It is a heavy MR job.
3. It requires a major compaction afterwards, which will hurt performance or perhaps even stop the service (is that right???).

I could use BulkFileOutputFormat for that job, but the problems above remain.

Since my data is sorted by the retention keys (customer and time), I thought of this option:
1. Split regions so that one region holds only the 'candidates for removal'.
2. Drop that region.

- Is it possible to drop a region?
- Do you think this is a good idea?
- Any other ideas?

Thanks,
Ophir Cohen
LivePerson
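For what it's worth, here is a minimal sketch of the per-customer retention test that such a job's mapper could apply to each row key. It assumes (purely for illustration) that row keys look like "<customer>|<epochMillis>", matching the customer-and-time sort order described above, and that the per-customer policies live in an in-memory map; the customer names and retention periods are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: decide whether a row is past its customer's retention
// window. A nightly MR job's mapper could emit a Delete for rows where
// isExpired(...) returns true.
public class RetentionCheck {
    // Assumed: retention period per customer, in days (e.g. loaded from config).
    private final Map<String, Integer> retentionDays = new HashMap<>();

    public RetentionCheck() {
        retentionDays.put("customerA", 30);
        retentionDays.put("customerB", 90);
    }

    // Assumed row-key layout: "<customer>|<epochMillis>".
    public boolean isExpired(String rowKey, long nowMillis) {
        String[] parts = rowKey.split("\\|");
        String customer = parts[0];
        long ts = Long.parseLong(parts[1]);
        Integer days = retentionDays.get(customer);
        if (days == null) {
            return false; // no policy for this customer -> keep the row
        }
        long cutoff = nowMillis - TimeUnit.DAYS.toMillis(days);
        return ts < cutoff; // older than the cutoff -> candidate for removal
    }

    public static void main(String[] args) {
        RetentionCheck rc = new RetentionCheck();
        long now = 1_700_000_000_000L;
        long fortyDaysAgo = now - TimeUnit.DAYS.toMillis(40);
        // 40 days old: past customerA's 30-day policy, within customerB's 90-day policy.
        System.out.println(rc.isExpired("customerA|" + fortyDaysAgo, now)); // true
        System.out.println(rc.isExpired("customerB|" + fortyDaysAgo, now)); // false
    }
}
```

The same predicate could just as well drive the region-split approach: the cutoff timestamps define the split keys that separate the 'candidates for removal' from live data.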
