Can you say a bit more about your data organization?

Are you storing transactions of some kind? If so, can your key involve time?
I think that putting some extract of time (day number, perhaps) as a leading
component of the key would group data that expires together.
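To make the idea concrete, here is a minimal sketch of a row key with a leading day-number component. The helper name, the zero-padding width, and the `customerId`/`txId` fields are all hypothetical, not anything from HBase itself; the only point is that rows from the same day sort adjacently and can be scanned or deleted as a contiguous range.

```java
public class DayPrefixedKey {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Hypothetical key scheme: zero-padded days-since-epoch, then customer,
    // then transaction id. Rows from one day form one contiguous key range.
    static String dayPrefixedKey(long epochMillis, String customerId, String txId) {
        long dayNumber = epochMillis / DAY_MS;
        return String.format("%06d:%s:%s", dayNumber, customerId, txId);
    }

    public static void main(String[] args) {
        // Epoch millis 0 is day 0, so this prints 000000:cust42:tx0001
        System.out.println(dayPrefixedKey(0L, "cust42", "tx0001"));
    }
}
```

With such a prefix, "delete everything older than N days" becomes a bounded range delete rather than a full-table scan.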

Are you storing profiles where the key is a user (or similar) id and the
data is essentially a list of transactions?  If so, can you segregate
transactions into separate column families that can be dropped as data
expires?
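One way to segregate by time is to route each transaction into a per-period column family whose name encodes the period, so an entire family can be removed once it has fully expired (in HBase, deleting a family is an alter on a disabled table). The monthly naming scheme below is a hypothetical sketch, not an HBase API; note also that HBase generally recommends keeping the number of column families small, so coarse periods (months, quarters) fit this pattern better than daily ones.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PeriodFamily {
    // Hypothetical scheme: one column family per calendar month, e.g.
    // "t_2011_05". Dropping the family deletes that whole month's data.
    static String familyFor(long epochMillis) {
        return DateTimeFormatter.ofPattern("'t_'yyyy_MM")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(familyFor(0L)); // t_1970_01
    }
}
```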

When you say data expiration varies by customer, is that really necessary or
can you have a lowest common denominator for actual deletions with rules
that govern how much data is actually visible to the consumer of the data?
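The lowest-common-denominator idea can be sketched as: set one physical TTL equal to the longest retention any customer needs, and enforce the shorter per-customer policies at read time. The retention table and method names below are hypothetical application-level code, not part of HBase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class VisibilityFilter {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Hypothetical per-customer retention in days. Physical deletion (TTL,
    // compaction) uses the longest value; reads enforce the shorter ones.
    static final Map<String, Integer> RETENTION_DAYS = new HashMap<>();
    static {
        RETENTION_DAYS.put("cust42", 30);
        RETENTION_DAYS.put("cust99", 90);
    }

    // Keep only cell timestamps still visible to this customer.
    static List<Long> visible(String customer, List<Long> cellTimestamps, long nowMillis) {
        long cutoff = nowMillis - RETENTION_DAYS.getOrDefault(customer, 90) * DAY_MS;
        return cellTimestamps.stream()
                .filter(ts -> ts >= cutoff)
                .collect(Collectors.toList());
    }
}
```

A policy change then only updates the retention table; the data on disk is untouched until the common TTL catches up with it.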

On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <[email protected]> wrote:

> Hi All,
> In my company we are currently working hard on deploying our cluster with
> HBase.
>
> We are talking about ~20 nodes holding pretty big data (~1TB per day).
>
> As there is a lot of data, we need a retention method, i.e. a way to remove
> old data.
>
> The problem is that I can't/don't want to do it using TTL, for two reasons:
>
>   1. Different retention policy for different customers.
>   2. Policies might change.
>
>
> Of course, I can do it using a nightly (weekly?) MR job that runs over all
> the data and removes the old data.
> There are a few problems:
>
>   1. Running over a huge amount of data only to remove a small portion of it.
>   2. It'll be a heavy MR job.
>   3. Need to perform a major compaction afterwards - that will affect
>   performance or even stop service (is that right?).
>
> I might use BulkFileOutputFormat for that job - but I'd still have those
> problems.
>
> As my data is sorted by the retention policies (customers and time), I
> thought of this option:
>
>   1. Split regions and create a region with 'candidates to be removed'.
>   2. Drop this region.
>
>
>   - Is it possible to drop region?
>   - Do you think it's a good idea?
>   - Any other ideas?
>
> Thanks,
>
> Ophir Cohen
> LivePerson
>
