If you change your key to "date - customer id - time stamp - session id"
then you shouldn't lose any important
data locality, but you would be able to delete things more efficiently.
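
Something like this is what I have in mind for building the key (the field
types and the fixed-width yyyyMMdd date prefix are just guesses on my part,
not something from your schema):

    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical builder for a "date - customer id - time stamp - session id"
    // row key.  A fixed-width date prefix keeps all rows for one day contiguous,
    // so an expired day is one dense key range that can be scanned or dropped.
    public class EventKey {
      public static byte[] rowKey(String yyyyMMdd, long customerId,
                                  long timestamp, String sessionId) {
        return Bytes.add(
            Bytes.add(Bytes.toBytes(yyyyMMdd), Bytes.toBytes(customerId)),
            Bytes.add(Bytes.toBytes(timestamp), Bytes.toBytes(sessionId)));
      }
    }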

For one thing, any map-reduce programs that run to do the deleting would be
doing dense scans over a small part of your data.  That might make them run
much faster.
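
For instance, the delete job could bound its scan to just the expired day
rather than the whole table.  A rough sketch (the table name and date format
are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ExpireDay {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");     // hypothetical table name

        // With a leading date, the expired day is one contiguous key range.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("20110101"));
        scan.setStopRow(Bytes.toBytes("20110102"));    // stop row is exclusive
        scan.setCaching(1000);

        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          table.delete(new Delete(r.getRow()));        // batch these in a real job
        }
        scanner.close();
        table.close();
      }
    }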

For another, you should be able to do the region switch trick and then drop
entire regions.  That has the unfortunate
side-effect of requiring that you disable the table for a short period (I
think).
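
In outline, something like this (table name made up; the actual region
removal step depends on your HBase version, so I have left it out):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class DropExpiredRegions {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        admin.disableTable("events");   // hypothetical table name
        // ... remove the expired regions here; the exact mechanics (cleaning
        // up the region directories and their .META. entries) vary by version
        // and are not shown ...
        admin.enableTable("events");
      }
    }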

On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <[email protected]> wrote:

> Thanks for the answer!
>
> A little bit more info:
> Our data is internal events grouped for sessions (i.e. group of events).
> There are different sessions for different customers.
> We're talking about millions of sessions per day.
>
> The key is *customer id - time stamp - session id*.
> So, yes, it's sorted by customer and date, and as I want to remove rows by
> customer and date - it's sorted all right.
> Actually the main motivation to remove old rows is that we have storage
> limitations (and too much data...).
>
> So, my concern is whether we can do something better than a nightly/weekly
> map-reduce job that ends with a major compaction.
> Ophir
> PS
> The majority of my customers share the same retention policy, but I still
> need the ability to change it for a specific customer.
>
>
> On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <[email protected]> wrote:
>
> > Can you say a bit more about your data organization?
> >
> > Are you storing transactions of some kind?  If so, can your key involve
> > time?  I think that putting some extract of time (day number, perhaps) as
> > a leading part of the key would help.
> >
> > Are you storing profiles where the key is the user (or something) id and
> > the
> > data is essentially a list of transactions?  If so, can you segregate
> > transactions into separate column families that can be dropped as data
> > expires?
> >
> > When you say data expiration varies by customer, is that really necessary
> > or can you have a lowest common denominator for actual deletions with rules
> > that govern how much data is actually visible to the consumer of the data?
> >
> > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <[email protected]> wrote:
> >
> > > Hi All,
> > > In my company we are currently working hard on deploying our cluster
> > > with HBase.
> > >
> > > We're talking about ~20 nodes to hold pretty big data (~1TB per day).
> > >
> > > As there is a lot of data, we need a retention method, i.e. a way to
> > > remove old data.
> > >
> > > The problem is that I can't/don't want to do it using TTL for two reasons:
> > >
> > >   1. Different retention policies for different customers.
> > >   2. Policy might be changed.
> > >
> > >
> > > Of course, I can do it using a nightly (weekly?) MR job that runs on all
> > > the data and removes the old data.
> > > There are a few problems:
> > >
> > >   1. Running over a huge amount of data only to remove a small portion of it.
> > >   2. It'll be a heavy MR job.
> > >   3. Need to perform a major compaction afterwards - that will affect
> > >   performance or even stop the service (is that right???).
> > >
> > > I might use BulkFileOutputFormat for that job - but it would still have
> > > those problems.
> > >
> > > As my data is sorted by the retention policies (customers and time), I
> > > thought of this option:
> > >
> > >   1. Split regions and create a region with 'candidates to be removed'.
> > >   2. Drop this region.
> > >
> > >
> > >   - Is it possible to drop a region?
> > >   - Do you think it's a good idea?
> > >   - Any other ideas?
> > >
> > > Thanks,
> > >
> > > Ophir Cohen
> > > LivePerson
> > >
> >
>
