RE: Suggested and max number of CFs per table

Michael Segel Thu, 17 Mar 2011 11:35:49 -0700

Otis,

Patrick raised an issue that might be of concern... region splits.


But barring that... what makes the most sense on retention policies?

The point is that it's a business issue that will be driving the logic.

Depending on a clarification from Patrick or JGray or JDCryans... you may want 
to consider separate tables using the same key.
You could also use a single table and run a sweeper every night that deletes 
the rows, and then do a major compaction after hours.
(Again you would have to account for the maintenance window.)
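
A rough sketch of the selection logic such a nightly sweeper might use, in plain Python. Everything here is hypothetical: the tenant names, the retention table, and the dict standing in for a table scan are illustration only, not HBase API calls. The major compaction afterwards is a separate step and is not shown.

```python
from datetime import datetime, timedelta

# Hypothetical retention policies, in days; None means "keep indefinitely".
RETENTION_DAYS = {"tenant-a": 30, "tenant-b": 180, "tenant-c": None}


def expired_row_keys(rows, retention_days, now):
    """Return the keys of rows that have outlived their tenant's retention.

    rows maps row key -> (tenant id, write time); this dict is a stand-in
    for a table scan, not an HBase API.
    """
    expired = []
    for key, (tenant, written_at) in rows.items():
        days = retention_days.get(tenant)
        if days is None:
            continue  # indefinite retention (or unknown tenant): never sweep
        if now - written_at > timedelta(days=days):
            expired.append(key)
    return sorted(expired)


def sweep(rows, retention_days, now):
    """Delete expired rows in place and return how many were removed."""
    doomed = expired_row_keys(rows, retention_days, now)
    for key in doomed:
        del rows[key]
    return len(doomed)
```

With a single table the retention policy lives in the sweeper rather than in a per-CF TTL, so a tenant changing policy only means updating the lookup table; no data has to be copied between CFs.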

HTH

-Mike


> Date: Thu, 17 Mar 2011 10:38:11 -0700
> From: [email protected]
> Subject: Re: Suggested and max number of CFs per table
> To: [email protected]
> 
> Hi,
> 
> 
> > Patrick,
> > 
> > Perhaps I misunderstood Otis' design.
> > 
> > I thought he'd create the CF based on duration.
> > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > So that you set up the table once with all CFs.
> > Then you'd write the data to one and only one of those buckets.
> 
> That's right.
> 
> > The only time you'd have a problem is if you have a tenant who switches
> > their retention policy.
> >
> > Although you could still move the data into another CF so that you still
> > only query one CF for data.
> 
> That's right.  Say a tenant decides to switch from keeping his data for 1
> month to keeping it for 6 months.
> Then we'd have to:
> 1) start writing new data for this tenant to the 6-month CF
> 2) copy this tenant's old data from 1-month CF to the 6-month CF
> 3) purge/delete old data for this tenant from 1-month CF
> 
> If the tenant wants to go from 6-months to 1-month then we'd additionally
> want to limit copying in step 2) above to just the last 1 month of data and
> drop the rest.
> 
> To answer Mike's questions from his other reply:
> 
> > What's the data access patterns? Are they discrete between tenants?
> > As long as the data access is discrete between tenants and the tenants
> > write to only one bucket, you can do what you suggest.
> 
> Yes, data for a given tenant would be written to just 1 of those CFs.
> 
> > But here's something to consider...
> > You are going to want to know your tenant's retention policy before you
> > attempt to get the data. This means you read from one column family when
> > you do your get() and not all of them, right? ;-)
> 
> Yes, when reading the data I'd know the tenant's retention policy and based
> on that I'd know from which CF to get the data.
> 
> 
> So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> 
> 
> > With respect to your discussion on region splits...
> > So you're saying that if one CF splits then all of the CFs are affected
> > and split as well?
> 
> http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
> compactions, not splits, but from what I understand flushes can trigger splits
> because they increase the aggregate size of MapFiles, which at some point
> causes Region splitting.  Please correct me if I'm wrong. :)
> 
> So this is also what I wanted to verify.  As you can imagine, there are likely
> to be more tenants with a 1-month data retention policy than with 1-year or
> "forever" data retention.  So that 1-month CF will grow much more quickly and,
> if I understand the above section in the HBase book correctly, it means that it
> will cause all other CFs' files to split (even if they are not big enough yet),
> which means more disk and network IO.
> 
> That is, if all those CFs are in the same table.  If they are in different 
> tables then this would not happen?
> 
> Thanks,
> Otis
> 
> 
> 
> > >  Date: Thu, 17 Mar 2011 11:26:35 -0400
> > > Subject: Re: Suggested and max  number of CFs per table
> > > From: [email protected]
> > > To: [email protected]
> > > CC: [email protected]
> > > 
> > > Otis,
> > > 
> > > > Perhaps your biggest issue will be the need to disable the table to add a
> > > > new CF. So effectively you need to bring down the application to move in
> > > > a new tenant.
> > > 
> > > > Another thing with multiple CFs is that if one CF tends to get
> > > > disproportionately more data, you will get a lot of region splitting, and
> > > > the other CFs will have HFiles for a region that are very small.
> > > 
> > > > I think the only reasonable use of CFs is if you really need row-level
> > > > atomicity across CFs. Otherwise just use multiple tables.
> > > 
> > > 
> > > On Thu, Mar  17, 2011 at 2:30 AM, Otis Gospodnetic <
> > > [email protected]>  wrote:
> > > 
> > > > Hi,
> > > >
> > > > > My Q is around the suggested or maximum number of CFs per table (see
> > > > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > > > >
> > > > > Consider the following use-case.
> > > > > * A multi-tenant system.
> > > > > * All tenants write data to the same table.
> > > > > * Tenants have different data retention policies.
> > > >
> > > > > For the above use case I thought one could then just have different CFs
> > > > > with different TTLs because Stack suggested relying on HBase's ability to
> > > > > purge old rows by applying CF-specific TTLs:
> > > > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > > > These CFs would have the same set of columns, just different TTLs. Then
> > > > > tenants who want to keep only the last 1 month's worth of data go to the CF
> > > > > where TTL=1 month, tenants who want to keep the last 6 months of data go to
> > > > > the CF where TTL=6 months, and so on. However, tenants are not going to be
> > > > > evenly distributed - there will be more tenants with shorter data retention
> > > > > periods, which means the CFs where these tenants have their data will grow
> > > > > faster.
> > > >
> > > > > If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs
> > > > > correctly, the advice is not to have more than 2-3 CFs per table?
> > > > > And what happens if I have say 6 CFs per table?
> > > >
> > > > > Again if I read the above page correctly, the problem is that uneven data
> > > > > distribution will mean that whenever 1 of my CFs needs to be flushed, the
> > > > > remaining 5 CFs will also get flushed at the same time, and this may (or
> > > > > will?) trigger compaction for all CFs' files, creating a sudden IO hit?
> > > >
> > > > > Is there a good solution for this problem?
> > > > > Should one then have 6 different tables, each with just 1 CF, instead of
> > > > > having 1 table with 6 CFs?
> > > >
> > > > Thanks,
> > > > Otis
> > > >  ----
> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > Lucene ecosystem  search :: http://search-lucene.com/
> > > >
> > > >
> >
