Hi,

> Patrick raised an issue that might be of concern... region  splits.

Right.  And if I understand correctly, if I want to have multiple CFs that grow 
unevenly, these region splits are something I have to then be willing to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that it's a business issue that will be driving the logic.

The exact business requirement is not defined yet.  Say 3-4 retention policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may
> want to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes
> the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right.  The reason I am even thinking about CF-per-retention-policy is that I
am afraid of a big and expensive nightly scan-and-delete.  That said, I don't
actually know whether, or how, expensive this scan would be.  So I'm trying to
understand the pros and cons ahead of time.  Maybe I'm prematurely optimizing,
but since this feels like a big structural/architectural change, I thought it
would be worth "getting it right" before I have lots of tenants and their data
in the system.
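
To make the cost question concrete, here is a minimal sketch of what a nightly
scan-and-delete has to decide. It is plain Python over in-memory tuples, not
HBase client code; the policy names, the RETENTION_DAYS table, and
expired_row_keys() are all hypothetical and only illustrate the logic.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policies: retention in days, None meaning "keep forever".
RETENTION_DAYS = {"1-month": 30, "6-month": 180, "forever": None}


def expired_row_keys(rows, tenant_policy, now):
    """Return the keys of rows older than their tenant's retention window.

    rows: iterable of (row_key, tenant, written_at) tuples.
    tenant_policy: dict mapping tenant -> policy name in RETENTION_DAYS.
    """
    expired = []
    for row_key, tenant, written_at in rows:  # every row is examined
        days = RETENTION_DAYS[tenant_policy[tenant]]
        if days is not None and written_at < now - timedelta(days=days):
            expired.append(row_key)
    return expired


now = datetime(2011, 3, 17, tzinfo=timezone.utc)
rows = [
    ("a-1", "tenantA", now - timedelta(days=45)),
    ("a-2", "tenantA", now - timedelta(days=5)),
    ("b-1", "tenantB", now - timedelta(days=45)),
]
policy = {"tenantA": "1-month", "tenantB": "6-month"}
print(expired_row_keys(rows, policy, now))
```

The loop visits every row no matter how few have expired, so in this model the
sweep cost tracks table size. Whether that is expensive in practice is exactly
the open question.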

Thank you everyone!

Otis


> HTH
> 
> -Mike
> 
> 
> > Date: Thu, 17 Mar  2011 10:38:11 -0700
> > From: [email protected]
> >  Subject: Re: Suggested and max number of CFs per table
> > To: [email protected]
> > 
> >  Hi,
> > 
> > 
> > > Patrick,
> > > 
> > > Perhaps I  misunderstood Otis' design.
> > > 
> > > I thought  he'd  create the CF based on duration. 
> > > So you could have a CF for  (daily,  weekly, monthly, annual, indefinite).
> > > So that you set  up the table once with  all CFs.
> > > Then you'd write the data to  one and only one of those  buckets.
> > 
> > That's right.
> > 
> > > The only time you'd have a problem is if you have a tenant who switches
> > > their retention policy.
> > >
> > > Although you could move data still in a CF so that you still only
> > > query one CF for data.
> > 
> > That's right.  Say a tenant decides to switch from keeping his data for
> > 1 month to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from the 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from the 1-month CF
> >
> > If the tenant wants to go from 6 months to 1 month then we'd additionally
> > want to limit copying in step 2) above to just the last 1 month of data and
> > drop the rest.
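
The three steps above, including the shortened-retention case, can be sketched
as follows. This is an in-memory model in plain Python, not HBase client code:
the table is a dict of CF name -> {(tenant, row_key): (written_at, value)}, and
switch_retention() and the CF names are hypothetical. The sketch keeps each
cell's original write time when copying, on the assumption that expiry should
be judged from when the data was first written.

```python
from datetime import datetime, timedelta, timezone


def switch_retention(table, tenant_cf, tenant, new_cf, new_days, now):
    """Move one tenant's data to new_cf; return (moved, dropped) counts.

    new_days is the new retention in days, or None for "keep forever".
    """
    old_cf = tenant_cf[tenant]
    tenant_cf[tenant] = new_cf  # step 1: new writes go to the new CF
    target = table.setdefault(new_cf, {})
    cutoff = None if new_days is None else now - timedelta(days=new_days)
    moved = dropped = 0
    for key in [k for k in table[old_cf] if k[0] == tenant]:
        written_at, value = table[old_cf].pop(key)  # step 3: purge from old CF
        if cutoff is None or written_at >= cutoff:
            target[key] = (written_at, value)  # step 2: copy, same write time
            moved += 1
        else:
            dropped += 1  # outside the new, shorter window
    return moved, dropped


now = datetime(2011, 3, 17, tzinfo=timezone.utc)
table = {
    "cf_6m": {
        ("tenantA", "r1"): (now - timedelta(days=10), "x"),
        ("tenantA", "r2"): (now - timedelta(days=90), "y"),
        ("tenantB", "r1"): (now - timedelta(days=90), "z"),
    },
    "cf_1m": {},
}
tenant_cf = {"tenantA": "cf_6m", "tenantB": "cf_6m"}
print(switch_retention(table, tenant_cf, "tenantA", "cf_1m", 30, now))
```

Going from a shorter to a longer policy uses the same function: nothing falls
outside the new window, so everything is copied.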
> > 
> > To  answer Mike's questions from his other reply:
> > 
> > > What's the  data access patterns? Are they discrete between tenants?
> > > As long as  the data access is discrete between tenants and the tenants 
>write to 
>
> >  >only one bucket, you can do what you suggest.
> > 
> > Yes, data for  a given tenant would be written to just 1 of those CFs.
> > 
> > >  But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you
> > > attempt to get the data. This means you read from one column family when
> > > you do your get() and not all of them, right? ;-)
> > 
> > Yes, when reading the data I'd know the tenant's retention policy and
> > based on that I'd know from which CF to get the data.
> > 
> > 
> > So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> > 
> > 
> > > With respect to your discussion on region splits...
> > > So you're saying that if one CF splits then all of the CFs are affected
> > > and split as well?
> > 
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
> > compactions, not splits, but from what I understand flushes can trigger
> > splits because they increase the aggregate size of MapFiles, which at some
> > point causes Region splitting.  Please correct me if I'm wrong. :)
> > 
> > So this is also what I wanted to verify.  As you can imagine, there are
> > likely to be more tenants with a 1-month data retention policy than with
> > 1-year or "forever" data retention.  So that 1-month CF will grow much more
> > quickly and, if I understand the above section in the HBase book correctly,
> > it means that it will cause all other CFs' files to split (even if they are
> > not big enough yet), which means more disk and network IO.
> > 
> > That is, if all those CFs are in the same  table.  If they are in different 
> > tables then this would not  happen?
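
The concern above can be illustrated with a toy model. This is not HBase's
implementation; it only encodes the behaviour as described in this thread:
flushing happens per region, so once the combined in-memory size across all
CFs crosses a threshold, every CF with pending data writes a file.
simulate_flushes(), the CF names, and the sizes are hypothetical.

```python
def simulate_flushes(writes, flush_threshold):
    """Return {cf: [flushed file sizes]} for a stream of (cf, size) writes.

    Assumption: a flush is triggered by the aggregate pending size across
    all CFs, and when it fires every CF with pending data is flushed.
    """
    pending = {}
    files = {}
    for cf, size in writes:
        pending[cf] = pending.get(cf, 0) + size
        if sum(pending.values()) >= flush_threshold:
            for name, amount in pending.items():
                if amount > 0:
                    files.setdefault(name, []).append(amount)
            pending = {name: 0 for name in pending}
    return files


# The busy CF receives 9 units for every 1 unit written to the quiet CF.
writes = [("cf_1m", 9), ("cf_1y", 1)] * 10
print(simulate_flushes(writes, flush_threshold=50))
```

In this model the quiet CF ends up with as many files as the busy one, each a
fraction of the size, which matches the "very small HFiles" remark earlier in
the thread. With one CF per table, each table would be flushed on its own
schedule under the same assumption.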
> > 
> > Thanks,
> > Otis
> > 
> > 
> > 
> >  > >  Date: Thu, 17 Mar 2011 11:26:35 -0400
> > > > Subject:  Re: Suggested and max  number of CFs per table
> > > > From: [email protected]
> > > >  To: [email protected]
> > > >  CC: [email protected]
> >  > > 
> > > > Otis,
> > > > 
> > > > Perhaps your biggest issue will be the need to disable the table to
> > > > add a new CF. So effectively you need to bring down the application to
> > > > move in a new tenant.
> > > > 
> > > > Another thing with multiple CFs is that if one CF tends to get
> > > > disproportionally more data, you will get a lot of region splitting,
> > > > and the other CFs will have HFiles for a region that are very small.
> > > > 
> > > > I think the only reasonable use of CFs is if you really need row-level
> > > > atomicity across CFs. Otherwise just use multiple tables.
> > > > 
> > > > 
> > > > On Thu,  Mar  17, 2011 at 2:30 AM, Otis Gospodnetic <
> > > > [email protected]>   wrote:
> > > > 
> > > > > Hi,
> > > >  >
> > > > > My Q is around the suggested or maximum number of CFs per table (see
> > > > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > > > >
> > > > > Consider the following use-case.
> > > > > * A multi-tenant system.
> > > > > * All tenants write data to the same table.
> > > > > * Tenants have different data retention policies.
> > > > >
> > > > > For the above use case I thought one could then just have different
> > > > > CFs with different TTLs because Stack suggested relying on HBase's
> > > > > ability to purge old rows by applying CF-specific TTLs:
> > > > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > > > These CFs would have the same set of columns, just different TTLs.
> > > > > Then tenants who want to keep only last 1 month's worth of data go to
> > > > > the CF where TTL=1 month, tenants who want to keep last 6 months of
> > > > > data go to CF where TTL=6 months, and so on.  However, tenants are not
> > > > > going to be evenly distributed - there will be more tenants with
> > > > > shorter data retention periods, which means the CFs where these
> > > > > tenants have their data will grow faster.
> > > >  >
> > > > > If I'm reading
> > > > > http://hbase.apache.org/book/schema.html#number.of.cfs correctly,
> > > > > the advice is not to have more than 2-3 CFs per table?
> > > > > And what happens if I have say 6 CFs per table?
> > > > >
> > > > > Again if I read the above page correctly, the problem is that uneven
> > > > > data distribution will mean that whenever 1 of my CFs needs to be
> > > > > flushed, the remaining 5 CFs will also get flushed at the same time,
> > > > > and this may (or will?) trigger compaction for all CFs' files,
> > > > > creating a sudden IO hit?
> > > > >
> > > > > Is there a good solution for this problem?
> > > > > Should one then have 6 different tables, each with just 1 CF instead
> > > > > of having 1 table with 6 CFs?
> > > >  >
> > > > > Thanks,
> > > > > Otis
> > >  > >  ----
> > > > > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
> > > > > Lucene ecosystem  search :: http://search-lucene.com/
> > > > >
> > > >  >