Hi,

> Patrick,
> 
> Perhaps I misunderstood Otis' design.
> 
> I thought he'd create the CF based on duration.
> So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> So that you set up the table once with all CFs.
> Then you'd write the data to one and only one of those buckets.

That's right.

> The only time you'd have a problem is if you have a tenant who switches
> their retention policy.
>
> Although you could move data still in a CF so that you still only query one
> CF for data.

That's right.  Say a tenant decides to switch from keeping his data for 1 month 
to keeping it for 6 months.
Then we'd have to:
1) start writing new data for this tenant to the 6-month CF
2) copy this tenant's old data from 1-month CF to the 6-month CF
3) purge/delete old data for this tenant from 1-month CF

If the tenant instead wants to go from 6 months to 1 month, the same steps 
apply in the other direction, and we'd additionally want to limit copying in 
step 2) above to just the last 1 month of data and drop the rest.
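
To make the three steps concrete, here is a minimal sketch in Python. It is 
not HBase client code: the table is modelled as plain dicts, and the names 
migrate_tenant, write_cf, cf_1m and cf_6m are made up for illustration. It 
assumes copied cells keep their original timestamps, so the same cutoff check 
covers both the 1-month to 6-month and the 6-month to 1-month case.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for one table:
#   table[cf_name] = {(tenant, timestamp): value}
# write_cf maps each tenant to the CF its new data is written to.
def migrate_tenant(table, write_cf, tenant, src_cf, dst_cf, dst_ttl, now):
    # 1) start writing new data for this tenant to the destination CF
    write_cf[tenant] = dst_cf

    cutoff = now - dst_ttl
    tenant_keys = [k for k in table[src_cf] if k[0] == tenant]
    for key in tenant_keys:
        # 3) purge the tenant's old data from the source CF
        value = table[src_cf].pop(key)
        # 2) copy only what the new retention policy would still keep
        if key[1] >= cutoff:
            table[dst_cf][key] = value


now = datetime(2011, 3, 18)
table = {
    "cf_1m": {},
    "cf_6m": {
        ("t1", now - timedelta(days=100)): "old",
        ("t1", now - timedelta(days=5)): "new",
        ("t2", now - timedelta(days=5)): "other",
    },
}
write_cf = {"t1": "cf_6m", "t2": "cf_6m"}

# Tenant t1 switches from 6-month to 1-month retention.
migrate_tenant(table, write_cf, "t1", "cf_6m", "cf_1m", timedelta(days=30), now)
```

After the call, only t1's 5-day-old row lives in cf_1m, the 100-day-old row 
is gone, and t2's data in cf_6m is untouched.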

To answer Mike's questions from his other reply:

> What's the data access patterns? Are they discrete between tenants?
> As long as the data access is discrete between tenants and the tenants write
> to only one bucket, you can do what you suggest.

Yes, data for a given tenant would be written to just 1 of those CFs.

> But here's something to consider...
> You are going to want to know your tenant's retention policy before you
> attempt to get the data. This means you read from one column family when you
> do your get() and not all of them, right? ;-)

Yes, when reading the data I'd know the tenant's retention policy and based on 
that I'd know from which CF to get the data.
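
As a rough sketch of that lookup (Python, with made-up policy labels and CF 
names, and tenant policies held in a plain dict rather than wherever they 
would really be stored):

```python
# Hypothetical mapping from retention policy to column family name.
RETENTION_TO_CF = {
    "1m": "cf_1m",
    "6m": "cf_6m",
    "1y": "cf_1y",
    "forever": "cf_forever",
}

def cf_for_tenant(tenant_policies, tenant):
    """Return the single CF that holds this tenant's data."""
    policy = tenant_policies[tenant]
    return RETENTION_TO_CF[policy]


tenant_policies = {"acme": "6m", "initech": "1m"}
cf = cf_for_tenant(tenant_policies, "acme")
```

Both the write path and the get() would go through the same function, so a 
read only ever touches one CF.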


So my question here is: How many such CFs would it be wise to have? 2? 3? 6?


> With respect to your discussion on region splits..
> So you're saying that if one CF splits then all of the CFs are affected and
> split as well?

http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and 
compactions, not splits, but from what I understand flushes can trigger splits 
because they increase the aggregate size of MapFiles, which at some point 
causes Region splitting.  Please correct me if I'm wrong. :)

So this is also what I wanted to verify.  As you can imagine, there will likely 
be more tenants with a 1-month data retention policy than with 1-year or 
"forever" data retention.  So that 1-month CF will grow much more quickly and, 
if I understand the above section in the HBase book correctly, it means that it 
will cause all other CFs' files to split (even if they are not big enough yet), 
which means more disk and network IO.

That is, if all those CFs are in the same table.  If they are in different 
tables then this would not happen?

Thanks,
Otis



> > Date: Thu, 17 Mar 2011 11:26:35 -0400
> > Subject: Re: Suggested and max number of CFs per table
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]
> > 
> > Otis,
> > 
> > Perhaps your biggest issue will be the need to disable the table to add a
> > new CF. So effectively you need to bring down the application to move in a
> > new tenant.
> > 
> > Another thing with multiple CFs is that if one CF tends to get
> > disproportionally more data, you will get a lot of region splitting, and
> > the other CFs will have HFiles for a region that are very small.
> > 
> > I think the only reasonable use of CFs is if you really need row-level
> > atomicity across CFs. Otherwise just use multiple tables.
> > 
> > 
> > On Thu, Mar 17, 2011 at 2:30 AM, Otis Gospodnetic <
> > [email protected]> wrote:
> > 
> > > Hi,
> > >
> > > My Q is around the suggested or maximum number of CFs per table (see
> > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > >
> > > Consider the following use-case.
> > > * A multi-tenant system.
> > > * All tenants write data to the same table.
> > > * Tenants have different data retention policies.
> > >
> > > For the above use case I thought one could then just have different CFs
> > > with different TTLs because Stack suggested relying on HBase's ability to
> > > purge old rows by applying CF-specific TTLs:
> > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > These CFs would have the same set of columns, just different TTLs.  Then
> > > tenants who want to keep only last 1 month's worth of data go to the CF
> > > where TTL=1 month, tenants who want to keep last 6 months of data go to
> > > CF where TTL=6 months, and so on.  However, tenants are not going to be
> > > evenly distributed - there will be more tenants with shorter data
> > > retention periods, which means the CFs where these tenants have their
> > > data will grow faster.
> > >
> > > If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs
> > > correctly, the advice is not to have more than 2-3 CFs per table?
> > > And what happens if I have say 6 CFs per table?
> > >
> > > Again if I read the above page correctly, the problem is that uneven data
> > > distribution will mean that whenever 1 of my CFs needs to be flushed, the
> > > remaining 5 CFs will also get flushed at the same time, and this may (or
> > > will?) trigger compaction for all CFs' files creating a sudden IO hit?
> > >
> > > Is there a good solution for this problem?
> > > Should one then have 6 different tables, each with just 1 CF instead of
> > > having 1 table with 6 CFs?
> > >
> > > Thanks,
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem  search :: http://search-lucene.com/
> > >
> > >