Otis, you sure are busy blogging. ;-)

Ok but to answer your question... you want as few column families as possible.

When we first started looking at HBase, we tried to view the column families as 
if they were relational tables and the key was a foreign key joining the two 
tables.
(It's actually not a bad way for RDBMS data modelers to look at a 
column-oriented database for the first time...)

The trouble is that when someone who follows 3rd normal form design models the 
data this way, you end up reading from two or more column families at the same 
time. This is where your problems begin: each column family's data is stored in 
separate files, so every extra family you read costs you a performance hit.
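
To make that cost concrete, here is a toy Python model. It is not the HBase 
client API; the class, its methods, and the family names are made up for 
illustration. Each column family gets its own store, standing in for the 
separate files, and a get counts how many stores it has to touch.

```python
# Toy model only: this is NOT the HBase client API.
# Each column family is kept in its own store, standing in for the
# separate files kept per column family.

class ToyTable:
    def __init__(self, families):
        # One independent store (row key -> columns) per column family.
        self.stores = {cf: {} for cf in families}
        self.store_reads = 0  # number of stores touched by gets so far

    def put(self, row, family, column, value):
        self.stores[family].setdefault(row, {})[column] = value

    def get(self, row, families=None):
        # With no families given, every store has to be consulted.
        wanted = list(self.stores) if families is None else list(families)
        result = {}
        for cf in wanted:
            self.store_reads += 1
            result[cf] = dict(self.stores[cf].get(row, {}))
        return result


# Hypothetical "normalized" layout: two families acting like two tables.
table = ToyTable(["customer", "orders"])
table.put("row1", "customer", "name", "Alice")
table.put("row1", "orders", "count", 3)

table.get("row1")                  # reads the whole row: both stores
reads_for_full_row = table.store_reads

table.store_reads = 0
table.get("row1", ["customer"])    # restricted to one family: one store
reads_for_one_family = table.store_reads
```

In this sketch the full-row get touches two stores and the restricted get 
touches one; the more families a single read spans, the more stores it visits.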

With respect to your example... 
What are the data access patterns? Are they discrete between tenants?
As long as the data access is discrete between tenants and the tenants write to 
only one bucket, you can do what you suggest.
But here's something to consider...
You are going to want to know your tenant's retention policy before you attempt 
to get the data. This means you read from one column family when you do your 
get() and not all of them, right? ;-)
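
A minimal sketch of that lookup, in plain Python rather than HBase client 
code. The tenant ids, the policy table, and the family naming scheme are all 
assumptions for illustration; the point is only that the retention policy 
picks exactly one column family before any read happens.

```python
# Hypothetical names throughout: tenant ids, the policy table, and the
# "ttl_<N>m" family naming scheme are assumptions, not anything from HBase.

RETENTION_MONTHS = {
    "tenant_a": 1,   # keeps the last 1 month of data
    "tenant_b": 6,   # keeps the last 6 months of data
}

def family_for(tenant):
    """Return the single column family that holds this tenant's data."""
    months = RETENTION_MONTHS[tenant]
    return "ttl_%dm" % months

def families_to_read(tenant):
    # Exactly one family per get, never the whole set.
    return [family_for(tenant)]
```

With the real client you would pass that one family to your get() so the other 
families are never touched.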

HTH

-Mike



> Date: Wed, 16 Mar 2011 23:30:14 -0700
> From: [email protected]
> Subject: Suggested and max number of CFs per table
> To: [email protected]
> 
> Hi,
> 
> My Q is around the suggested or maximum number of CFs per table (see 
> http://hbase.apache.org/book/schema.html#number.of.cfs )
> 
> Consider the following use-case.
> * A multi-tenant system.
> * All tenants write data to the same table.
> * Tenants have different data retention policies.
> 
> For the above use case I thought one could then just have different CFs with 
> different TTLs because Stack suggested relying on HBase's ability to purge old 
> rows by applying CF-specific TTLs: http://search-hadoop.com/m/VAeb52cvWHV.  
> These CFs would have the same set of columns, just different TTLs.  Then 
> tenants who want to keep only last 1 month's worth of data go to the CF where 
> TTL=1 month, tenants who want to keep last 6 months of data go to CF where 
> TTL=6 months, and so on.  However, tenants are not going to be evenly 
> distributed - there will be more tenants with shorter data retention periods, 
> which means the CFs where these tenants have their data will grow faster.
> 
> If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs 
> correctly, the advice is not to have more than 2-3 CFs per table?
> And what happens if I have say 6 CFs per table?
> 
> Again if I read the above page correctly, the problem is that uneven data 
> distribution will mean that whenever 1 of my CFs needs to be flushed, the 
> remaining 5 CFs will also get flushed at the same time, and this may (or 
> will?) trigger compaction for all CFs' files creating a sudden IO hit?
> 
> Is there a good solution for this problem?
> Should one then have 6 different tables, each with just 1 CF instead of 
> having 1 table with 6 CFs?
> 
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
                                          
