Hi,
> Patrick,
>
> Perhaps I misunderstood Otis' design.
>
> I thought he'd create the CF based on duration.
> So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> So that you set up the table once with all CFs.
> Then you'd write the data to one and only one of those buckets.

That's right.

> The only time you'd have a problem is if you have a tenant who switches
> their retention policy.
>
> Although you could move data still in a CF so that you still only query one
> CF for data.

That's right. Say a tenant decides to switch from keeping his data for 1 month to keeping it for 6 months. Then we'd have to:

1) start writing new data for this tenant to the 6-month CF
2) copy this tenant's old data from 1-month CF to the 6-month CF
3) purge/delete old data for this tenant from 1-month CF

If the tenant wants to go from 6-months to 1-month then we'd additionally want to limit copying in step 2) above to just the last 1 month of data and drop the rest.

To answer Mike's questions from his other reply:

> What's the data access patterns? Are they discrete between tenants?
> As long as the data access is discrete between tenants and the tenants write
> to only one bucket, you can do what you suggest.

Yes, data for a given tenant would be written to just 1 of those CFs.

> But here's something to consider...
> You are going to want to know your tenant's retention policy before you
> attempt to get the data. This means you read from one column family when you
> do your get() and not all of them, right? ;-)

Yes, when reading the data I'd know the tenant's retention policy and based on that I'd know from which CF to get the data.

So my question here is: How many such CFs would it be wise to have? 2? 3? 6?

> With respect to your discussion on region splits..
> So you're saying that if one CF splits then all of the CFs are affected and
> split as well?

http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and compactions, not splits, but from what I understand flushes can trigger splits because they increase the aggregate size of MapFiles, which at some point causes Region splitting. Please correct me if I'm wrong. :)

So this is also what I wanted to verify. As you can imagine, there will likely be more tenants with a 1-month data retention policy than with 1-year or "forever" data retention. So that 1-month CF will grow much more quickly and, if I understand the above section in the HBase book correctly, it means that it will cause all other CFs' files to split (even if they are not big enough yet), which means more disk and network IO.

That is, if all those CFs are in the same table. If they are in different tables then this would not happen?

Thanks,
Otis

> > Date: Thu, 17 Mar 2011 11:26:35 -0400
> > Subject: Re: Suggested and max number of CFs per table
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]
> >
> > Otis,
> >
> > Perhaps your biggest issue will be the need to disable the table to add a
> > new CF. So effectively you need to bring down the application to move in a
> > new tenant.
> >
> > Another thing with multiple CFs is that if one CF tends to get
> > disproportionally more data, you will get a lot of region splitting, and the
> > other CFs will have HFiles for a region that are very small.
> >
> > I think the only reasonable use of CFs is if you really need row-level
> > atomicity across CFs. Otherwise just use multiple tables.
> >
> >
> > On Thu, Mar 17, 2011 at 2:30 AM, Otis Gospodnetic <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > My Q is around the suggested or maximum number of CFs per table (see
> > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > >
> > > Consider the following use-case.
> > > * A multi-tenant system.
> > > * All tenants write data to the same table.
> > > * Tenants have different data retention policies.
> > >
> > > For the above use case I thought one could then just have different CFs
> > > with different TTLs because Stack suggested relying on HBase's ability to
> > > purge old rows by applying CF-specific TTLs:
> > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > These CFs would have the same set of columns, just different TTLs. Then
> > > tenants who want to keep only last 1 month's worth of data go to the CF
> > > where TTL=1 month, tenants who want to keep last 6 months of data go to CF
> > > where TTL=6 months, and so on. However, tenants are not going to be evenly
> > > distributed - there will be more tenants with shorter data retention
> > > periods, which means the CFs where these tenants have their data will grow
> > > faster.
> > >
> > > If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs
> > > correctly, the advice is not to have more than 2-3 CFs per table?
> > > And what happens if I have say 6 CFs per table?
> > >
> > > Again if I read the above page correctly, the problem is that uneven data
> > > distribution will mean that whenever 1 of my CFs needs to be flushed, the
> > > remaining 5 CFs will also get flushed at the same time, and this may (or
> > > will?) trigger compaction for all CFs' files creating a sudden IO hit?
> > >
> > > Is there a good solution for this problem?
> > > Should one then have 6 different tables, each with just 1 CF instead of
> > > having 1 table with 6 CFs?
> > >
> > > Thanks,
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> >
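
For reference, here is a rough sketch of the per-tenant CF routing and the three-step policy switch described in this thread. It is plain Python over an in-memory dict, not HBase client code: the CF names (m1, m6, y1, forever), the Table class and the switch_policy helper are all made up for illustration, and a real migration would have to cope with writes arriving while the copy is running.

```python
from dataclasses import dataclass, field

DAY = 24 * 60 * 60  # seconds

# Hypothetical CF names mapped to their TTL in seconds (None = keep forever).
CF_TTL_SECONDS = {
    "m1": 30 * DAY,
    "m6": 180 * DAY,
    "y1": 365 * DAY,
    "forever": None,
}


@dataclass
class Table:
    """In-memory stand-in for one table: cf -> {(tenant, timestamp): value}."""
    cells: dict = field(default_factory=lambda: {cf: {} for cf in CF_TTL_SECONDS})
    tenant_cf: dict = field(default_factory=dict)  # tenant -> CF currently in use

    def put(self, tenant, ts, value):
        # A tenant's data goes to exactly one CF, chosen by its retention policy.
        self.cells[self.tenant_cf[tenant]][(tenant, ts)] = value

    def scan(self, tenant):
        # Reads also touch only the single CF that matches the tenant's policy.
        cf = self.tenant_cf[tenant]
        return sorted((ts, v) for (t, ts), v in self.cells[cf].items() if t == tenant)

    def switch_policy(self, tenant, new_cf, now):
        old_cf = self.tenant_cf[tenant]
        if new_cf == old_cf:
            return
        # 1) start writing new data for this tenant to the new CF
        self.tenant_cf[tenant] = new_cf
        ttl = CF_TTL_SECONDS[new_cf]
        old_keys = [key for key in self.cells[old_cf] if key[0] == tenant]
        # 2) copy old data, skipping anything already older than the new TTL
        for key in old_keys:
            if ttl is None or now - key[1] <= ttl:
                self.cells[new_cf][key] = self.cells[old_cf][key]
        # 3) purge this tenant's data from the old CF
        for key in old_keys:
            del self.cells[old_cf][key]
```

Going from a longer to a shorter retention period falls out of the TTL check in step 2: rows older than the new TTL are simply not copied and then disappear with the purge in step 3.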
