Otis,

Patrick raised an issue that might be of concern... region splits.

But barring that... what makes the most sense on retention policies?
The point is that it's a business issue that will be driving the logic.

Depending on a clarification from Patrick or JGray or JDCryans... you may
want to consider separate tables using the same key.

You could also use a single table and run a sweeper every night that
deletes the rows, and then do a major compaction after hours. (Again you
would have to account for the maintenance window.)

HTH

-Mike

> Date: Thu, 17 Mar 2011 10:38:11 -0700
> From: [email protected]
> Subject: Re: Suggested and max number of CFs per table
> To: [email protected]
>
> Hi,
>
> > Patrick,
> >
> > Perhaps I misunderstood Otis' design.
> >
> > I thought he'd create the CF based on duration.
> > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > So that you set up the table once with all CFs.
> > Then you'd write the data to one and only one of those buckets.
>
> That's right.
>
> > The only time you'd have a problem is if you have a tenant who
> > switches their retention policy.
> >
> > Although you could move data still in a CF so that you still only
> > query one CF for data.
>
> That's right. Say a tenant decides to switch from keeping his data for
> 1 month to keeping it for 6 months.
> Then we'd have to:
> 1) start writing new data for this tenant to the 6-month CF
> 2) copy this tenant's old data from 1-month CF to the 6-month CF
> 3) purge/delete old data for this tenant from 1-month CF
>
> If the tenant wants to go from 6-months to 1-month then we'd
> additionally want to limit copying in step 2) above to just the last
> 1 month of data and drop the rest.
>
> To answer Mike's questions from his other reply:
>
> > What are the data access patterns? Are they discrete between tenants?
> > As long as the data access is discrete between tenants and the
> > tenants write to only one bucket, you can do what you suggest.
>
> Yes, data for a given tenant would be written to just 1 of those CFs.
>
> > But here's something to consider...
> > You are going to want to know your tenant's retention policy before
> > you attempt to get the data. This means you read from one column
> > family when you do your get() and not all of them, right? ;-)
>
> Yes, when reading the data I'd know the tenant's retention policy and
> based on that I'd know from which CF to get the data.
>
>
> So my question here is: How many such CFs would it be wise to have?
> 2? 3? 6?
>
> > With respect to your discussion on region splits..
> > So you're saying that if one CF splits then all of the CFs are
> > affected and split as well?
>
> http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes
> and compactions, not splits, but from what I understand flushes can
> trigger splits because they increase the aggregate size of MapFiles,
> which at some point causes Region splitting. Please correct me if I'm
> wrong. :)
>
> So this is also what I wanted to verify. As you can imagine, there are
> likely to be more tenants with a 1-month data retention policy than
> with 1-year or "forever" data retention. So that 1-month CF will grow
> much more quickly and if I understand the above section in the HBase
> book correctly, it means that it will cause all other CFs' files to
> split (even if they are not big enough yet), which means more disk and
> network IO.
>
> That is, if all those CFs are in the same table. If they are in
> different tables then this would not happen?
>
> Thanks,
> Otis
>
>
> > > Date: Thu, 17 Mar 2011 11:26:35 -0400
> > > Subject: Re: Suggested and max number of CFs per table
> > > From: [email protected]
> > > To: [email protected]
> > > CC: [email protected]
> > >
> > > Otis,
> > >
> > > Perhaps your biggest issue will be the need to disable the table to
> > > add a new CF. So effectively you need to bring down the application
> > > to move in a new tenant.
> > >
> > > Another thing with multiple CFs is that if one CF tends to get
> > > disproportionally more data, you will get a lot of region
> > > splitting, and the other CFs will have HFiles for a region that are
> > > very small.
> > >
> > > I think the only reasonable use of CFs is if you really need
> > > row-level atomicity across CFs. Otherwise just use multiple tables.
> > >
> > >
> > > On Thu, Mar 17, 2011 at 2:30 AM, Otis Gospodnetic <
> > > [email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > My Q is around the suggested or maximum number of CFs per table
> > > > (see http://hbase.apache.org/book/schema.html#number.of.cfs )
> > > >
> > > > Consider the following use-case.
> > > > * A multi-tenant system.
> > > > * All tenants write data to the same table.
> > > > * Tenants have different data retention policies.
> > > >
> > > > For the above use case I thought one could then just have
> > > > different CFs with different TTLs because Stack suggested relying
> > > > on HBase's ability to purge old rows by applying CF-specific TTLs:
> > > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > > These CFs would have the same set of columns, just different TTLs.
> > > > Then tenants who want to keep only last 1 month's worth of data go
> > > > to the CF where TTL=1 month, tenants who want to keep last 6
> > > > months of data go to CF where TTL=6 months, and so on. However,
> > > > tenants are not going to be evenly distributed - there will be
> > > > more tenants with shorter data retention periods, which means the
> > > > CFs where these tenants have their data will grow faster.
> > > >
> > > > If I'm reading
> > > > http://hbase.apache.org/book/schema.html#number.of.cfs correctly,
> > > > the advice is not to have more than 2-3 CFs per table?
> > > > And what happens if I have say 6 CFs per table?
> > > >
> > > > Again if I read the above page correctly, the problem is that
> > > > uneven data distribution will mean that whenever 1 of my CFs needs
> > > > to be flushed, the remaining 5 CFs will also get flushed at the
> > > > same time, and this may (or will?) trigger compaction for all CFs'
> > > > files creating a sudden IO hit?
> > > >
> > > > Is there a good solution for this problem?
> > > > Should one then have 6 different tables, each with just 1 CF
> > > > instead of having 1 table with 6 CFs?
> > > >
> > > > Thanks,
> > > > Otis
> > > > ----
> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > Lucene ecosystem search :: http://search-lucene.com/
> > > >
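
For reference, the three-step retention switch Otis lists in the thread (redirect new writes, copy the old data, purge the old bucket) can be sketched roughly as below. This is only an in-memory model in Python, not the HBase client API: the TenantStore class, the "1m"/"6m" bucket names, and the 30/180-day durations are all made up for illustration. Each bucket stands in for one CF (or one table) with its own TTL.

```python
from datetime import datetime, timedelta

# Hypothetical retention buckets; each stands in for one CF (or one
# table) with its own TTL. Names and durations are illustrative only.
RETENTION = {
    "1m": timedelta(days=30),
    "6m": timedelta(days=180),
}


class TenantStore:
    """In-memory stand-in for the table. This is NOT the HBase API."""

    def __init__(self):
        # bucket name -> tenant id -> list of (timestamp, value)
        self.buckets = {name: {} for name in RETENTION}
        # tenant id -> bucket currently used for that tenant's reads/writes
        self.policy = {}

    def write(self, tenant, ts, value):
        # A tenant writes to one and only one bucket.
        bucket = self.policy[tenant]
        self.buckets[bucket].setdefault(tenant, []).append((ts, value))

    def read(self, tenant):
        # The policy is looked up first, so only one bucket is read.
        return list(self.buckets[self.policy[tenant]].get(tenant, []))

    def switch_policy(self, tenant, new_bucket, now):
        old_bucket = self.policy[tenant]
        if new_bucket == old_bucket:
            return
        # 1) start writing new data for this tenant to the new bucket
        self.policy[tenant] = new_bucket
        # 2) copy old data, limited to the new retention window; when the
        #    window gets longer this filter simply keeps everything that
        #    is still present in the old bucket
        cutoff = now - RETENTION[new_bucket]
        old_rows = self.buckets[old_bucket].get(tenant, [])
        kept = [(ts, v) for ts, v in old_rows if ts >= cutoff]
        self.buckets[new_bucket].setdefault(tenant, []).extend(kept)
        # 3) purge this tenant's data from the old bucket
        self.buckets[old_bucket].pop(tenant, None)


# Example: a tenant moves from 6-month to 1-month retention.
now = datetime(2011, 3, 17)
store = TenantStore()
store.policy["acme"] = "6m"
store.write("acme", now - timedelta(days=100), "old")
store.write("acme", now - timedelta(days=5), "recent")
store.switch_policy("acme", "1m", now)
print(store.read("acme"))
```

In a real deployment steps 2 and 3 would presumably be a scan plus puts and deletes (or a MapReduce job) rather than list operations, and reads arriving between steps 1 and 2 would briefly miss the old data unless the reader also checks the old bucket during the migration.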
