Hi,

> Patrick raised an issue that might be of concern... region splits.

Right. And if I understand correctly, if I want to have multiple CFs that
grow unevenly, these region splits are something I then have to be willing
to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that it's a business issue that will be driving the logic.

The exact business requirement is not defined yet. Say 3-4 retention
policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may
> want to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that
> deletes the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right. The reason I am even thinking about CF-per-retention-policy is that
I am afraid of a big and expensive nightly scan-and-delete. That said, I
don't actually know whether this scan will be expensive, or how expensive.
So I'm trying to understand the pros and cons ahead of time. Maybe I'm
prematurely optimizing, but since this feels like a big
structural/architectural change, I thought it would be worth "getting it
right" before I have lots of tenants and their data in the system.

Thank you everyone!
Otis

> HTH
>
> -Mike
>
>
> > Date: Thu, 17 Mar 2011 10:38:11 -0700
> > From: [email protected]
> > Subject: Re: Suggested and max number of CFs per table
> > To: [email protected]
> >
> > Hi,
> >
> > > Patrick,
> > >
> > > Perhaps I misunderstood Otis' design.
> > >
> > > I thought he'd create the CF based on duration.
> > > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > > So that you set up the table once with all CFs.
> > > Then you'd write the data to one and only one of those buckets.
> >
> > That's right.
> >
> > > The only time you'd have a problem is if you have a tenant who switches
> > > their retention policy.
> > >
> > > Although you could move data still in a CF so that you still only query
> > > one CF for data.
> >
> > That's right. Say a tenant decides to switch from keeping his data for
> > 1 month to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from 1-month CF
> >
> > If the tenant wants to go from 6-months to 1-month then we'd additionally
> > want to limit copying in step 2) above to just the last 1 month of data
> > and drop the rest.
> >
> > To answer Mike's questions from his other reply:
> >
> > > What are the data access patterns? Are they discrete between tenants?
> > > As long as the data access is discrete between tenants and the tenants
> > > write to only one bucket, you can do what you suggest.
> >
> > Yes, data for a given tenant would be written to just 1 of those CFs.
> >
> > > But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you
> > > attempt to get the data. This means you read from one column family
> > > when you do your get() and not all of them, right? ;-)
> >
> > Yes, when reading the data I'd know the tenant's retention policy and
> > based on that I'd know from which CF to get the data.
> >
> > So my question here is: How many such CFs would it be wise to have?
> > 2? 3? 6?
> >
> > > With respect to your discussion on region splits..
> > > So you're saying that if one CF splits then all of the CFs are affected
> > > and split as well?
> >
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes
> > and compactions, not splits, but from what I understand flushes can
> > trigger splits because they increase the aggregate size of MapFiles,
> > which at some point causes Region splitting. Please correct me if I'm
> > wrong. :)
> >
> > So this is also what I wanted to verify. As you can imagine, there will
> > likely be more tenants with a 1-month data retention policy than with
> > 1-year or "forever" data retention. So that 1-month CF will grow much
> > more quickly and, if I understand the above section in the HBase book
> > correctly, it means that it will cause all other CFs' files to split
> > (even if they are not big enough yet), which means more disk and
> > network IO.
> >
> > That is, if all those CFs are in the same table. If they are in
> > different tables then this would not happen?
> >
> > Thanks,
> > Otis
> >
> > > > Date: Thu, 17 Mar 2011 11:26:35 -0400
> > > > Subject: Re: Suggested and max number of CFs per table
> > > > From: [email protected]
> > > > To: [email protected]
> > > > CC: [email protected]
> > > >
> > > > Otis,
> > > >
> > > > Perhaps your biggest issue will be the need to disable the table to
> > > > add a new CF. So effectively you need to bring down the application
> > > > to move in a new tenant.
> > > >
> > > > Another thing with multiple CFs is that if one CF tends to get
> > > > disproportionally more data, you will get a lot of region splitting,
> > > > and the other CFs will have HFiles for a region that are very small.
> > > >
> > > > I think the only reasonable use of CFs is if you really need
> > > > row-level atomicity across CFs. Otherwise just use multiple tables.
> > > >
> > > >
> > > > On Thu, Mar 17, 2011 at 2:30 AM, Otis Gospodnetic
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > My Q is around the suggested or maximum number of CFs per table
> > > > > (see http://hbase.apache.org/book/schema.html#number.of.cfs )
> > > > >
> > > > > Consider the following use-case.
> > > > > * A multi-tenant system.
> > > > > * All tenants write data to the same table.
> > > > > * Tenants have different data retention policies.
> > > > >
> > > > > For the above use case I thought one could then just have
> > > > > different CFs with different TTLs because Stack suggested relying
> > > > > on HBase's ability to purge old rows by applying CF-specific TTLs:
> > > > > http://search-hadoop.com/m/VAeb52cvWHV.
> > > > > These CFs would have the same set of columns, just different TTLs.
> > > > > Then tenants who want to keep only last 1 month's worth of data go
> > > > > to the CF where TTL=1 month, tenants who want to keep last 6 months
> > > > > of data go to CF where TTL=6 months, and so on. However, tenants
> > > > > are not going to be evenly distributed - there will be more tenants
> > > > > with shorter data retention periods, which means the CFs where
> > > > > these tenants have their data will grow faster.
> > > > >
> > > > > If I'm reading
> > > > > http://hbase.apache.org/book/schema.html#number.of.cfs correctly,
> > > > > the advice is not to have more than 2-3 CFs per table?
> > > > > And what happens if I have say 6 CFs per table?
> > > > >
> > > > > Again if I read the above page correctly, the problem is that
> > > > > uneven data distribution will mean that whenever 1 of my CFs needs
> > > > > to be flushed, the remaining 5 CFs will also get flushed at the
> > > > > same time, and this may (or will?) trigger compaction for all CFs'
> > > > > files creating a sudden IO hit?
> > > > >
> > > > > Is there a good solution for this problem?
> > > > > Should one then have 6 different tables, each with just 1 CF
> > > > > instead of having 1 table with 6 CFs?
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > ----
> > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > > > Lucene ecosystem search :: http://search-lucene.com/
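
For reference, the CF-per-retention-policy routing and the three
policy-switch steps discussed in this thread can be sketched roughly as
below. This is only a minimal sketch under stated assumptions: it uses an
in-memory dict as a stand-in for a table and does not call the HBase client
API. The CF names (m1, m6, y1), their TTLs, and the helpers InMemoryTable,
cf_for and switch_policy are all hypothetical and exist only for
illustration.

```python
from datetime import datetime, timedelta

# Hypothetical retention buckets (CF name -> TTL). Names and values are
# assumptions for illustration, not taken from any real schema.
RETENTION_CFS = {
    "m1": timedelta(days=30),
    "m6": timedelta(days=180),
    "y1": timedelta(days=365),
}


class InMemoryTable:
    """Toy stand-in for a table: {cf: {(tenant, row_key): (ts, value)}}."""

    def __init__(self, cfs):
        self.data = {cf: {} for cf in cfs}

    def put(self, cf, tenant, key, ts, value):
        self.data[cf][(tenant, key)] = (ts, value)

    def get(self, cf, tenant, key):
        return self.data[cf].get((tenant, key))


def cf_for(policies, tenant):
    """Each tenant reads from and writes to exactly one CF."""
    return policies[tenant]


def switch_policy(table, policies, tenant, new_cf, now):
    """Move a tenant to another retention CF, following the three steps."""
    old_cf = policies[tenant]
    if old_cf == new_cf:
        return
    # 1) New writes for this tenant go to the new CF from now on.
    policies[tenant] = new_cf
    # 2) Copy old data, keeping only rows inside the new retention window
    #    (this is what limits the copy when going from 6 months to 1 month).
    cutoff = now - RETENTION_CFS[new_cf]
    old_rows = {k: v for k, v in table.data[old_cf].items() if k[0] == tenant}
    for (t, key), (ts, value) in old_rows.items():
        if ts >= cutoff:
            table.put(new_cf, t, key, ts, value)
    # 3) Purge this tenant's data from the old CF.
    for k in old_rows:
        del table.data[old_cf][k]
```

In a real deployment steps 2) and 3) would be a scan plus puts and deletes
against the cluster, so their IO cost is exactly the kind of thing this
thread is trying to weigh; the sketch only shows the bookkeeping.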
