Re: Reasonable range for the max number of tables?
We use Cassandra for a multi-tenant application. Each tenant has its own set of tables, and we have 1592 tables in total in our production Cassandra cluster. It's running well and doesn't have any memory consumption issues, but the challenge confronting us is schema changes. We have a large number of tables, and a significant number of them have as many as 50 columns, which totals 33099 rows in the schema_columns table in the system keyspace. Every time we do a schema change for all of our tenants, the whole cluster gets very busy, and applications running against it need to be shut down for several hours to accommodate the change.

The way we solve this is with a new feature we developed called templates. There is a detailed description in the JIRA issue we opened: https://issues.apache.org/jira/browse/CASSANDRA-7643

We have some performance results from our 15-node test cluster. Normally, creating 400 tables takes hours for all the migration stage tasks to complete, but if we create 400 tables with templates, *it just takes 1 to 2 seconds*. It also works great for ALTER TABLE.

[graphs: migration stage completion time vs. number of existing tables, with and without templates]

"table #" in the graphs means the number of existing tables in user keyspaces. We created 400 more tables and measured the time it took for all tasks in the migration stage to complete. We also measured the migration task completion time for adding one column to a template, which adds the column to all the column families using that template.

We believe what we proposed here can be very useful for other people in the Cassandra community as well. We have attached the patch to the JIRA; you can also read the community feedback there.

Thanks,
Cheng

On Tue, Aug 5, 2014 at 5:43 AM, Michal Michalski michal.michal...@boxever.com wrote:

> > - Use a keyspace per customer
> >
> > These effectively amount to the same thing and they both fall foul to the limit in the number of column families so do not scale.
>
> But then you can scale by moving some of the customers to a new cluster easily. If you keep everything in a single keyspace or - worse - if you do your multitenancy by prefixing row keys with customer ids of some kind, it won't be that easy, as you wrote later in your e-mail.
>
> M.
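A rough way to see why templates help: without them, a schema change is one DDL statement per tenant table, so migration work grows with the number of tenants, while a template collapses it to one statement per template. A minimal arithmetic sketch (the function names and tenant counts are illustrative, not the actual API of the CASSANDRA-7643 patch):

```python
# Illustrative arithmetic only: counts the DDL statements a migration
# needs with and without a shared-template feature.

def per_table_migrations(num_tenants, tables_per_tenant):
    # Without templates: every tenant table gets its own ALTER TABLE.
    return num_tenants * tables_per_tenant

def template_migrations(num_templates):
    # With templates: one ALTER per template covers all tenants using it.
    return num_templates

# E.g. 100 tenants, each with 16 tables built from 16 shared templates:
print(per_table_migrations(100, 16))  # 1600 statements
print(template_migrations(16))        # 16 statements
```

The cluster-wide schema agreement work scales with the first number, not the second, which is consistent with the 400-table creation dropping from hours to seconds.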
Re: Reasonable range for the max number of tables?
Is there any mention of this limitation anywhere in the Cassandra documentation? I don't see it mentioned in the 'Anti-patterns in Cassandra' section of the DataStax 2.0 documentation or anywhere else.

When starting out with Cassandra as a store for a multi-tenant application, it seems very attractive to segregate data for each tenant using a tenant-specific keyspace, each with its own set of tables. It's not until you start browsing through forums such as this that you find out that it isn't going to scale above a few tenants.

If you want to be able to segregate customer data in Cassandra, is it the accepted practice to have multiple Cassandra installations?

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596106.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Reasonable range for the max number of tables?
Hi Phil,

In theory, the max number of column families would be in the low hundreds. In practice the limit is related to the amount of heap you have, as each column family will consume 1 MB of heap due to arena allocation.

To segregate customer data, you could:
- Use customer-specific column families under a single keyspace
- Use a keyspace per customer
- Use the same column families and have a column that identifies the customer. On the application layer, ensure that there are sufficient checks so one customer can't read another customer's data

Mark

On Tue, Aug 5, 2014 at 9:09 AM, Phil Luckhurst phil.luckhu...@powerassure.com wrote:

> Is there any mention of this limitation anywhere in the Cassandra documentation? ...
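Mark's 1 MB-per-table figure makes the heap ceiling easy to estimate with back-of-envelope arithmetic. A minimal sketch (the constant is the figure cited in this thread, not a measured value, and ignores everything else tables cost):

```python
ARENA_MB_PER_TABLE = 1  # per-column-family heap cost cited in the thread

def table_heap_overhead_mb(num_tables):
    # Fixed heap consumed by memtable arena allocation alone,
    # before any actual data is written.
    return num_tables * ARENA_MB_PER_TABLE

# Table counts that come up in this thread:
for n in (200, 720, 1592):
    print("%4d tables -> ~%d MB of heap" % (n, table_heap_overhead_mb(n)))
```

On a node with an 8 GB heap, ~1.6 GB of fixed overhead for 1592 tables is a large slice gone before compaction, caches, or request handling get any.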
Re: Reasonable range for the max number of tables?
Hi Mark,

Mark Reddy wrote:
> To segregate customer data, you could:
> - Use customer-specific column families under a single keyspace
> - Use a keyspace per customer

These effectively amount to the same thing, and they both fall foul of the limit on the number of column families, so they do not scale.

Mark Reddy wrote:
> - Use the same column families and have a column that identifies the customer. On the application layer ensure that there are sufficient checks so one customer can't read another customer's data

And while this gets around the column family limit, it does not allow the same level of data segregation. For example, with a separate keyspace or column families it is trivial to remove a single customer's data or move that data to another system. With one set of column families for all customers, these types of actions become much more difficult as any change impacts all customers - but perhaps that's the price we have to pay to scale.

And I still think this needs to be made more prominent in the documentation.

Thanks
Phil

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Reasonable-range-for-the-max-number-of-tables-tp7596094p7596119.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Reasonable range for the max number of tables?
Multi-tenancy remains a challenge - for most technologies. Yes, you can do what you suggest, but... you need to exercise great care, and test and provision your cluster with great care. It's not a free resource that scales wildly in all directions with no forethought. It is something that does work, sort of, but it wasn't one of the design goals or core strengths of Cassandra. IOW, it was/is more of a side effect than a core pattern.

"Anti-pattern" simply means that it is not guaranteed to be a full-fledged, first-class feature. It means you can do it, and if it works well for your particular use case, great, but don't complain too loudly here if it doesn't. That said, anybody who has great success - or great failure - with multi-tenancy in Cassandra, or any other technology, should definitely share their experiences here.

The bottom line is that dozens or low hundreds remains the recommended limit for tables in a single Cassandra cluster. Not a hard limit, just a recommendation.

Multi-tenancy is an area of great interest, so I suspect Cassandra - and all other technologies - will see a lot of evolution in this area in the coming years.

-- Jack Krupansky

-----Original Message----- From: Phil Luckhurst Sent: Tuesday, August 5, 2014 4:09 AM To: cassandra-u...@incubator.apache.org Subject: Re: Reasonable range for the max number of tables?

> Is there any mention of this limitation anywhere in the Cassandra documentation? ...
Re: Reasonable range for the max number of tables?
> > - Use a keyspace per customer
>
> These effectively amount to the same thing and they both fall foul to the limit in the number of column families so do not scale.

But then you can scale by moving some of the customers to a new cluster easily. If you keep everything in a single keyspace or - worse - if you do your multitenancy by prefixing row keys with customer ids of some kind, it won't be that easy, as you wrote later in your e-mail.

M.

Kind regards,
Michał Michalski
michal.michal...@boxever.com

On 5 August 2014 12:36, Phil Luckhurst phil.luckhu...@powerassure.com wrote:

> Hi Mark, ... And I still think this needs to be made more prominent in the documentation. Thanks, Phil
Reasonable range for the max number of tables?
What's a reasonable range for the max number of tables?

We have an append-only table system and I've been thinking of moving them to using hourly / partitioned tables. This means I can do things like easily drop older tables if I run out of disk space. It also means that I can fadvise the most recent tables and force them into table cache.

But what's the max number of tables? If I'm doing hourly tables and have 30 days of them this would be 720 individual tables. That sounds reasonable... but what's the upper bound here?

Kevin

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
... or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
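For concreteness, the hourly scheme might look like the sketch below; the table-name pattern is hypothetical, and the retention arithmetic is where the 720 comes from:

```python
from datetime import datetime, timedelta

def hourly_table_name(ts):
    # One table per hour, e.g. events_2014080413 for 2014-08-04 13:00.
    # The "events_" prefix is made up for illustration.
    return "events_" + ts.strftime("%Y%m%d%H")

def tables_for_retention(days):
    # Total live tables if expired hourly tables are dropped on schedule.
    return days * 24

start = datetime(2014, 8, 4, 12)
print([hourly_table_name(start + timedelta(hours=h)) for h in range(2)])
# ['events_2014080412', 'events_2014080413']
print(tables_for_retention(30))  # 720
```

Expiry is then a metadata-only `DROP TABLE events_...` on the oldest name rather than a mass delete-and-compact, which is the attraction of the scheme.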
Re: Reasonable range for the max number of tables?
On Mon, Aug 4, 2014 at 12:42 PM, Kevin Burton bur...@spinn3r.com wrote:

> What's a reasonable range for the max number of tables? We have an append-only table system and I've been thinking of moving them to using hourly / partitioned tables.

Low numbers of hundreds.

> This means I can do things like easily drop older tables if I run out of disk space. It also means that I can fadvise the most recent tables and force them into table cache.

If you have to use fadvise to do this, you are almost certainly Doing It Wrong.

> But what's the max number of tables? If I'm doing hourly tables and have 30 days of them this would be 720 individual tables.

Probably pushing it.

=Rob
Re: Reasonable range for the max number of tables?
> > This means I can do things like easily drop older tables if I run out of disk space. It also means that I can fadvise the most recent tables and force them into table cache.
>
> If you have to use fadvise to do this, you are almost certainly Doing It Wrong.

Possibly... fincore is also useful. This way you can see the cache % of each file... if I'm doing something funky, the older files should not be in cache at all. And fadvise DONTNEED is your friend too.

> > But what's the max number of tables? If I'm doing hourly tables and have 30 days of them this would be 720 individual tables.
>
> Probably pushing it.

What are the bottlenecks here? Is this an HDD issue? Because we're running them on SSD. I agree that it would be significant on HDD...

Kevin
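For the DONTNEED side, Python exposes fadvise directly, so evicting a cold SSTable from the page cache can be sketched like this (Linux-specific; the helper name and the idea of applying it to old hourly tables are mine, not anything Cassandra does for you):

```python
import os

def drop_from_page_cache(path):
    # Ask the kernel to evict this file's cached pages, so hot files
    # compete less for page cache; a fincore check afterwards should
    # show the file's cached % falling toward zero.
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 means "from offset to end of file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

A cron job walking the data directories of expired hourly tables and calling this would keep the cache biased toward the recent tables, which is the effect described above.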
Re: Reasonable range for the max number of tables?
On Mon, Aug 4, 2014 at 1:35 PM, Kevin Burton bur...@spinn3r.com wrote:

> What are the bottlenecks here? Is this an HDD issue? Because we're running them on SSD. I agree that it would be significant on HDD...

Heap consumption. There are probably some JIRA tickets touching on the memory consumption of large numbers of tables, I suggest searching there... it has come up about 15 times on this list in the archives too.. :D

=Rob
Re: Reasonable range for the max number of tables?
Read https://issues.apache.org/jira/browse/CASSANDRA-5935, especially the part about “having more than dozens or hundreds of tables defined is almost certainly a Bad Idea”.

Either way, try a POC that simply creates your 720 tables and see what happens. And let us know how well it works, and whether SSD is somehow different from HDD.

It’s not so much that there is a hard “limit” as that you won't be able to sleep soundly at night knowing that you’re “pushing it” so far that suddenly, out of nowhere, after things seem fine for an extended period, everything falls apart with no warning.

-- Jack Krupansky

From: Robert Coli Sent: Monday, August 4, 2014 4:54 PM To: user@cassandra.apache.org Subject: Re: Reasonable range for the max number of tables?

> Heap consumption. There are probably some JIRA tickets touching on the memory consumption of large numbers of tables, I suggest searching there...
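The POC can be as small as generating the 720 CREATE TABLE statements and feeding them to a throwaway cluster while watching heap and migration-stage times. The sketch below only builds the DDL strings (keyspace and column names are made up); executing them is left to whatever driver you use:

```python
def poc_ddl(keyspace="poc", num_tables=720):
    # One simple table per hour-slot. A real POC would execute each
    # statement against a test cluster and monitor heap usage as the
    # table count grows.
    return [
        "CREATE TABLE %s.events_%03d (id uuid PRIMARY KEY, body text)"
        % (keyspace, i)
        for i in range(num_tables)
    ]

stmts = poc_ddl()
print(len(stmts))  # 720
print(stmts[0])    # CREATE TABLE poc.events_000 (id uuid PRIMARY KEY, body text)
```

Creating the tables a few dozen at a time, rather than all at once, also shows whether schema propagation itself becomes the bottleneck before heap does.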