Is per-table memory overhead due to SSTables or tables?
The conventional wisdom is that it's best to keep the number of tables in Cassandra in the low hundreds, since each table can use 1MB or so of heap. So if you have 1000 tables you'd have about 1GB of heap used (which is no fun).

But is this an issue with the tables themselves or with the SSTables? I think the root of this is the SSTables, as all the arena overhead will apply to the SSTables too, and more SSTables means more overhead. So by adding more tables, you end up with more SSTables, which means more heap memory.

If I'm correct, then Cassandra could benefit from table partitioning, whereby you put all values in a specific region into a specific set of tables. If you were storing log data, you could store it in hourly or daily partitions but view the table as one logical unit.

The benefit here is that you could easily drop just the oldest data. If you need to clean up data, you wouldn't have to drop the whole table, just a day's worth of the data. And since that day is just one SSTable on disk, the drop would be easy: no tombstones, just delete the whole SSTable.

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
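To make the daily-partition idea concrete, here is a minimal client-side sketch. It is not a Cassandra feature and not something proposed in the message itself: it assumes one physical table per UTC day, and the names `table_for`, `expired_tables`, and the `logs_YYYYMMDD` naming scheme are all hypothetical. Note that emulating partitions this way adds tables, so each daily table would still carry whatever per-table heap cost applies.

```python
from datetime import datetime, timedelta, timezone


def table_for(ts: datetime, prefix: str = "logs") -> str:
    """Return the (hypothetical) daily table name for a timezone-aware timestamp."""
    # Normalize to UTC so every writer agrees on the day boundary.
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y%m%d}"


def expired_tables(existing, now, retention_days, prefix="logs"):
    """Return daily tables that fall outside the retention window, oldest first."""
    cutoff = table_for(now - timedelta(days=retention_days), prefix)
    # Zero-padded YYYYMMDD suffixes sort chronologically as plain strings.
    return sorted(
        name for name in existing
        if name.startswith(prefix + "_") and name < cutoff
    )


if __name__ == "__main__":
    now = datetime(2014, 8, 8, 14, 29, tzinfo=timezone.utc)
    existing = ["logs_20140805", "logs_20140806", "logs_20140807", "logs_20140808"]
    print("write to:", table_for(now))
    print("drop:", expired_tables(existing, now, retention_days=2))
```

Dropping a whole daily table avoids per-row deletes and their tombstones, which is the property the message is after; the logical single-table view would still have to be built by the client.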
Re: Is per-table memory overhead due to SSTables or tables?
See https://issues.apache.org/jira/browse/CASSANDRA-5935

2.1 has a radically different implementation that sidesteps this (with off-heap memtables), but if you really want lots of tables now, you can have them as a trade-off against GC behavior.

The problem is not SSTables per se, but rather that there is potentially one memtable per CF (and with the slab allocator that can, and does, cost 1MB). I am not familiar enough with the code to know when a CF that isn't currently in active use would have one memtable versus none.

Note also https://issues.apache.org/jira/browse/CASSANDRA-6602 and friends; there is definitely a need for efficient discarding of old data in event streams.

On Aug 8, 2014, at 2:29 PM, Kevin Burton bur...@spinn3r.com wrote: But is this an issue with the tables themselves or the SSTables?
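As a rough illustration of the arithmetic in this exchange, the sketch below multiplies the number of tables by the roughly 1MB slab cost mentioned above. It assumes one slab per table that currently has a live memtable; since the reply leaves open whether an idle CF holds a memtable at all, that share is exposed as the hypothetical `active_fraction` parameter. The function name and defaults are illustrative, not taken from Cassandra's code.

```python
MIB = 1024 * 1024  # the thread's "1M" is taken here as 1 MiB (an assumption)


def estimate_memtable_slab_heap(num_tables, slab_bytes=MIB, active_fraction=1.0):
    """Rough heap bytes held by memtable slabs, one slab per table with a live memtable."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    # active_fraction is a guess at how many tables hold a memtable at once.
    live_memtables = round(num_tables * active_fraction)
    return live_memtables * slab_bytes


if __name__ == "__main__":
    for tables in (100, 1000):
        heap = estimate_memtable_slab_heap(tables)
        print(f"{tables} tables -> {heap / MIB:.0f} MiB")
```

With every table active, 1000 tables comes to 1000 MiB, which matches the "about 1GB" figure in the original question; the estimate covers only the slab cost discussed here, not any other per-table overhead.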
Re: Is per-table memory overhead due to SSTables or tables?
Hm, as a side note, it's amazing how much Cassandra information is locked up in JIRAs. I wonder if there's a way to automatically compute the JIRAs with important information.

On Fri, Aug 8, 2014 at 5:14 PM, graham sanderson gra...@vast.com wrote: See https://issues.apache.org/jira/browse/CASSANDRA-5935
Re: Is per-table memory overhead due to SSTables or tables?
Google ;-)

On Aug 8, 2014, at 7:33 PM, Kevin Burton bur...@spinn3r.com wrote: I wonder if there's a way to automatically compute the JIRAs with important information.