Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread Kevin Burton
The conventional wisdom says it's best to keep the number of tables in
Cassandra to the low hundreds, as each table can use 1 MB or so of heap.
So if you have 1,000 tables you'd have 1 GB of heap used (which is no
fun).

But is this an issue with the tables themselves or the SSTables?

I think the root of this is the SSTables, as all the arena overhead will
be for the SSTables too, and more SSTables means more overhead.

So by adding more tables, you end up with more SSTables, which means more
heap memory.

If I'm correct, then this means that Cassandra could benefit from table
partitioning, whereby you put all values in a specific region into a
specific set of tables.

So if you were storing log data, you could store it in hourly or daily
partitions, but view the table as one logical unit.

The benefit here is that you could easily drop just the oldest data. So
if you need to clean up data, you wouldn't have to drop the whole table,
just a day's worth of the data.

And since that day is just one SSTable on disk, the drop would be easy:
no tombstones, just delete the whole SSTable.
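
To make that concrete, here is a minimal CQL sketch of the scheme, with
one table per day. The table and column names are purely illustrative,
and the application would have to treat the logs_* tables as one logical
unit itself:

    -- Hypothetical per-day table; writes for Aug 8 go here.
    CREATE TABLE logs_2014_08_08 (
        source text,      -- log source, used as the partition key
        ts     timeuuid,  -- event time, used as the clustering column
        line   text,
        PRIMARY KEY (source, ts)
    );

    -- Cleanup is a cheap metadata operation rather than a tombstone
    -- storm: dropping the table removes its SSTables wholesale.
    DROP TABLE logs_2014_08_01;

The trade-off is exactly the per-table heap overhead this thread is
about, since each daily table carries its own memtable and metadata.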



-- 

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Re: Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread graham sanderson
See https://issues.apache.org/jira/browse/CASSANDRA-5935

2.1 has a radically different implementation that sidesteps this (with
off-heap memtables), but if you really want lots of tables now you can do
so as a trade-off against GC behavior.

The problem is not SSTables per se, but more the potential for one
memtable per CF (and with the slab allocator that can/does cost 1 MB); I
am not familiar enough with the code to know when you would have 1
memtable vs. 0 memtables for a CF that isn't currently actively used.
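
For anyone trying 2.1: the off-heap behavior is configured in
cassandra.yaml. A minimal sketch, assuming the option names from the 2.1
defaults file (verify against your own copy):

    # heap_buffers is the 2.1 default; offheap_objects moves the most
    # memtable data off the Java heap, which is what makes having many
    # tables cheaper in GC terms.
    memtable_allocation_type: offheap_objects

    # Optional size caps; if left unset they default to a fraction of
    # the heap.
    # memtable_heap_space_in_mb: 2048
    # memtable_offheap_space_in_mb: 2048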

Note also https://issues.apache.org/jira/browse/CASSANDRA-6602 and friends; 
there is definitely a need for efficient discarding of old data in event 
streams.
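
As a stopgap, the usual approach today is a table-level TTL, which still
pays the tombstone cost that makes 6602 attractive. An illustrative CQL
sketch (schema invented for the example):

    CREATE TABLE events (
        stream  text,
        ts      timeuuid,
        payload blob,
        PRIMARY KEY (stream, ts)
    ) WITH default_time_to_live = 86400;  -- expire rows after one day

Every expired row still becomes a tombstone that compaction has to chew
through, which is exactly the overhead that dropping a whole day's
SSTable (or table) would avoid.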


Re: Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread Kevin Burton
Hm.. as a side note, it's amazing how much Cassandra information is
locked up in JIRAs… I wonder if there's a way to automatically identify
the JIRAs with important information.


Re: Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread graham sanderson
Google ;-)
