Re: Combining all CFs into one big one

2011-05-02 Thread Tyler Hobbs
On Mon, May 2, 2011 at 5:05 AM, David Boxenhorn da...@taotown.com wrote:

 Wouldn't it be the case that the once-used rows in your batch process would
 quickly be traded out of the cache, and replaced by frequently-used rows?


Yes, and you'll pay a cache miss penalty for each of the replacements.


 This would be the case even if your batch process goes on for a long time,
 since caching is done on a row-by-row basis. In effect, it would mean that
 part of your cache is taken up by the batch process, much as if you
 dedicated a permanent cache to the batch - except that it isn't permanent,
 so it's better!


Right, but we didn't want to cache any of the batch CF in the first place,
because caching that CF is worth very little.  With separate CFs, we could
explicitly give it no cache.  Now we have no control over how much of the
cache it evicts.


Re: Combining all CFs into one big one

2011-05-02 Thread David Boxenhorn
I guess I'm still feeling fuzzy on this because my actual use-case isn't so
black-and-white. I don't have any CFs that are accessed purely, or even
mostly, in once-through batch mode. What I have is CFs with more and less
data, and CFs that are accessed more and less frequently.


On Mon, May 2, 2011 at 7:52 PM, Tyler Hobbs ty...@datastax.com wrote:

 On Mon, May 2, 2011 at 5:05 AM, David Boxenhorn da...@taotown.com wrote:

 Wouldn't it be the case that the once-used rows in your batch process
 would quickly be traded out of the cache, and replaced by frequently-used
 rows?


 Yes, and you'll pay a cache miss penalty for each of the replacements.


 This would be the case even if your batch process goes on for a long time,
 since caching is done on a row-by-row basis. In effect, it would mean that
 part of your cache is taken up by the batch process, much as if you
 dedicated a permanent cache to the batch - except that it isn't permanent,
 so it's better!


 Right, but we didn't want to cache any of the batch CF in the first place,
 because caching that CF is worth very little.  With separate CFs, we could
 explicitly give it no cache.  Now we have no control over how much of the
 cache it evicts.




Re: Combining all CFs into one big one

2011-05-02 Thread Tyler Hobbs
On Mon, May 2, 2011 at 12:06 PM, David Boxenhorn da...@taotown.com wrote:

 I guess I'm still feeling fuzzy on this because my actual use-case isn't so
 black-and-white. I don't have any CFs that are accessed purely, or even
 mostly, in once-through batch mode. What I have is CFs with more and less
 data, and CFs that are accessed more and less frequently.


I figured that was the case; the black-and-white CFs just make easier
examples.  Back at the beginning of the thread, I recommended merging CFs that
have similar data shapes and highly correlated access patterns.  For
example, if you had two CFs, user profile and group profile, that both
had fixed-length rows and tended to both be read for some types of
operations, those would be good candidates for merging.  Just take the
ideas from the black-and-white examples and apply them in a more nuanced
way.

-- 
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra
Python client library


Combining all CFs into one big one

2011-05-01 Thread David Boxenhorn
I'm having problems administering my cluster because I have too many CFs
(~40).

I'm thinking of combining them all into one big CF. I would prefix the
current CF name to the keys, repeat the CF name in a column, and index the
column (so I can loop over all rows, which I have to do sometimes, for some
CFs).

Can anyone think of any disadvantages to this approach?
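
A minimal sketch of the key scheme proposed above, written in plain Python
rather than against any Cassandra client. The separator character, the "cf"
column name, and all function names are assumptions made for illustration;
they do not come from the thread.

```python
SEPARATOR = ":"  # hypothetical separator; must never appear in a CF name


def merged_key(cf_name, row_key):
    """Prefix the old CF name to the row key, as proposed above."""
    if SEPARATOR in cf_name:
        raise ValueError("CF name must not contain the separator")
    return f"{cf_name}{SEPARATOR}{row_key}"


def split_key(key):
    """Recover (cf_name, row_key); only the first separator is significant."""
    cf_name, _, row_key = key.partition(SEPARATOR)
    return cf_name, row_key


def merged_row(cf_name, row_key, columns):
    """Build the row for the merged CF, repeating the CF name in a column.

    The "cf" column is the one that would carry the secondary index, so that
    all rows of one former CF can still be looped over.
    """
    row = dict(columns)
    row["cf"] = cf_name  # hypothetical column name
    return merged_key(cf_name, row_key), row
```

Note that the row key itself may contain the separator, which is why
split_key splits only on the first occurrence.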


Re: Combining all CFs into one big one

2011-05-01 Thread David Boxenhorn
Shouldn't these kinds of problems be solved by Cassandra? Isn't there a
maximum SSTable size?

On Sun, May 1, 2011 at 3:24 PM, shimi shim...@gmail.com wrote:

 Big sstables, long compactions, in major compaction you will need to have
 free disk space in the size of all the sstables (which you should have
 anyway).

 Shimi


 On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn da...@taotown.com wrote:

 I'm having problems administering my cluster because I have too many CFs
 (~40).

 I'm thinking of combining them all into one big CF. I would prefix the
 current CF name to the keys, repeat the CF name in a column, and index the
 column (so I can loop over all rows, which I have to do sometimes, for some
 CFs).

 Can anyone think of any disadvantages to this approach?





Re: Combining all CFs into one big one

2011-05-01 Thread Jake Luciani
If you have N column families you need N * memtable size of RAM to support
this.  If that's not an option you can merge them into one as you suggest
but then you will have much larger SSTables, slower compactions, etc.  I
don't necessarily agree with Tyler that the OS cache will be less
effective... But I do agree that if the sizes of sstables are too large for
you then more hardware is the solution...
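
The N * memtable size estimate above works out as follows. This is only the
rule of thumb from the message, not a model of how Cassandra actually
accounts for memory, and the 64 MB memtable size is an assumed figure chosen
for illustration.

```python
def memtable_ram_mb(num_cfs, memtable_mb):
    """Rule-of-thumb RAM needed for memtables: one memtable per CF."""
    return num_cfs * memtable_mb


# ~40 CFs (from the thread) at an assumed 64 MB per memtable
print(memtable_ram_mb(40, 64))  # 2560 (MB)

# the same data merged into a single CF
print(memtable_ram_mb(1, 64))   # 64 (MB)
```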

On Sun, May 1, 2011 at 1:24 PM, Tyler Hobbs ty...@datastax.com wrote:

 When you have a high number of CFs, it's a good idea to consider merging
 CFs with highly correlated access patterns and similar structure into one.
 It is *not* a good idea to merge all of your CFs into one (unless they all
 happen to meet these criteria). Here's why:

 Besides big compactions and long repairs that you can't break down into
 smaller pieces, the main problem here is that your caching will become much
 less efficient. The OS buffer cache will be less effective because rows from
 all of the CFs will be interspersed in the SSTables. You will no longer be
 able to tune the key or row cache to only cache frequently accessed data.
 Both of these will tend to cause a serious increase in latency for your hot
 data.

 Shouldn't these kinds of problems be solved by Cassandra?

 They are mainly solved by Cassandra's general solution to any performance
 problem: the addition of more nodes. There are tickets open to improve
 compaction strategies, put bounds on SSTable sizes, etc; for example,
 https://issues.apache.org/jira/browse/CASSANDRA-1608 , but the addition of
 more nodes is a reliable solution to problems of this nature.

 On Sun, May 1, 2011 at 7:28 AM, David Boxenhorn da...@taotown.com wrote:

 Shouldn't these kinds of problems be solved by Cassandra? Isn't there a
 maximum SSTable size?

 On Sun, May 1, 2011 at 3:24 PM, shimi shim...@gmail.com wrote:

 Big sstables, long compactions, in major compaction you will need to have
 free disk space in the size of all the sstables (which you should have
 anyway).

 Shimi


 On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn da...@taotown.com wrote:

 I'm having problems administering my cluster because I have too many CFs
 (~40).

 I'm thinking of combining them all into one big CF. I would prefix the
 current CF name to the keys, repeat the CF name in a column, and index the
 column (so I can loop over all rows, which I have to do sometimes, for some
 CFs).

 Can anyone think of any disadvantages to this approach?










-- 
http://twitter.com/tjake


Re: Combining all CFs into one big one

2011-05-01 Thread shimi
On Sun, May 1, 2011 at 9:48 PM, Jake Luciani jak...@gmail.com wrote:

 If you have N column families you need N * memtable size of RAM to support
 this.  If that's not an option you can merge them into one as you suggest
 but then you will have much larger SSTables, slower compactions, etc.



 I don't necessarily agree with Tyler that the OS cache will be less
 effective... But I do agree that if the sizes of sstables are too large for
 you then more hardware is the solution...


If you merge CFs which are hardly accessed with ones which are accessed
frequently, then when you read the SSTable you load data that is hardly
accessed into the OS cache.

Another thing you should be aware of: if you need to run any of the
nodetool cf tasks and you only need it for a specific CF, running it on
that specific CF is better and faster.

Shimi




 On Sun, May 1, 2011 at 1:24 PM, Tyler Hobbs ty...@datastax.com wrote:

 When you have a high number of CFs, it's a good idea to consider merging
 CFs with highly correlated access patterns and similar structure into one.
 It is *not* a good idea to merge all of your CFs into one (unless they all
 happen to meet these criteria). Here's why:

 Besides big compactions and long repairs that you can't break down into
 smaller pieces, the main problem here is that your caching will become much
 less efficient. The OS buffer cache will be less effective because rows from
 all of the CFs will be interspersed in the SSTables. You will no longer be
 able to tune the key or row cache to only cache frequently accessed data.
 Both of these will tend to cause a serious increase in latency for your hot
 data.

 Shouldn't these kinds of problems be solved by Cassandra?

 They are mainly solved by Cassandra's general solution to any performance
 problem: the addition of more nodes. There are tickets open to improve
 compaction strategies, put bounds on SSTable sizes, etc; for example,
 https://issues.apache.org/jira/browse/CASSANDRA-1608 , but the addition
 of more nodes is a reliable solution to problems of this nature.

 On Sun, May 1, 2011 at 7:28 AM, David Boxenhorn da...@taotown.com wrote:

 Shouldn't these kinds of problems be solved by Cassandra? Isn't there a
 maximum SSTable size?

 On Sun, May 1, 2011 at 3:24 PM, shimi shim...@gmail.com wrote:

 Big sstables, long compactions, in major compaction you will need to
 have free disk space in the size of all the sstables (which you should have
 anyway).

 Shimi


 On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn da...@taotown.com wrote:

 I'm having problems administering my cluster because I have too many
 CFs (~40).

 I'm thinking of combining them all into one big CF. I would prefix the
 current CF name to the keys, repeat the CF name in a column, and index the
 column (so I can loop over all rows, which I have to do sometimes, for some
 CFs).

 Can anyone think of any disadvantages to this approach?













Re: Combining all CFs into one big one

2011-05-01 Thread Tyler Hobbs
On Sun, May 1, 2011 at 2:16 PM, Jake Luciani jak...@gmail.com wrote:



 On Sun, May 1, 2011 at 2:58 PM, shimi shim...@gmail.com wrote:

 On Sun, May 1, 2011 at 9:48 PM, Jake Luciani jak...@gmail.com wrote:

 If you have N column families you need N * memtable size of RAM to
 support this.  If that's not an option you can merge them into one as you
 suggest but then you will have much larger SSTables, slower compactions,
 etc.



 I don't necessarily agree with Tyler that the OS cache will be less
 effective... But I do agree that if the sizes of sstables are too large for
 you then more hardware is the solution...


 If you merge CFs which are hardly accessed with one which are accessed
 frequently, when you read the SSTable you load data that is hardly accessed
 to the OS cache.


  Only the rows or portions of rows you read will be loaded into the OS
 cache.  Just because different rows are in the same file doesn't mean the
 entire file is loaded into the OS cache.  The bloom filter and index file
 will be loaded but those are not large files.


Right -- it does depend on the page size and the average amount of data
read.  The effect will be more pronounced on CFs with small rows than on
those with wide rows.
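
A rough sketch of why row size matters here. It assumes a 4096-byte OS page
and that a row starts on a page boundary, and it ignores readahead; all of
these are simplifications, and the function name is made up for the example.
It estimates what fraction of the pages pulled into the buffer cache by one
row read holds data other than the row that was asked for.

```python
import math

PAGE_SIZE = 4096  # bytes; an assumed OS page size


def unrelated_fraction(row_bytes, page_size=PAGE_SIZE):
    """Fraction of the loaded pages not occupied by the requested row."""
    pages = math.ceil(row_bytes / page_size)
    loaded = pages * page_size
    return (loaded - row_bytes) / loaded


# A 200-byte row drags in a page that is about 95% neighbouring data,
# which in a merged CF may belong to rarely accessed rows.
print(unrelated_fraction(200))

# A 64 KB wide row fills its 16 pages with its own data.
print(unrelated_fraction(64 * 1024))
```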


Re: Combining all CFs into one big one

2011-05-01 Thread David Boxenhorn
If you had one big cache, wouldn't it be the case that it's mostly populated
with frequently accessed rows, and less populated with rarely accessed rows?

In fact, wouldn't one big cache dynamically and automatically give you
exactly what you want? If you try to partition the same amount of memory
manually, by guesswork, among many tables, aren't you always going to do a
worse job?


On Sun, May 1, 2011 at 10:43 PM, Tyler Hobbs ty...@datastax.com wrote:

 On Sun, May 1, 2011 at 2:16 PM, Jake Luciani jak...@gmail.com wrote:



 On Sun, May 1, 2011 at 2:58 PM, shimi shim...@gmail.com wrote:

 On Sun, May 1, 2011 at 9:48 PM, Jake Luciani jak...@gmail.com wrote:

 If you have N column families you need N * memtable size of RAM to
 support this.  If that's not an option you can merge them into one as you
 suggest but then you will have much larger SSTables, slower compactions,
 etc.



 I don't necessarily agree with Tyler that the OS cache will be less
 effective... But I do agree that if the sizes of sstables are too large for
 you then more hardware is the solution...


 If you merge CFs which are hardly accessed with one which are accessed
 frequently, when you read the SSTable you load data that is hardly accessed
 to the OS cache.


  Only the rows or portions of rows you read will be loaded into the OS
 cache.  Just because different rows are in the same file doesn't mean the
 entire file is loaded into the OS cache.  The bloom filter and index file
 will be loaded but those are not large files.


 Right -- it does depend on the page size and the average amount of data
 read.  The effect will be more pronounced on CFs with small rows than on
 those with wide rows.



Re: Combining all CFs into one big one

2011-05-01 Thread Tyler Hobbs

 If you had one big cache, wouldn't it be the case that it's mostly
 populated with frequently accessed rows, and less populated with rarely
 accessed rows?


Yes.

 In fact, wouldn't one big cache dynamically and automatically give you
 exactly what you want? If you try to partition the same amount of memory
 manually, by guesswork, among many tables, aren't you always going to do a
 worse job?


Suppose you have one CF that's used constantly through interaction by
users.  Suppose you have another CF that's only used periodically by a batch
process; you tend to access most or all of its rows during the batch
process, and it's too large to cache all of the rows.  Normally, you would
dedicate cache space to the first CF, as anything with human interaction
tends to have good temporal locality and you want to keep latencies there
low.  On the other hand, caching the second CF provides little to no real
benefit.  When you combine these two CFs, every time your batch process
runs, rows from the second CF will populate the cache and evict rows from
the first CF, even though having the batch rows in the cache provides
little benefit to you.
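
The eviction effect described above can be illustrated with a toy LRU
simulation. This is a sketch under stated assumptions, not a model of
Cassandra's row or key cache: the cache is a plain LRU, the workload is
synthetic (100 hot rows read twice per round, then a 500-row batch scan), and
every name and number is made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert key and evict the LRU entry."""
        if key in self.rows:
            self.rows.move_to_end(key)
            return True
        self.rows[key] = None
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)
        return False


def hot_hit_rate(cache_batch_rows, rounds=5, hot_rows=100, batch_rows=500):
    """Hit rate seen by the hot (user-facing) rows only.

    cache_batch_rows=True models the merged CF, where batch reads go through
    the same cache; False models separate CFs with no cache for the batch CF.
    """
    cache = LRUCache(capacity=hot_rows)
    hits = total = 0
    for _ in range(rounds):
        for _ in range(2):  # users read every hot row twice per round
            for i in range(hot_rows):
                hits += cache.access(("hot", i))
                total += 1
        if cache_batch_rows:  # the batch scan passes through the shared cache
            for j in range(batch_rows):
                cache.access(("batch", j))
    return hits / total


print(hot_hit_rate(cache_batch_rows=True))   # 0.5
print(hot_hit_rate(cache_batch_rows=False))  # 0.9
```

In this toy workload each batch run flushes every hot row out of the shared
cache, so the first user read after a batch always misses.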

As another example, if you mix a CF with wide rows and a CF with small rows,
you no longer have the option of using a row cache, even if it makes great
sense for the small-row CF data.

Knowledge of data and access patterns gives you a very good advantage when
it comes to caching your data effectively.
