Re: Combining all CFs into one big one
On Mon, May 2, 2011 at 5:05 AM, David Boxenhorn da...@taotown.com wrote:

> Wouldn't it be the case that the once-used rows in your batch process would quickly be traded out of the cache, and replaced by frequently-used rows?

Yes, and you'll pay a cache miss penalty for each of the replacements.

> This would be the case even if your batch process goes on for a long time, since caching is done on a row-by-row basis. In effect, it would mean that part of your cache is taken up by the batch process, much as if you dedicated a permanent cache to the batch - except that it isn't permanent, so it's better!

Right, but we didn't want to cache any of the batch CF in the first place, because caching that CF is worth very little. With separate CFs, we could explicitly give it no cache. Now we have no control over how much of the cache it evicts.
Re: Combining all CFs into one big one
I guess I'm still feeling fuzzy on this because my actual use-case isn't so black-and-white. I don't have any CFs that are accessed purely, or even mostly, in once-through batch mode. What I have is CFs with more and less data, and CFs that are accessed more and less frequently.
Re: Combining all CFs into one big one
On Mon, May 2, 2011 at 12:06 PM, David Boxenhorn da...@taotown.com wrote:

> I guess I'm still feeling fuzzy on this because my actual use-case isn't so black-and-white. I don't have any CFs that are accessed purely, or even mostly, in once-through batch mode. What I have is CFs with more and less data, and CFs that are accessed more and less frequently.

I figured that was the case; the black-and-white CFs just make easier examples. Back at the beginning of the thread, I recommended merging CFs that have similar data shapes and highly correlated access patterns. For example, if you had two CFs, user profile and group profile, that both had fixed-length rows and tended to be read together for some types of operations, those would be good candidates for merging. Just take the ideas from the black-and-white examples and apply them in a more nuanced way.

--
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
Combining all CFs into one big one
I'm having problems administering my cluster because I have too many CFs (~40). I'm thinking of combining them all into one big CF. I would prefix the current CF name to the keys, repeat the CF name in a column, and index the column (so I can loop over all rows, which I have to do sometimes, for some CFs). Can anyone think of any disadvantages to this approach?
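A minimal sketch of the key layout proposed above, in plain Python with no Cassandra client involved. The helper names, the `:` separator, and the `cf_name` column name are all hypothetical; the post only says that the old CF name is prefixed to the key and repeated in an indexed column.

```python
# Hypothetical helpers: none of these names come from Cassandra or pycassa.
SEPARATOR = ":"  # assumed delimiter; it must never appear in an original CF name


def merged_key(cf_name, row_key):
    """Build the row key used in the single merged CF."""
    if SEPARATOR in cf_name:
        raise ValueError("CF name must not contain the separator")
    return cf_name + SEPARATOR + row_key


def merged_columns(cf_name, columns):
    """Copy the row's columns and repeat the old CF name in an extra column.

    That extra column is the one the post proposes to index, so that all
    rows of one former CF can be looped over.
    """
    row = dict(columns)
    row["cf_name"] = cf_name  # hypothetical column name
    return row


def split_key(key):
    """Recover (cf_name, row_key) from a merged key."""
    cf_name, _, row_key = key.partition(SEPARATOR)
    return cf_name, row_key


key = merged_key("UserProfile", "alice")
row = merged_columns("UserProfile", {"email": "alice@example.com"})
print(key, row)
```

Splitting on the first separator only means that original row keys may still contain the separator; only the CF names are restricted.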
Re: Combining all CFs into one big one
Shouldn't these kinds of problems be solved by Cassandra? Isn't there a maximum SSTable size?

On Sun, May 1, 2011 at 3:24 PM, shimi shim...@gmail.com wrote:

> Big SSTables and long compactions; in a major compaction you will need to have free disk space equal to the size of all the SSTables (which you should have anyway).
>
> Shimi
Re: Combining all CFs into one big one
If you have N column families you need N * memtable size of RAM to support this. If that's not an option you can merge them into one as you suggest but then you will have much larger SSTables, slower compactions, etc.

I don't necessarily agree with Tyler that the OS cache will be less effective... But I do agree that if the sizes of SSTables are too large for you then more hardware is the solution...

On Sun, May 1, 2011 at 1:24 PM, Tyler Hobbs ty...@datastax.com wrote:

> When you have a high number of CFs, it's a good idea to consider merging CFs with highly correlated access patterns and similar structure into one. It is *not* a good idea to merge all of your CFs into one (unless they all happen to meet these criteria). Here's why:
>
> Besides big compactions and long repairs that you can't break down into smaller pieces, the main problem here is that your caching will become much less efficient. The OS buffer cache will be less effective because rows from all of the CFs will be interspersed in the SSTables. You will no longer be able to tune the key or row cache to only cache frequently accessed data. Both of these will tend to cause a serious increase in latency for your hot data.
>
> > Shouldn't these kinds of problems be solved by Cassandra?
>
> They are mainly solved by Cassandra's general solution to any performance problem: the addition of more nodes. There are tickets open to improve compaction strategies, put bounds on SSTable sizes, etc.; for example, https://issues.apache.org/jira/browse/CASSANDRA-1608 , but the addition of more nodes is a reliable solution to problems of this nature.

--
http://twitter.com/tjake
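The "N * memtable size" rule of thumb above is simple arithmetic. As a rough sketch: the 40 CFs come from the original post, while the 64 MB memtable size is only an assumed figure for illustration and is not stated anywhere in the thread.

```python
def memtable_ram_mb(num_cfs, memtable_mb):
    """RAM needed if every CF can hold one full memtable at the same time."""
    return num_cfs * memtable_mb


ASSUMED_MEMTABLE_MB = 64  # illustrative value only, not from the thread

print(memtable_ram_mb(40, ASSUMED_MEMTABLE_MB))  # 40 separate CFs -> 2560
print(memtable_ram_mb(1, ASSUMED_MEMTABLE_MB))   # one merged CF   -> 64
```

In practice a merged CF would probably be given a larger memtable than any one of the original CFs, so the saving is unlikely to be the full factor of 40.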
Re: Combining all CFs into one big one
On Sun, May 1, 2011 at 9:48 PM, Jake Luciani jak...@gmail.com wrote:

> If you have N column families you need N * memtable size of RAM to support this. If that's not an option you can merge them into one as you suggest but then you will have much larger SSTables, slower compactions, etc.
>
> I don't necessarily agree with Tyler that the OS cache will be less effective... But I do agree that if the sizes of SSTables are too large for you then more hardware is the solution...

If you merge CFs that are hardly accessed with ones that are accessed frequently, then when you read the SSTable you load hardly-accessed data into the OS cache.

Another thing you should be aware of: if you need to run any of the nodetool CF tasks, and you really need it only for a specific CF, running it on that specific CF is better and faster.

Shimi
Re: Combining all CFs into one big one
On Sun, May 1, 2011 at 2:16 PM, Jake Luciani jak...@gmail.com wrote:

> On Sun, May 1, 2011 at 2:58 PM, shimi shim...@gmail.com wrote:
>
> > If you merge CFs which are hardly accessed with one which are accessed frequently, when you read the SSTable you load data that is hardly accessed to the OS cache.
>
> Only the rows or portions of rows you read will be loaded into the OS cache. Just because different rows are in the same file doesn't mean the entire file is loaded into the OS cache. The bloom filter and index file will be loaded but those are not large files.

Right. It does depend on the page size and the average amount of data read. The effect will be more pronounced on CFs with small rows than on those with wide rows.
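A rough sketch of the page-size effect described above. It assumes a 4 KiB OS page (a common default, not stated in the thread), that the row read starts on a page boundary, and that whatever else shares its pages belongs to other rows. The function name is hypothetical.

```python
import math


def unrelated_fraction(row_bytes, page_bytes=4096):
    """Fraction of the bytes pulled into the page cache that are not the row read.

    Simplification: the row starts at a page boundary, so reading it touches
    ceil(row_bytes / page_bytes) whole pages.
    """
    pages = math.ceil(row_bytes / page_bytes)
    loaded = pages * page_bytes
    return (loaded - row_bytes) / loaded


# A small row leaves most of its page to neighbouring rows, which in a
# merged CF may belong to rarely read data.
print(unrelated_fraction(200))
# A wide row fills almost all of the pages it touches.
print(unrelated_fraction(1_000_000))
```

Under these assumptions the small-row read drags in mostly neighbouring data, while the wide-row read wastes well under one percent of what it loads.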
Re: Combining all CFs into one big one
If you had one big cache, wouldn't it be the case that it's mostly populated with frequently accessed rows, and less populated with rarely accessed rows?

In fact, wouldn't one big cache dynamically and automatically give you exactly what you want? If you try to partition the same amount of memory manually, by guesswork, among many tables, aren't you always going to do a worse job?
Re: Combining all CFs into one big one
> If you had one big cache, wouldn't it be the case that it's mostly populated with frequently accessed rows, and less populated with rarely accessed rows?

Yes.

> In fact, wouldn't one big cache dynamically and automatically give you exactly what you want? If you try to partition the same amount of memory manually, by guesswork, among many tables, aren't you always going to do a worse job?

Suppose you have one CF that's used constantly through interaction by users. Suppose you have another CF that's only used periodically by a batch process, you tend to access most or all of the rows during the batch process, and it's too large to cache all of the rows. Normally, you would dedicate cache space to the first CF, as anything with human interaction tends to have good temporal locality and you want to keep latencies there low. On the other hand, caching the second CF provides little to no real benefit. When you combine these two CFs, every time your batch process runs, rows from the second CF will populate the cache and will cause eviction of rows from the first CF, even though having those rows in the cache provides little benefit to you.

As another example, if you mix a CF with wide rows and a CF with small rows, you no longer have the option of using a row cache, even if it makes great sense for the small-row CF data.

Knowledge of data and access patterns gives you a very good advantage when it comes to caching your data effectively.

--
Tyler Hobbs
Software Engineer, DataStax http://datastax.com/
Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
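The eviction argument above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a plain LRU cache stands in for the row cache (the thread does not name the eviction policy), the sizes are arbitrary, and the class and function names are hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache keyed by (cf_name, row_key)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def get(self, key):
        """Return True on a hit; on a miss, load the row and evict the LRU row if full."""
        if key in self.rows:
            self.rows.move_to_end(key)
            return True
        self.rows[key] = True
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)
        return False


def hot_hit_rate(shared_cache, seed=0):
    """Hit rate of the interactive CF while a batch scan runs alongside it.

    shared_cache=True models the merged CF: batch rows go through the same cache.
    shared_cache=False models separate CFs with no cache given to the batch CF.
    """
    rng = random.Random(seed)
    cache = LRUCache(capacity=100)
    hot_keys = [("users", i) for i in range(100)]
    batch_keys = [("batch", i) for i in range(1000)]
    for key in hot_keys:  # warm the cache with the interactive rows
        cache.get(key)
    hits = 0
    reads = 10000
    for step in range(reads):
        if shared_cache:
            # One batch read per user read; each one enters the cache.
            cache.get(batch_keys[step % len(batch_keys)])
        hits += cache.get(rng.choice(hot_keys))
    return hits / reads


print("separate CFs, batch uncached:", hot_hit_rate(shared_cache=False))
print("merged CF, one shared cache: ", hot_hit_rate(shared_cache=True))
```

With the batch CF kept out of the cache, the interactive rows all fit and every read hits. With a shared cache, each batch row read pushes out a row that was more likely to be read again.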