Re: How to force GC in Cassandra?
Thanks Jonathan! I looked into your code and figured that compaction is what cleans all deleted columns from the sstables.

-Weijun

On Fri, Mar 12, 2010 at 12:05 PM, Jonathan Ellis jbel...@gmail.com wrote:

I think you mean compaction? You can use nodeprobe / nodetool for that. http://wiki.apache.org/cassandra/NodeProbe

On Fri, Mar 12, 2010 at 12:40 PM, Weijun Li weiju...@gmail.com wrote:

Suppose I insert a lot of new items but also delete a lot of items daily; it would be ideal if I could force GC to happen at midnight (when traffic is low). Is there any way to manually force GC to be executed? That way I could add a cron job to trigger it at midnight. I tried nodetool and the JMX interface but they don't seem to have that.

-Weijun
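A cron entry for the nightly compaction Jonathan suggests might look like the following (the install path, JMX port, and schedule are assumptions, and on older builds the tool is nodeprobe rather than nodetool):

```
# m h dom mon dow  command -- run a major compaction at 3am daily
0 3 * * * /opt/cassandra/bin/nodetool -host localhost -port 8080 compact
```

Note that this triggers sstable compaction (which purges tombstoned data), not a JVM garbage collection.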
How to force GC in Cassandra?
Suppose I insert a lot of new items but also delete a lot of items daily; it would be ideal if I could force GC to happen at midnight (when traffic is low). Is there any way to manually force GC to be executed? That way I could add a cron job to trigger it at midnight. I tried nodetool and the JMX interface but they don't seem to have that.

-Weijun
Re: Strategy to delete/expire keys in cassandra
Hi Sylvain,

I applied your patch to 0.5 but it seems that it's not compilable:

1) column.getTtl() is not defined, in RowMutation.java:

    public static RowMutation getRowMutation(String table, String key, Map<String, List<ColumnOrSuperColumn>> cfmap)
    {
        RowMutation rm = new RowMutation(table, key.trim());
        for (Map.Entry<String, List<ColumnOrSuperColumn>> entry : cfmap.entrySet())
        {
            String cfName = entry.getKey();
            for (ColumnOrSuperColumn cosc : entry.getValue())
            {
                if (cosc.column == null)
                {
                    assert cosc.super_column != null;
                    for (org.apache.cassandra.service.Column column : cosc.super_column.columns)
                    {
                        rm.add(new QueryPath(cfName, cosc.super_column.name, column.name),
                               column.value, column.timestamp, column.getTtl());
                    }
                }
                else
                {
                    assert cosc.super_column == null;
                    rm.add(new QueryPath(cfName, null, cosc.column.name),
                           cosc.column.value, cosc.column.timestamp, cosc.column.getTtl());
                }
            }
        }
        return rm;
    }

2) CassandraServer.java: Column.setTtl() is not defined:

    if (column instanceof ExpiringColumn)
    {
        thrift_column.setTtl(((ExpiringColumn) column).getTimeToLive());
    }

3) CliClient.java: type mismatch for ColumnParent:

    thriftClient_.insert(tableName, key,
                         new ColumnParent(columnFamily, superColumnName),
                         new Column(columnName, value.getBytes(), System.currentTimeMillis()),
                         ConsistencyLevel.ONE);

It seems that the patch doesn't add the getTtl()/setTtl() methods to Column.java?

Thanks,
-Weijun

-----Original Message-----
From: Sylvain Lebresne [mailto:sylv...@yakaz.com]
Sent: Thursday, February 25, 2010 2:23 AM
To: Weijun Li
Cc: cassandra-user@incubator.apache.org
Subject: Re: Strategy to delete/expire keys in cassandra

Hi,

Should I just run a command (in the Cassandra 0.5 source folder?) like:
patch -p1 -i 0001-Add-new-ExpiringColumn-class.patch
for all of the five patches in your ticket?

Well, actually I lied. The patches were made for a version a little after 0.5.
If you really want to try, I attach a version of those patches that (should) work with 0.5 (there are only the first 3 patches; the fourth one is for tests, so not necessary per se). Apply them with your patch command. Still, to compile that you will have to regenerate the thrift java interface (with ant gen-thrift-java), but for that you will have to install the right svn revision of thrift (which is libthrift-r820831 for 0.5). And if you manage to make it work, you will have to dig into cassandra.thrift, as the patches make changes to it. In the end, remember that this is not an official patch yet and it *will not* make it into Cassandra in its current form. All I can tell you is that I need those expiring columns for quite a bit of my own usage and I will do what I can to make this feature included if and when possible.

Also, what's your opinion on extending ExpiringColumn to expire a key completely? Otherwise it will be difficult to track which rows are expired or old in Cassandra.

I'm not sure how to make full rows (or even full superColumns, for that matter) expire. What if you set a row to expire after some time and add new columns before this expiration? Should you update the expiration of the row? Which is to say that a row expires when its last column expires, which is almost what you get with expiring columns. The one thing you may want, though, is that when all the columns of a row expire (or, to be precise, get physically deleted), the row itself is deleted. Looking at the code, I'm not convinced this happens, and I'm not sure why.

--
Sylvain
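The build steps Sylvain describes, as a rough sketch run from the Cassandra 0.5 source root (only the first patch filename appears in the thread; the others are elided here):

```
patch -p1 -i 0001-Add-new-ExpiringColumn-class.patch
# ...apply the remaining patches the same way...

# regenerate the thrift java interface (requires the thrift svn
# revision matching libthrift-r820831 for 0.5, as noted above)
ant gen-thrift-java
```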
Re: Strategy to delete/expire keys in cassandra
Never mind. Figured out I forgot to compile thrift :)

Thanks,
-Weijun
RE: Strategy to delete/expire keys in cassandra
Thanks for the patch Sylvain! I remember that during the build Cassandra re-generates the thrift java code (in src/) with a libthrift jar, is this correct?

Here's my use case:
1) Write/read ratio is close to 1:1.
2) High volume of traffic, and I want low read latency (e.g., 40ms). That's why I'm testing a build with row-level cache and mmap (I think Jonathan is right that mmap does help with performance).
3) A row should expire if its last modified time is too old, so we don't need to worry about scanning all keys to clean up old items. So yes, if you write to a row, the last-modified time should be updated as well.
4) (nice to have) Support for range scan (key iteration) with RP.

So ideally a row should have a last-modified-time field. Or, I could use one column to record the last modified time (but this means each write to a row must be followed by another one to update the last-modified column, which is kind of ugly). For the simplest case, suppose each row has just one ExpiringColumn: will the row be deleted automatically when it has no columns left? Does it make sense for Cassandra to keep a row without any columns?

Please let me know if the following plan will work or not:
1) Manually apply your patch to the trunk build that I use (which has row-level cache and mmap). It would be nice if you could say a few words about the design of your ExpiringColumn :-)
2) Find the API entry point for deleting a row, and modify the expiration handler (supposing you have one) of ExpiringColumn to call the key-delete method when the key has no other columns (if that doesn't happen already).

How do you trigger the expiration check for an ExpiringColumn? Upon a hit of the column? Or with a timer that scans all columns for expiration?
Thanks,
-Weijun
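A minimal sketch of how a time-to-live check like the one in Sylvain's ExpiringColumn patch could work (all names here are illustrative, and the lazy-check-on-read behavior is an assumption about the patch, not something stated in the thread):

```java
// Sketch of TTL-style expiry: a column records its creation time and a
// time-to-live; readers treat it as deleted once the TTL has elapsed.
// Class and method names are hypothetical, not the patch's actual API.
public class ExpiringColumnSketch {
    private final long createdAtMillis;
    private final int timeToLiveSeconds;

    public ExpiringColumnSketch(long createdAtMillis, int timeToLiveSeconds) {
        this.createdAtMillis = createdAtMillis;
        this.timeToLiveSeconds = timeToLiveSeconds;
    }

    // Assumed to be checked lazily on read and at compaction,
    // rather than by a background timer scanning all columns.
    public boolean isExpired(long nowMillis) {
        return nowMillis >= createdAtMillis + timeToLiveSeconds * 1000L;
    }

    public static void main(String[] args) {
        ExpiringColumnSketch c = new ExpiringColumnSketch(0L, 60);
        System.out.println(c.isExpired(30_000L)); // false: 30s elapsed, 60s TTL
        System.out.println(c.isExpired(60_000L)); // true: TTL elapsed
    }
}
```

Under this lazy model, nothing fires at the moment of expiry, which is consistent with Weijun's open question about when a fully expired row would actually be removed.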
RE: Strategy to delete/expire keys in cassandra
Hi Sylvain,

I just noticed that you are the one who implemented the Expiring Column feature. Could you please help with my questions? Should I just run a command (in the Cassandra 0.5 source folder?) like:

    patch -p1 -i 0001-Add-new-ExpiringColumn-class.patch

for all of the five patches in your ticket?

Also, what's your opinion on extending ExpiringColumn to expire a key completely? Otherwise it will be difficult to track which rows are expired or old in Cassandra.

Thanks,
-Weijun
Strategy to delete/expire keys in cassandra
It seems that we mostly talk about writing and reading keys into/from a Cassandra cluster. I'm wondering how you have successfully dealt with deleting/expiring keys in Cassandra. A typical example: you want to delete keys that haven't been modified in a certain time period (i.e., old keys). Here are my thoughts:

1) If you use the order-preserving partitioner, you need to iterate through all keys, periodically, to check their last modified time and decide whether a key should be deleted. When you have hundreds of millions of keys with high write/read traffic, it will be very time- and resource-consuming to iterate all keys in all clusters.

2) If you use the random partitioner, you'll need to keep a list of ALL keys somewhere, keep it updated over time, and go through it periodically to delete expired items. Again, with hundreds of millions of keys, maintaining such a big dynamic key list with expiration times is not trivial work.

3) Once keys are deleted, do you have to wait till the next GC to clean them from disk or memory (suppose you don't run cleanup manually)? What's the strategy for Cassandra to handle deleted items (notify other replica nodes, clean up memory/disk, defrag/rebuild disk files, rebuild the bloom filter, etc.)? I'm asking this because if the keys refresh very fast (i.e., high-volume write/read and expiration is kind of short), how will the data files grow and how does this impact system performance?

So what's your opinion on how to handle the above cases to expire keys? I'm trying to decide whether we can use Cassandra for high-traffic read-only, write-only, or mixed read-and-write workloads.

Thanks,
-Weijun
Re: Strategy to delete/expire keys in cassandra
Thanks for the answer. A dumb question: how did you apply the patch file to the 0.5 source? The link you gave doesn't mention that the patch is for 0.5?

Also, this ExpiringColumn feature doesn't seem to expire keys/rows, meaning the number of keys will keep growing (even if you drop their columns) unless you delete them. In your case, how do you manage deleting/expiring keys from Cassandra? Do you keep a list of keys somewhere and go through them once in a while?

Thanks,
-Weijun

On Tue, Feb 23, 2010 at 2:26 AM, Sylvain Lebresne sylv...@yakaz.com wrote:

Hi,

Maybe the following ticket/patch is what you are looking for: https://issues.apache.org/jira/browse/CASSANDRA-699

It's flagged for 0.7, but as it breaks the API (and if I understand the release plan correctly) it may not make it into Cassandra before 0.8 (and the patch will have to change to accommodate the changes that will be made to the internals in 0.7). Anyway, what I can at least tell you is that I'm using the patch against 0.5 in a test cluster without problems so far.

3) Once keys are deleted, do you have to wait till the next GC to clean them from disk or memory (suppose you don't run cleanup manually)? What's the strategy for Cassandra to handle deleted items (notify other replica nodes, clean up memory/disk, defrag/rebuild disk files, rebuild the bloom filter, etc.)? I'm asking this because if the keys refresh very fast (i.e., high-volume write/read and expiration is kind of short), how will the data files grow and how does this impact system performance?

Items are deleted only during compaction, and you may actually have to wait for GCGraceSeconds before deletion. This value is configurable in storage-conf.xml, but is 10 days by default. You can decrease this value, but because of consistency (and the fact that you have to at least wait for a compaction to occur) there will always be a delay before the actual delete (all this is also true for the patch I mentioned above, by the way).
But when it's deleted, it's just skipping the items during compaction, so it's really cheap. -- Sylvain
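For reference, the grace period Sylvain mentions lives in storage-conf.xml; a fragment with the 10-day default he cites (surrounding elements omitted):

```xml
<!-- Time, in seconds, to keep tombstones for deleted data before they
     become eligible for physical removal at compaction: 10 days. -->
<GCGraceSeconds>864000</GCGraceSeconds>
```

Lowering it shortens the delay before space is reclaimed, at the cost of the consistency risk Sylvain describes (a node that misses the delete and outlives the grace period can resurrect the data).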
Re: Testing row cache feature in trunk: write should put record in cache
I see. How much is the overhead of Java serialization? Does it slow down the system a lot? It seems to be a tradeoff between CPU usage and memory.

As for mmap in 0.6: do you mmap the sstable data file even if it is a lot larger than the available memory (e.g., the data file is over 100GB while you have only 8GB of RAM)? How efficient is mmap in this case? Is mmap already checked into the 0.6 branch?

-Weijun

On Fri, Feb 19, 2010 at 4:56 AM, Jonathan Ellis jbel...@gmail.com wrote:

The whole point of the row cache is to avoid the serialization overhead, though. If we just wanted the serialized form cached, we would let the OS block cache handle that without adding an extra layer. (0.6 uses mmap'd I/O by default on 64-bit JVMs, so this is very efficient.)

On Fri, Feb 19, 2010 at 3:29 AM, Weijun Li weiju...@gmail.com wrote:

The memory overhead issue is not directly related to GC, because by the time the JVM ran out of memory the GC had already been very busy for quite a while. In my case the JVM consumed all of the 6GB when the row cache size hit 1.4 mil. I haven't started testing the row cache feature yet. But I think data compression is useful to reduce memory consumption, because in my impression disk I/O is always the bottleneck for Cassandra while its CPU usage is usually low. In addition, compression should also reduce the number of Java objects dramatically (correct me if I'm wrong), especially in case we need to cache most of the data to achieve decent read latency. If ColumnFamily is serializable it shouldn't be that hard to implement the compression feature, which can be controlled by an option (again :-) in storage-conf.xml. When I get to that point you can instruct me to implement this feature along with the row-cache write-through. Our goal is straightforward: to support short read latency in a high-volume web application with a write/read ratio of 1:1.
-Weijun

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Thursday, February 18, 2010 12:04 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Testing row cache feature in trunk: write should put record in cache

Did you force a GC from jconsole to make sure you weren't just measuring uncollected garbage?
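The memory-vs-CPU tradeoff Weijun raises (caching rows as compressed bytes instead of object trees) can be sketched like this; it is purely illustrative, not Cassandra code, and uses the standard java.util.zip Deflater:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

// Illustration of the tradeoff discussed above: caching a compressed
// byte[] per row costs CPU on each read (to inflate it back) but can
// shrink the cache footprint and the Java object count considerably.
public class RowCompressionSketch {
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] row = new byte[1024];       // a 1k record, as in the thread
        Arrays.fill(row, (byte) 'x');      // highly compressible payload
        System.out.println("raw: " + row.length
                + " bytes, compressed: " + compress(row).length + " bytes");
    }
}
```

Real row data compresses far less well than this repetitive payload, so any actual savings would need measuring against the workload.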
Unbalanced read latency among nodes in a cluster
I set up two Cassandra clusters with 2 nodes each. Both use the random partitioner. It's strange that in each cluster, one node has much shorter read latency than the other. This is the info for one of the clusters:

Node A: read count 77302, data file 41GB, read latency 58180, io saturation 100%
Node B: read count 488753, data file 26GB, read latency 5822, io saturation 35%

I first started node A, then ran B to join the cluster. Both machines have exactly the same hardware and OS. The test client randomly picks a node to write to, and this worked fine for the other cluster.

Address  Status  Load      Range                                     Ring
                           169400792707028208569145873749456918214
10.xxx   Up      38.39 GB  103633195217832666843316719920043079797   |--|
10.xxx   Up      24.22 GB  169400792707028208569145873749456918214   |--|

For both clusters, the node with the larger data file shows much worse read latency. What's the algorithm Cassandra uses to split the token range when a new node joins? What could cause this unbalanced read latency? How can I fix it? How do I make sure all nodes get evenly distributed data and traffic?

-Weijun
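For context on how the ring splits: with RandomPartitioner, tokens live in the range 0..2^127, and a perfectly balanced N-node ring places node i at token i * 2^127 / N. This sketch (my own illustration, not code from the thread) computes such tokens; in a live cluster you would assign them via the InitialToken setting in storage-conf.xml or by moving nodes.

```java
import java.math.BigInteger;

// Sketch: evenly spaced RandomPartitioner tokens for an N-node ring.
// Assumes the MD5-based token space 0 .. 2^127.
public class BalancedTokens {
    static BigInteger token(int i, int nodes) {
        return BigInteger.ONE.shiftLeft(127)          // 2^127
                .multiply(BigInteger.valueOf(i))
                .divide(BigInteger.valueOf(nodes));
    }

    public static void main(String[] args) {
        int nodes = 2;
        for (int i = 0; i < nodes; i++) {
            System.out.println("node " + i + ": " + token(i, nodes));
        }
    }
}
```

A node that simply bisects a joiner's range (rather than using precomputed tokens) can easily leave the ring unbalanced, which matches the uneven Load figures in the nodetool ring output above.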
Re: Testing row cache feature in trunk: write should put record in cache
OK, I'll work on the change later, because there's another problem to solve: the overhead of the cache is so big that 1.4 mil records (1k each) consumed all of the 6GB of JVM memory (I guess 4GB is consumed by the row cache). I'm thinking that ConcurrentHashMap is not a good choice for an LRU, and the row cache needs to store compressed key data to reduce memory usage. I'll do more investigation on this and let you know.

-Weijun

On Tue, Feb 16, 2010 at 9:22 PM, Jonathan Ellis jbel...@gmail.com wrote:

... tell you what, if you write the option-processing part in DatabaseDescriptor I will do the actual cache part. :)

On Tue, Feb 16, 2010 at 11:07 PM, Jonathan Ellis jbel...@gmail.com wrote:

https://issues.apache.org/jira/secure/CreateIssue!default.jspa, but this is pretty low priority for me.

On Tue, Feb 16, 2010 at 8:37 PM, Weijun Li weiju...@gmail.com wrote:

Just tried to make a quick change to enable it, but it didn't work out :-(

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());
    // What I modified
    if (cachedRow == null)
    {
        cfs.cacheRow(mutation.key());
        cachedRow = cfs.getRawCachedRow(mutation.key());
    }
    if (cachedRow != null)
        cachedRow.addAll(columnFamily);

How can I open a ticket for you to make the change (enable row cache write-through with an option)?

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 5:20 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote:

Just started to play with the row cache feature in trunk: it seems to be working fine so far, except that for the RowsCached parameter you need to specify the number of rows rather than a percentage (e.g., "20%" doesn't work).

20% works, but it's 20% of the rows at server startup. So on a fresh start that is zero. Maybe we should just get rid of the % feature...
(Actually, it shouldn't be hard to update this on flush, if you want to open a ticket.)
Re: Cassandra benchmark shows OK throughput but high read latency ( 100ms)?
Dumped 50 mil records into my 2-node cluster overnight and made sure that there aren't many data files (around 30 only), per Martin's suggestion. The size of the data directory is 63GB. Now when I read records from the cluster, the read latency is still ~44ms, and there's no write happening during the reads. And iostat shows that the disk (RAID10, 4x 250GB 15k SAS) is saturated:

Device:  rrqm/s  wrqm/s  r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda      47.67   67.67   190.33  17.00  23933.33  677.33  118.70    5.24      25.25  4.64   96.17
sda1     0.00    0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00
sda2     47.67   67.67   190.33  17.00  23933.33  677.33  118.70    5.24      25.25  4.64   96.17
sda3     0.00    0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00

CPU usage is low. Does this mean disk I/O is the bottleneck in my case? Will it help if I increase KeysCachedFraction to cache all of the sstable index? Also, this is almost a read-only test; in reality our write/read ratio is close to 1:1, so I'm guessing read latency will go even higher in that case, because it will be difficult for Cassandra to find a good moment to compact data files that are busy being written.

Thanks,
-Weijun

On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams dri...@gmail.com wrote:

On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller martin.grabmuel...@eleven.de wrote:

In my tests I have observed that good read latency depends on keeping the number of data files low. In my current test setup, I have stored 1.9 TB of data on a single node, which is in 21 data files, and read latency is between 10 and 60ms (for small reads; larger reads of course take more time). In earlier stages of my test, I had up to 5000 data files, and read performance was quite bad: my configured 10-second RPC timeout was regularly encountered.

I believe it is known that crossing sstables is O(NlogN), but I'm unable to find the ticket on this at the moment.
Perhaps Stu Hood will jump in and enlighten me, but in any case I believe https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve it. Keeping write volume low enough that compaction can keep up is one solution, and throwing hardware at the problem is another, if necessary. Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for repeat hits. -Brandon
Re: Cassandra benchmark shows OK throughput but high read latency ( 100ms)?
One more thought about Martin's suggestion: is it possible to put the data files into multiple directories that are located on different physical disks? This should help with the I/O bottleneck. Has anybody tested the row-caching feature in trunk (slated for 0.6)?

-Weijun
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Thanks for the DataFileDirectory trick; I'll give it a try. Just noticed the impact of the number of data files: node A has 13 data files with a read latency of 20ms and node B has 27 files with a read latency of 60ms. After I ran nodeprobe compact on node B its read latency went up to 150ms, while the read latency of node A became as low as 10ms. Is this normal behavior? I'm using the random partitioner and the hardware/JVM settings are exactly the same for these two nodes. Another problem: Java heap usage always stays around 900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency? -Weijun On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote: On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote: One more thought about Martin's suggestion: is it possible to put the data files into multiple directories that are located on different physical disks? This should help with the i/o bottleneck issue. Yes, you can already do this, just add more DataFileDirectory directives pointed at multiple drives. Has anybody tested the row-caching feature in trunk (shoot for 0.6?)? Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random io can only be so fast, however. -Brandon
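For reference, Brandon's suggestion maps to the data-directory section of the storage configuration; a minimal sketch, assuming the 0.5-era storage-conf.xml element names (the paths here are hypothetical examples):

```xml
<!-- storage-conf.xml: one DataFileDirectory per physical disk spreads
     sstable reads/writes across spindles; example paths only -->
<DataFileDirectories>
    <DataFileDirectory>/disk1/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/disk2/cassandra/data</DataFileDirectory>
</DataFileDirectories>
```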
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Still seeing high read latency with 50mil records in the 2-node cluster (replication factor 2). I restarted both nodes but read latency is still above 60ms and disk i/o saturation is high. Tried compact and repair but they don't help much. When I reduced the client threads from 15 to 5 it looks a lot better, but throughput is kind of low. I changed to 16 flushing threads instead of the default 8; could that cause the disk saturation issue? For benchmarks with decent throughput and latency, how many client threads are used? Can anyone share a storage-conf.xml from a well-tuned high-volume cluster? -Weijun On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote: After I ran nodeprobe compact on node B its read latency went up to 150ms. The compaction process can take a while to finish... in 0.5 you need to watch the logs to figure out when it has actually finished, and then you should start seeing the improvement in read latency. Is there any way to utilize all of the heap space to decrease the read latency? In 0.5 you can adjust the number of keys that are cached by changing the 'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally cache rows. You don't want to use up all of the memory on your box for those caches though: you'll want to leave at least 50% for your OS's disk cache, which will store the full row content. -Original Message- From: Weijun Li weiju...@gmail.com Sent: Tuesday, February 16, 2010 12:16pm To: cassandra-user@incubator.apache.org Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)? Thanks for the DataFileDirectory trick; I'll give it a try. Just noticed the impact of the number of data files: node A has 13 data files with a read latency of 20ms and node B has 27 files with a read latency of 60ms. After I ran nodeprobe compact on node B its read latency went up to 150ms. The read latency of node A became as low as 10ms. Is this normal behavior?
I'm using the random partitioner and the hardware/JVM settings are exactly the same for these two nodes. Another problem: Java heap usage always stays around 900MB out of 6GB. Is there any way to utilize all of the heap space to decrease the read latency? -Weijun On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote: On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote: One more thought about Martin's suggestion: is it possible to put the data files into multiple directories that are located on different physical disks? This should help with the i/o bottleneck issue. Yes, you can already do this, just add more DataFileDirectory directives pointed at multiple drives. Has anybody tested the row-caching feature in trunk (shoot for 0.6?)? Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random io can only be so fast, however. -Brandon
Testing row cache feature in trunk: write should put record in cache
Just started to play with the row cache feature in trunk: it seems to be working fine so far, except that for the RowsCached parameter you need to specify a number of rows rather than a percentage (e.g., 20% doesn't work). Thanks for this great feature, which improves read latency dramatically so that disk i/o is no longer a serious bottleneck. The problem is: when you write to Cassandra it doesn't seem to put new keys in the row cache (it is said to update, rather than invalidate, an entry that is already in the cache). Is it easy to implement this feature? Which classes would have to be touched? I'm guessing that RowMutationVerbHandler would be the one to insert the entry into the row cache? -Weijun
Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Yes, my KeysCachedFraction is already 0.3 but it doesn't relieve the disk i/o. I also compacted the data into a single 60GB file (it took quite a while to finish and increased latency as expected), but that doesn't help much either. If I set KCF to 1 (meaning cache the whole sstable index), how much memory will it take for 50mil keys? Is the index a straight key-offset map? I guess a key is 16 bytes and an offset is 8 bytes. Will KCF=1 help to reduce disk i/o? -Weijun On Tue, Feb 16, 2010 at 5:18 PM, Jonathan Ellis jbel...@gmail.com wrote: Have you tried increasing KeysCachedFraction? On Tue, Feb 16, 2010 at 6:15 PM, Weijun Li weiju...@gmail.com wrote: Still have high read latency with 50mil records in the 2-node cluster (replica 2). I restarted both nodes but read latency is still above 60ms and disk i/o saturation is high. Tried compact and repair but doesn't help much. When I reduced the client threads from 15 to 5 it looks a lot better but throughput is kind of low. I changed using flushing thread of 16 instead the defaulted 8, could that cause the disk saturation issue? For benchmark with decent throughput and latency, how many client threads do they use? Can anyone share your storage-conf.xml in well-tuned high volume cluster? -Weijun On Tue, Feb 16, 2010 at 10:31 AM, Stu Hood stu.h...@rackspace.com wrote: After I ran nodeprobe compact on node B its read latency went up to 150ms. The compaction process can take a while to finish... in 0.5 you need to watch the logs to figure out when it has actually finished, and then you should start seeing the improvement in read latency. Is there any way to utilize all of the heap space to decrease the read latency? In 0.5 you can adjust the number of keys that are cached by changing the 'KeysCachedFraction' parameter in your config file. In 0.6 you can additionally cache rows.
You don't want to use up all of the memory on your box for those caches though: you'll want to leave at least 50% for your OS's disk cache, which will store the full row content. -Original Message- From: Weijun Li weiju...@gmail.com Sent: Tuesday, February 16, 2010 12:16pm To: cassandra-user@incubator.apache.org Subject: Re: Cassandra benchmark shows OK throughput but high read latency ( 100ms)? Thanks for for DataFileDirectory trick and I'll give a try. Just noticed the impact of number of data files: node A has 13 data files with read latency of 20ms and node B has 27 files with read latency of 60ms. After I ran nodeprobe compact on node B its read latency went up to 150ms. The read latency of node A became as low as 10ms. Is this normal behavior? I'm using random partitioner and the hardware/JVM settings are exactly the same for these two nodes. Another problem is that Java heap usage is always 900mb out of 6GB? Is there any way to utilize all of the heap space to decrease the read latency? -Weijun On Tue, Feb 16, 2010 at 10:01 AM, Brandon Williams dri...@gmail.com wrote: On Tue, Feb 16, 2010 at 11:56 AM, Weijun Li weiju...@gmail.com wrote: One more thoughts about Martin's suggestion: is it possible to put the data files into multiple directories that are located in different physical disks? This should help to improve the i/o bottleneck issue. Yes, you can already do this, just add more DataFileDirectory directives pointed at multiple drives. Has anybody tested the row-caching feature in trunk (shoot for 0.6?)? Row cache and key cache both help tremendously if your read pattern has a decent repeat rate. Completely random io can only be so fast, however. -Brandon
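A back-of-envelope answer to the index-memory question in this thread, using the email's own assumptions (16-byte key plus 8-byte offset per entry; a real Java map would add per-entry object overhead on top of this raw figure):

```java
// Rough estimate of the memory needed to cache the full sstable index
// with KeysCachedFraction = 1, under the assumptions stated above.
public class IndexCacheEstimate {
    static long estimateBytes(long keys, int keyBytes, int offsetBytes) {
        return keys * (keyBytes + offsetBytes);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(50000000L, 16, 8);
        // ~1.2e9 bytes raw, i.e. a bit over 1GB before JVM object overhead
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

So with a 6GB heap, caching all 50 million index entries is plausible memory-wise, though (as Stu notes above) leaving RAM for the OS disk cache matters more for row data.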
Re: Testing row cache feature in trunk: write should put record in cache
Just tried to make a quick change to enable it, but it didn't work out :-(

    ColumnFamily cachedRow = cfs.getRawCachedRow(mutation.key());

    // What I modified
    if (cachedRow == null)
    {
        cfs.cacheRow(mutation.key());
        cachedRow = cfs.getRawCachedRow(mutation.key());
    }

    if (cachedRow != null)
        cachedRow.addAll(columnFamily);

How can I open a ticket for you to make the change (enable row cache write-through with an option)? Thanks, -Weijun

On Tue, Feb 16, 2010 at 5:20 PM, Jonathan Ellis jbel...@gmail.com wrote: On Tue, Feb 16, 2010 at 7:17 PM, Jonathan Ellis jbel...@gmail.com wrote: On Tue, Feb 16, 2010 at 7:11 PM, Weijun Li weiju...@gmail.com wrote: Just started to play with the row cache feature in trunk: it seems to be working fine so far except that for the RowsCached parameter you need to specify a number of rows rather than a percentage (e.g., 20% doesn't work). 20% works, but it's 20% of the rows at server startup. So on a fresh start that is zero. Maybe we should just get rid of the % feature... (Actually, it shouldn't be hard to update this on flush, if you want to open a ticket.)
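For illustration, the two behaviors being discussed (update a row only if it is already cached, versus write-through that populates the cache on every write) can be sketched with a plain map. This is a standalone sketch, not Cassandra's actual ColumnFamilyStore API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Standalone sketch of the two row-cache write policies from the thread.
public class RowCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Current behavior: a write updates the cached row only if that row
    // is already present; a miss leaves the cache untouched.
    void applyUpdateOnly(String key, String row) {
        cache.computeIfPresent(key, (k, old) -> row);
    }

    // Proposed write-through behavior: every write inserts/overwrites,
    // so freshly written keys are immediately readable from cache.
    void applyWriteThrough(String key, String row) {
        cache.put(key, row);
    }

    String get(String key) {
        return cache.get(key);
    }
}
```

The trade-off: write-through warms the cache for read-after-write workloads, but a write-heavy workload can evict hot read rows with rows that are never read back.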
RE: Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
It seems that read latency is sensitive to the number of threads (or thrift clients): after reducing the number of threads to 15, read latency decreased to ~20ms. The other problem is: if I keep mixed writes and reads (e.g., 8 write threads plus 7 read threads) running against the 2-node cluster continuously, the read latency goes up gradually (along with the size of the Cassandra data files), and eventually it reaches ~40ms (up from ~20ms) even with only 15 threads. During this process the data files grew from 1.6GB to over 3GB, even though I kept writing the same key/values to Cassandra. It seems that Cassandra keeps appending to sstable data files and only cleans them up during node cleanup or compaction (please correct me if this is incorrect). Here are my test settings: JVM Xmx: 6GB. KCF: 0.3. Memtable: 512MB. Number of records: 1 million (payload is 1000 bytes). I used JMX and iostat to watch the cluster but can't find any clue to the increasing read latency: JVM memory, GC, CPU usage, tpstats and io saturation all seem to be clean. One exception is that the wait time in iostat spikes once in a while, but it is small most of the time. Another thing I noticed is that the JVM doesn't use more than 1GB of memory (out of the 6GB I specified) even though I set KCF to 0.3 and increased the memtable size to 512MB. Did I miss anything here? How can I diagnose this kind of increasing read latency issue? Is there any performance tuning guide available? Thanks, -Weijun -Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Sunday, February 14, 2010 6:22 PM To: cassandra-user@incubator.apache.org Subject: Re: Cassandra benchmark shows OK throughput but high read latency (> 100ms)? are you i/o bound? what is your on-disk data set size? what does iostat tell you? http://spyced.blogspot.com/2010/01/linux-performance-basics.html do you have a lot of pending compactions? (tpstats will tell you) have you increased KeysCachedFraction?
On Sun, Feb 14, 2010 at 8:18 PM, Weijun Li weiju...@gmail.com wrote: Hello, I saw some Cassandra benchmark reports mentioning read latency that is less than 50ms or even 30ms. But my benchmark with 0.5 doesn't seem to support that. Here's my settings: Nodes: 2 machines. 2x2.5GHZ Xeon Quad Core (thus 8 cores), 8GB RAM ReplicationFactor=2 Partitioner=Random JVM Xmx: 4GB Memory table size: 512MB (haven't figured out how to enable binary memtable so I set both memtable number to 512mb) Flushing threads: 2-4 Payload: ~1000 bytes, 3 columns in one CF. Read/write time measure: get startTime right before each Java thrift call, transport objects are pre-created upon creation of each thread. The result shows that total write throughput is around 2000/sec (for 2 nodes in the cluster) which is not bad, and read throughput is just around 750/sec. However for each thread the average read latency is more than 100ms. I'm running 100 threads for the testing and each thread randomly pick a node for thrift call. So the read/sec of each thread is just around 7.5, meaning duration of each thrift call is 1000/7.5=133ms. Without replication the cluster write throughput is around 3300/s, and read throughput is around 1400/s, so the read latency is still around 70ms without replication. Is there anything wrong in my benchmark test? How can I achieve a reasonable read latency ( 30ms)? Thanks, -Weijun
Cassandra benchmark shows OK throughput but high read latency (> 100ms)?
Hello, I saw some Cassandra benchmark reports mentioning read latencies of less than 50ms or even 30ms, but my benchmark with 0.5 doesn't seem to support that. Here are my settings: Nodes: 2 machines, 2x2.5GHZ Xeon Quad Core (thus 8 cores), 8GB RAM. ReplicationFactor=2. Partitioner=Random. JVM Xmx: 4GB. Memtable size: 512MB (haven't figured out how to enable the binary memtable, so I set both memtable numbers to 512MB). Flushing threads: 2-4. Payload: ~1000 bytes, 3 columns in one CF. Read/write time measurement: get startTime right before each Java thrift call; transport objects are pre-created upon creation of each thread. The result shows that total write throughput is around 2000/sec (for the 2 nodes in the cluster), which is not bad, but read throughput is only around 750/sec, and for each thread the average read latency is more than 100ms. I'm running 100 threads for the test and each thread randomly picks a node for its thrift calls. So the reads/sec of each thread is just around 7.5, meaning the duration of each thrift call is 1000/7.5 = 133ms. Without replication the cluster write throughput is around 3300/s and read throughput is around 1400/s, so read latency is still around 70ms without replication. Is there anything wrong in my benchmark test? How can I achieve a reasonable read latency (< 30ms)? Thanks, -Weijun
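The latency arithmetic in the post (750 reads/sec spread over 100 client threads gives ~7.5 reads/sec/thread, i.e. ~133ms per call) can be checked with a small helper; this is just the arithmetic from the email, not a real benchmark harness:

```java
// Derive per-call latency from aggregate throughput and thread count,
// as done informally in the benchmark post above.
public class LatencyProbe {
    // reads/sec seen by one thread when totalRate is split across threads
    static double perThreadRate(double totalRate, int threads) {
        return totalRate / threads;
    }

    // average call duration in ms implied by a per-thread rate
    static double latencyMs(double perThreadRate) {
        return 1000.0 / perThreadRate;
    }

    public static void main(String[] args) {
        double rate = perThreadRate(750.0, 100); // 7.5 reads/sec/thread
        System.out.printf("%.0f ms%n", latencyMs(rate)); // ~133 ms
    }
}
```

This also shows why reducing client threads "improves" per-call latency in a saturated system: the same aggregate throughput divided over fewer threads means each thread spends less time queued behind the others.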
RackAwareStrategy - add the third datacenter to live cluster with replication factor 3
Hello, I have a testing cluster with: A (dc1), B (dc1), C (dc2), D (dc2). The replication factor is 2, so I assume each DC will have a complete copy of the data. I'm using PropertyFileEndPointSnitch with rack.properties for the dc and rack settings. So, what are the steps to add another datacenter and increase the replication factor to 3, to ensure that dc3 will also get a complete copy of the data? Meaning each of these 3 DCs will have a complete copy of the data and they keep synchronizing new changes with each other. What I'm guessing is: 1) Increase the replication factor of A/B/C/D to 3, modify their rack.properties to include E (dc3) and F (dc3), then restart them one by one. At this point E and F haven't been started yet. 2) Bootstrap E and F (both from dc3) to join the cluster. In this case, will Cassandra automatically put the 3rd replica on E and F? Thanks, -Weijun P.S. Here is what the Cassandra documentation says about dc replication, but I'm not sure what will happen when you join nodes from the 3rd dc: - RackAwareStrategy: replica 2 is placed on the first node along the ring that belongs in *another* data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the *same* rack as the first
nodeprobe flush not implemented in 0.5?
Hello, I tried to run nodeprobe flush but it displays the usage info without doing anything. What is the list of supported commands for nodeprobe? Thanks, -Weijun
Rebalance after adding new nodes
When you add a new node, Cassandra will pick the node that has the most data and split its token range. In this case the data distribution among the nodes becomes uneven. What are the right strategy/steps to rebalance the node load after adding new nodes? Here's one example: I have a cluster of nodes A, B, C, D. Now I want to add E and F; after adding the nodes, the data distribution will change from 1/1/1/1 to 1/1/0.5/0.5/0.5/0.5, is this correct? Thanks, -Weijun
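One common way to avoid the uneven split described above (an approach assumed here, not something prescribed in this thread) is to assign each node an explicit initial token, spacing tokens evenly over the RandomPartitioner's 0..2^127 range instead of letting bootstrap halve the busiest node's range:

```java
import java.math.BigInteger;

// Evenly spaced tokens for N nodes on the RandomPartitioner ring,
// whose token space spans 0 .. 2^127: token(i) = i * 2^127 / N.
public class TokenCalc {
    static BigInteger token(int i, int nodeCount) {
        return BigInteger.valueOf(2).pow(127)
                .multiply(BigInteger.valueOf(i))
                .divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        int nodes = 6; // A, B, C, D, E, F from the example above
        for (int i = 0; i < nodes; i++)
            System.out.println("node " + i + ": " + token(i, nodes));
    }
}
```

Moving existing nodes onto the recomputed tokens (rather than only assigning tokens to the new nodes) is what restores the even 1/1/1/1/1/1 distribution.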
nodeprobe freezes when connecting to remote cassandra node
Hello, got one more issue: when I was trying to run nodeprobe to connect to a remote cassandra node, it froze for a while and then showed the following error. The jmxremote port 8080 is open, and I tried changing the port but it doesn't help. The command works properly if I run it on the same machine as the node (thus localhost). Thanks, -Weijun

bin/nodeprobe --host [hostname] --port 8080 ring
Error connecting to remote JMX agent!
java.rmi.ConnectException: Connection refused to host: 10.xxx.xxx.xxx; nested exception is:
        java.net.ConnectException: Operation timed out
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:601)
        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198)
        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:110)
        at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
        at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2327)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:279)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:153)
        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:115)
        at org.apache.cassandra.tools.NodeProbe.main(NodeProbe.java:514)
Caused by: java.net.ConnectException: Operation timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
        at java.net.Socket.connect(Socket.java:525)
        at java.net.Socket.connect(Socket.java:475)
        at java.net.Socket.<init>(Socket.java:372)
        at java.net.Socket.<init>(Socket.java:186)
        at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22)
        at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128)
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595)
        ... 10 more
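One likely cause, reading the trace: the JMX registry on port 8080 answered, but the RMI server stub it handed back advertised an address (here 10.xxx.xxx.xxx) that the client cannot reach, so the second hop times out. Starting the node's JVM with -Djava.rmi.server.hostname=<reachable address> is the usual fix for this RMI pitfall; that diagnosis is an assumption on my part, not something confirmed in the thread. For reference, this sketch builds the same shape of JMX service URL that nodeprobe dials (the host is a hypothetical placeholder):

```java
import javax.management.remote.JMXServiceURL;

// Build the two-hop RMI service URL used for remote JMX: the client first
// contacts the registry (host:port below), then connects to whatever
// address the returned RMI stub advertises -- which is where this hangs.
public class JmxUrl {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://10.1.2.3:8080/jmxrmi");
        System.out.println(url);
    }
}
```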