high level of MemtablePostFlusher pending events
5-node cluster running 1.0.2, doing about 1300 reads and 1300 writes/sec into 3 column families in the same keyspace. 2 client machines are doing about the same volume of reads/writes, but one has an average response time in the 4-40ms range and the other in the 200-800ms range. Both run identical software, homebrew with the hector-1.0-3 client. Traffic was peaking at 6k reads and 6k writes/sec according to reporting from our software, and now it's topping out at 1300/sec each. The CPUs on the cassandra boxes are bored: none of the threads within cassandra are chewing more than 3% cpu, and disk is only 10% full on the most loaded box.

nodetool tpstats shows:

Pool Name            Active  Pending  Completed
MemtablePostFlusher  1       102      36

Not all servers have the same number of pending tasks; they have 0, 1, 17, 37, and 105. It looks like it's stuck and not recovering, because it's been like this for an hour. I've attached the end of the cassandra.log from the server with the most pending tasks; there are some interesting exceptions in there. As always, help is appreciated! :p
Re: Replicate On Write behavior
They are evenly distributed: 5 nodes * 40 connections each using hector, and I can confirm that all 200 were active when this happened (from hector's perspective, by graphing the hector JMX data), that all 5 nodes saw roughly 40 connections, and that all were receiving traffic over those connections (netstat + ntop + trafshow, etc.).

I can also confirm that I changed my insert strategy to break up the rows using composite row keys, which reduced the row lengths and gave me an almost perfectly even data distribution among the nodes. That was when I started to really dig into why the ROWs (ReplicateOnWrites) were still backing up on one node specifically, and why 2 nodes weren't seeing any. It was the 20%, 20%, 60% ROW distribution that really got me thinking, and when I took the 60% node out of the cluster, that ROW load jumped back to the node with the next-lowest IP address, and the 2 nodes that weren't seeing any *still* weren't seeing any ROWs. At that point I tore down the cluster and recreated it as a 3-node cluster several times, using various permutations of the 5 nodes available, and the ROW load was *always* on the node with the lowest IP address. The theory might not be right, but it certainly matches the behavior I saw.

On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

We'll solve #2890, and we should have done it sooner. That being said, a quick question: how do you do your inserts from the clients? Are you evenly distributing the inserts among the nodes, or are you always hitting the same coordinator? Because provided the nodes are correctly distributed on the ring, if you distribute the insert (increment) requests across the nodes (again, I'm talking of client requests), you should not see the behavior you observe. -- Sylvain

On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne dha...@gmx.3crowd.com wrote:

It was exactly due to 2890, and the fact that the first replica is always the one with the lowest-value IP address. I patched cassandra to pick a random node out of the replica set in StorageProxy.java findSuitableEndpoint:

    Random rng = new Random();
    return endpoints.get(rng.nextInt(endpoints.size())); // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the inserts/sec throughput.

Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite load of a counter insert: one node will get RF/n of the disk work, and two nodes will always get 0 disk work. In a 3-node cluster, 1 node gets its disk hit really hard; you get the performance of a one-node cluster. In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the performance of a ~2-node cluster. In a 10-node cluster, 1 node gets 30% of the disk work, giving you the performance of a ~3-node cluster. I confirmed this behavior with 3, 4, and 5 node cluster sizes.

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...
Address     DC           Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and last node both have a count of 0. This is a clean cluster, and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
Re: Disable hector stats
I do it with a log4j properties file:

    log4j.appender.null=org.apache.log4j.varia.NullAppender
    log4j.category.me.prettyprint.cassandra.hector.TimingLogger=INFO, null
    log4j.additivity.me.prettyprint.cassandra.hector.TimingLogger=false

On Sep 9, 2011, at 2:07 PM, Daning wrote:

Hi, how do we disable hector stats? We keep getting this in the log:

(PeriodicalLog.java:221) INFO Thread-53040 2011-09-09 13:24:03,290 Statistics from Fri Sep 09 13:23:03 PDT 2011 to Fri Sep 09 13:24:03 PDT 2011
(PeriodicalLog.java:221) INFO Thread-53040 2011-09-09 13:24:03,291 Tag  Avg(ms)  Min  Max  Std Dev  95th  Count
(PeriodicalLog.java:221) INFO Thread-53040 2011-09-09 13:24:03,291 (

We have tried to set log4j like this, but that does not work:

    log4j.logger.com.ecyrd.speed4j.log.PeriodicalLog=ERROR

Thanks, Daning
Re: Replicate On Write behavior
It was exactly due to 2890, and the fact that the first replica is always the one with the lowest-value IP address. I patched cassandra to pick a random node out of the replica set in StorageProxy.java findSuitableEndpoint:

    Random rng = new Random();
    return endpoints.get(rng.nextInt(endpoints.size())); // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the inserts/sec throughput.

Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite load of a counter insert: one node will get RF/n of the disk work, and two nodes will always get 0 disk work. In a 3-node cluster, 1 node gets its disk hit really hard; you get the performance of a one-node cluster. In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the performance of a ~2-node cluster. In a 10-node cluster, 1 node gets 30% of the disk work, giving you the performance of a ~3-node cluster. I confirmed this behavior with 3, 4, and 5 node cluster sizes.

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address     DC           Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and last node both have a count of 0. This is a clean cluster, and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
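For readers who want the change in context, here is a self-contained sketch of the selection logic described above; the class and method names are illustrative stand-ins, not the actual StorageProxy code:

    import java.net.InetAddress;
    import java.util.List;
    import java.util.Random;

    // Sketch of the one-line patch quoted above, in a runnable form.
    public class RandomReplicaPick {
        private static final Random RNG = new Random();

        static InetAddress pickReplica(List<InetAddress> liveReplicas) {
            // Before the patch: liveReplicas.get(0), i.e. always the first
            // (lowest-IP) replica, which funnels every counter's
            // ReplicateOnWrite read to a single node.
            return liveReplicas.get(RNG.nextInt(liveReplicas.size()));
        }
    }

Picking uniformly at random spreads the replicate-on-write read load across the whole replica set instead of pinning it to the lowest-IP endpoint, which is the 2.5x throughput effect reported above.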
Solved: NoSuchMethodError with google guava/collections starting embedded cassandra service
I ran into this problem today. It's common enough that it shows up in google, but not common enough to have a documented resolution, so here's one.

[junit] com.google.common.collect.ImmutableSet.copyOf(Ljava/util/Collection;)Lcom/google/common/collect/ImmutableSet;
[junit] java.lang.NoSuchMethodError: com.google.common.collect.ImmutableSet.copyOf(Ljava/util/Collection;)Lcom/google/common/collect/ImmutableSet;
[junit]     at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:479)
[junit]     at org.apache.cassandra.db.DataTracker.replace(DataTracker.java:248)
[junit]     at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:219)
[junit]     at org.apache.cassandra.db.ColumnFamilyStore.init(ColumnFamilyStore.java:294)
[junit]     at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:466)
[junit]     at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:436)
[junit]     at org.apache.cassandra.db.Table.initCf(Table.java:369)
[junit]     at org.apache.cassandra.db.Table.init(Table.java:306)
[junit]     at org.apache.cassandra.db.Table.open(Table.java:111)
[junit]     at org.apache.cassandra.db.SystemTable.checkHealth(SystemTable.java:212)
[junit]     at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:149)
[junit]     at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:237)
[junit]     at org.apache.cassandra.service.EmbeddedCassandraService.start(EmbeddedCassandraService.java:57)

Resolution: I was creating a monolithic jar file that takes the contents of all component jar files and puts them into one big jar, to simplify deployment of the cassandra client I'm writing. One of the component jars was itself a monolithic jar and had the contents of google-collections in it. guava is a superset of google-collections, and the stale google-collections classes (which predate the copyOf(Collection) overload) were overwriting the expansion of the guava jar, hence the NoSuchMethodError. Removing the component jar file (which was an option in this case) solved the problem.
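When removing the offending jar isn't an option, one way to see which copy of a class actually won the merge is to ask the JVM where it was loaded from. A minimal diagnostic sketch, using the class from the stack trace above:

    import com.google.common.collect.ImmutableSet;

    // Diagnostic sketch: print which jar ImmutableSet was actually loaded
    // from, to see whether guava or a stray google-collections copy won.
    public class WhichJar {
        public static void main(String[] args) {
            System.out.println(ImmutableSet.class.getProtectionDomain()
                    .getCodeSource().getLocation());
        }
    }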
Re: Replicate On Write behavior
That's interesting. I did an experiment wherein I added some entropy to the row name based on the time when the increment came in (e.g. row = row + "/" + (timestamp - (timestamp % 300))), and now not only is the load (in GB) on my cluster more balanced, but the performance has stayed steady (inserts/sec) instead of decaying, with a relatively low average ms/insert. Each row is now significantly shorter as a result of this change.

On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:

On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:

I'm curious... digging through the source, it looks like replicate on write triggers a read of the entire row, and not just the columns/supercolumns that are affected by the counter update. Is this the case? It would certainly explain why my inserts/sec decay over time and why the average insert latency increases over time. The strange thing is that I'm not seeing disk read IO increase over that same period, but that might be due to the OS buffer cache...

It does not. It only reads the columns/supercolumns affected by the counter update. In the source, this happens in CounterMutation.java. If you look at addReadCommandFromColumnFamily, you'll see that it does a query by name only for the columns involved in the update (the update is basically the content of the columnFamily parameter there). And Cassandra does *not* always read a full row. Never has, never will.

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address     DC           Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and last node both have a count of 0. This is a clean cluster, and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
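A sketch of the time-bucketed key scheme described above; the method and variable names are assumptions, since the message only gives the key expression:

    // Sketch of the row-key time bucketing described above: append a
    // 300-second bucket to the logical key so long rows get split up and
    // increments spread across more of the ring. Names are illustrative.
    public class BucketedKey {
        static String rowKey(String logicalKey, long epochSeconds) {
            long bucket = epochSeconds - (epochSeconds % 300); // 5-minute bucket
            return logicalKey + "/" + bucket;
        }
    }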
JMX TotalReadLatencyMicros sanity check
I've graphed the rate of change of the TotalReadLatencyMicros counter over the last 12 hours, and divided by 1,000,000 to convert it to seconds. I'm sampling it every 10 seconds, so I divided by another 10 to get per-second rates. The result is that I have a CF doing 10 seconds of read latency *every second*. Does that make sense? If I divide it by the number of reads done, it matches up with the latency I'm seeing from cfstats: 1.5ms/read.
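For concreteness, a sketch of that arithmetic with illustrative names. TotalReadLatencyMicros is a running sum of per-read latencies, so the per-second figure can exceed one second of latency per wall-clock second whenever reads overlap:

    // Sketch of the rate arithmetic described above, from two JMX samples.
    public class ReadLatencyRate {
        static void report(long prevMicros, long nowMicros,
                           long readsDelta, double sampleIntervalSec) {
            long deltaMicros = nowMicros - prevMicros;
            double readSecPerSec = deltaMicros / 1e6 / sampleIntervalSec;
            double avgMsPerRead = (deltaMicros / 1000.0) / readsDelta;
            System.out.printf("%.1f s of read latency per second, %.2f ms/read%n",
                    readSecPerSec, avgMsPerRead);
        }

        public static void main(String[] args) {
            // Numbers consistent with the post: ~6,670 reads/sec at 1.5 ms
            // each is about 10 s of summed latency per wall-clock second.
            report(0L, 100000000L, 66700L, 10.0);
        }
    }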
Re: Replicate On Write behavior
Does it always pick the node with the lowest IP address? All of my hosts are in the same /24, and the fourth node in the 5-node cluster has the lowest value in the 4th octet (54). I erased the cluster and rebuilt it from scratch as a 3-node cluster using the first 3 nodes, and now the ReplicateOnWrites are all going to the third node, which again has the lowest-valued IP address (55). That would explain why only 1 node gets writes in a 3-node cluster (RF=3), why 3 nodes get writes in a 5-node cluster, and why one of those 3 is taking 66% of the writes.

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address     DC           Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and last node both have a count of 0. This is a clean cluster, and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
Replicate On Write behavior
I'm curious... digging through the source, it looks like replicate on write triggers a read of the entire row, and not just the columns/supercolumns that are affected by the counter update. Is this the case? It would certainly explain why my inserts/sec decay over time and why the average insert latency increases over time. The strange thing is that I'm not seeing disk read IO increase over that same period, but that might be due to the OS buffer cache...

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address     DC           Rack   Status  State   Load       Owns     Token
                                                                    136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  2.26 GB    20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  383.25 MB  20.00%   136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and last node both have a count of 0. This is a clean cluster, and I've been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.
Re: nodetool tpstats feature request
watch doesn't calculate diffs.

On Aug 31, 2011, at 2:29 PM, Peter Sanford wrote:

I use `watch` to do this:

    watch -n 5 nodetool -h localhost tpstats

-psanford

On Wed, Aug 31, 2011 at 1:59 PM, David Hawthorne dha...@gmx.3crowd.com wrote:

It would be very useful to be able to get refreshing statistics from tpstats, a la top:

    nodetool -h localhost tpstats [n]

Refresh every n seconds; show the Active, Pending, and Blocked columns as they currently exist, but under Completed show a per-second rate based on the delta from the previous counter and the number of seconds (n) specified on the command line. For example:

    nodetool -h localhost tpstats 1    # update every second
    nodetool -h localhost tpstats 5    # update every 5 seconds, etc.
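Until something like this lands in nodetool, the requested behavior can be approximated with a small JMX poller. A sketch only: the port (7199) and the MBean/attribute names (org.apache.cassandra.request:type=MutationStage, CompletedTasks) are assumptions based on the 0.8/1.0-era JMX layout and should be verified against your version:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Sketch: poll one thread pool's CompletedTasks over JMX and print a
    // per-second rate, approximating the requested nodetool behavior.
    public class TpstatsRate {
        public static void main(String[] args) throws Exception {
            int intervalSec = args.length > 0 ? Integer.parseInt(args[0]) : 5;
            JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi"));
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // MBean name and attribute assumed from the 0.8/1.0-era layout.
            ObjectName pool = new ObjectName(
                    "org.apache.cassandra.request:type=MutationStage");
            long prev = (Long) mbs.getAttribute(pool, "CompletedTasks");
            while (true) {
                Thread.sleep(intervalSec * 1000L);
                long curr = (Long) mbs.getAttribute(pool, "CompletedTasks");
                System.out.printf("MutationStage: %.1f completed/sec%n",
                        (curr - prev) / (double) intervalSec);
                prev = curr;
            }
        }
    }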
randompartitioner cluster unbalanced
$ ./nodetool -h localhost ring

Address     DC           Rack   Status  State   Load      Owns     Token
                                                                   136112946768375385385349842972707284580
10.0.0.57   datacenter1  rack1  Up      Normal  8.31 GB   20.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  13.7 GB   20.00%   34028236692093846346337460743176821145
10.0.0.55   datacenter1  rack1  Up      Normal  13.87 GB  20.00%   68056473384187692692674921486353642290
10.0.0.54   datacenter1  rack1  Up      Normal  8.03 GB   20.00%   102084710076281539039012382229530463435
10.0.0.72   datacenter1  rack1  Up      Normal  1.77 GB   20.00%   136112946768375385385349842972707284580

This is a brand new cluster we just brought up and started loading data into a few days ago. It's using the RandomPartitioner, RF=3 on everything, and we're doing QUORUM writes. All keyspaces and CFs are for counter super columns. All keys are moderately sized ascii strings with good variation between them, all supercolumn names are longs, and all column names are ascii strings. No decrements are done, no rows or columns are deleted, and read load is almost nonexistent. Column values may get overwritten on account of being incremented, because they are counters; this is expected to happen quite a bit. Not all rows are the same length.

Insert latency from my hector client box to the cluster averages 70ms - 200ms, which is really high. Inserts/sec from hector's perspective peaks at 750/sec and consistently drops down to (and stays at) 120/sec. This is not due to compactions, based on the output of nodetool compactionstats. I wiped the cluster this afternoon, started from scratch, and I'm seeing the same distribution on a smaller scale, with the same latencies.

Going by statistics from cassandra via JMX, it looks like all hosts are getting about the same number of MutationStage completed tasks/sec. However, I see one host consistently has pending MutationStage and ReplicateOnWriteStage tasks (50/30 respectively to 211/42 respectively, throughout the day). Now, I know that ReplicateOnWrite can go really slowly if you have large SuperColumns, but I do not. I'm working on proving that at the moment, pending a couple of code pushes.

This same box typically runs CPU up around 600-700%, and it's all user-space cpu, not IO wait. We monitor these boxes like crazy, and we've tweaked them a bit to try to rule things out (enabling mmap'd IO, disabling swap, mounting ext4 with noatime), none of which has made a single bit of difference. If I kill cassandra on that one box, the load moves to the box before it in the ring (mutations and ROWs back up, and cpu jumps to 600%), ruling out bad hardware on this one box.

Heap memory usage sits at 600MB-2GB, and heap size is 4G on all 5 boxes. CPU usage and Mutations/ROWs are not affected by hector client connections: if I remove this single host from the hector configuration and confirm that there are 0 connections from my client to this one box, I still see high Mutations and ROWs and CPU usage. If I increase the number of client connections in the hector pool, performance does not change.

concurrent_writes is set at 48, concurrent_reads at 32, and each box has 8 cores. The memtable flush size is 28MB and the flush threshold in operations is 131k; our memtables flush every 3 minutes (based on graphs, and this aligns exactly with 131k / (mutations/sec each box is doing)). Commitlog and data are on the same disk, but our disks seem bored. The key cache is enabled and I see an almost perfect 100% hit rate; the row cache is disabled.

My questions are: is it normal to see load unevenly spread out when using RandomPartitioner? How do I fix it?
Do I need to assign token ranges manually even with RandomPartitioner? Is there a way to see the total row counts assigned to each box? Why is this one host running 600% cpu while the rest are sitting at 0%?

For reference, here's cfstats taken from the host with the high cpu usage:

Keyspace: STATS_TEST
    Read Count: 18744838
    Read Latency: 2.568355930309987 ms.
    Write Count: 18744845
    Write Latency: 0.020453476835898085 ms.
    Pending Tasks: 0
        Column Family: rollup1h
        SSTable count: 4
        Space used (live): 194724367
        Space used (total): 260574143
        Number of Keys (estimate): 11904
        Memtable Columns Count: 34708
        Memtable Data Size: 27280700
        Memtable Switch Count: 67
        Read Count: 9255646
        Read Latency: 2.498 ms.
        Write Count: 9255658
        Write Latency: 0.021 ms.
        Pending Tasks: 0
ReplicateOnWrite issues
This is now my fourth attempt to get the message through; apologies if you see multiple copies. I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!

I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector, at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is RandomPartitioner, and RF=3. nodetool ring shows that actual data is well balanced:

Address     DC           Rack   Status  State   Load     Owns     Token
                                                                  127605887595351923798765477786913079296
10.0.0.57   datacenter1  rack1  Up      Normal  1.59 GB  25.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
10.0.0.55   datacenter1  rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
10.0.0.54   datacenter1  rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296

I also tested against a 1-box test cluster with RF=1. Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so the performance is bounded by the one box owning the row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks. GC doesn't seem to be a factor: logs show GCs completing in 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.

I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time. Maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.

Here's the tpstats output from all 4 boxes.

Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):

4 $ ./nodetool -h localhost tpstats
Pool Name              Active  Pending  Completed
ReadStage              0       0        0
RequestResponseStage   0       0        59095840
MutationStage          0       0        31182942
ReadRepairStage        0       0        0
ReplicateOnWriteStage  32      928042   28246716
GossipStage            1       1        449464
AntiEntropyStage       0       0        0
MigrationStage         0       0        4
MemtablePostFlusher    0       0        445
StreamStage            0       0        0
FlushWriter            0       0        445
FILEUTILS-DELETE-POOL  0       0        38
MiscStage              0       0        0
FlushSorter            0       0        0
InternalResponseStage  0       0        4
HintedHandoff          0       0        0

Second highest cpu usage one:

3 $ ./nodetool -h localhost tpstats
Pool Name              Active  Pending  Completed
ReadStage              0       0        0
RequestResponseStage   0       0        27413910
MutationStage          0       0        27523094
ReadRepairStage        0       0        0
ReplicateOnWriteStage  0       0        1971
GossipStage            0       0        335422
AntiEntropyStage       0       0        0
MigrationStage         0       0        4
MemtablePostFlusher    0       0        66
StreamStage            0       0        0
FlushWriter            0       0        66
FILEUTILS-DELETE-POOL  0       0        31
MiscStage              0       0        0
FlushSorter            0       0        0
InternalResponseStage  0       0        4
HintedHandoff          0       0        0

The other 2 that see
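For context, the write pattern described above boils down to hammering counter increments at one long row through Hector. A sketch of that pattern; the cluster/keyspace/CF names are placeholders, and the incrementCounter call reflects my understanding of the Hector 0.8 API, so its exact signature is an assumption to verify against your version:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    // Sketch of the counter-increment load described above.
    public class CounterLoad {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("test", "10.0.0.57:9160");
            Keyspace ks = HFactory.createKeyspace("Counters", cluster);
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            // Every increment lands on the same row key, so the one node
            // owning that key does all the ReplicateOnWrite reads.
            for (int i = 0; i < 100000; i++) {
                m.incrementCounter("row-1", "counter_cf", "col-" + i, 1L);
            }
        }
    }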
Re: ReplicateOnWrite issues
It's definitely for counters, and some of the rows I'm inserting are long-ish, if 1.3MB is long. Maybe it would help if I said I was using counter super columns. I'm also writing to only a handful of rows at a time, until they are full. It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, might not be optimized for this kind of load. It looks like I may be able to turn off replicate_on_write for the CFs, as long as I use CL.ONE, to get around this in the short term, but I'd like to know what happens when replicate_on_write is true and when it is false, so I can make a more informed choice.

On Jul 12, 2011, at 12:20 AM, Yang wrote:

Interesting. First, just to make sure: since replicateOnWrite is for counters, you are using counters (you use the word "insert" instead of add/increment), right? If you are using counters, supposedly the leader runs replicateOnWrite; somehow all your adds find the one box as leader, and that's probably something worth investigating. Finally, you could try debugging into those replicateOnWrite tasks and see what they are doing exactly; maybe there is something immediately wrong.

On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne dha...@gmx.3crowd.com wrote:

This is now my fourth attempt to get the message through; apologies if you see multiple copies. I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!

I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector, at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is RandomPartitioner, and RF=3. nodetool ring shows that actual data is well balanced:

Address     DC           Rack   Status  State   Load     Owns     Token
                                                                  127605887595351923798765477786913079296
10.0.0.57   datacenter1  rack1  Up      Normal  1.59 GB  25.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
10.0.0.55   datacenter1  rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
10.0.0.54   datacenter1  rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296

I also tested against a 1-box test cluster with RF=1. Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so the performance is bounded by the one box owning the row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks. GC doesn't seem to be a factor: logs show GCs completing in 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.

I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time. Maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.
Here's the tpstats output from all 4 boxes.

Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):

4 $ ./nodetool -h localhost tpstats
Pool Name              Active  Pending  Completed
ReadStage              0       0        0
RequestResponseStage   0       0        59095840
MutationStage          0       0        31182942
ReadRepairStage        0       0        0
ReplicateOnWriteStage  32      928042   28246716
GossipStage            1       1        449464
AntiEntropyStage       0       0        0
MigrationStage         0       0        4
MemtablePostFlusher    0       0        445
StreamStage            0       0        0
FlushWriter            0       0        445
FILEUTILS-DELETE-POOL  0       0        38
MiscStage              0       0        0
FlushSorter            0       0        0
Re: ReplicateOnWrite issues
Well, I was using a large number of clients: I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down.

To clarify a few other things: when I say "inserts" I mean increments, as this test only involves counters. When I say a row is "full", I mean I've written all of the data into it that belongs in it from our application's perspective. Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.

I'm curious: is it necessary for a read to happen for replication when using RF=1? I am currently running a test against a single-node cluster with RF=1, CL.ONE, and 4 connections/concurrent increments, and graphing various data points. If something interesting happens, I'll send a graph.

When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting, or do you mean I should stagger the load amongst many CFs/rows at once? Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.

On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

When you do a counter increment at CL.ONE, the write is acknowledged as soon as the first replica getting the write has pushed the increment into its memtable. However, there is a read happening for the replication to the other replicas (this is necessary to the counter design). What is happening is that your nodes are not able to do those reads fast enough; or, another way to see it, you are writing faster than your cluster can handle. I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do something about that, but for now you'll have to stagger your writes. I'm more surprised that tasks on the replicate-on-write stage would stack up at QUORUM, unless you're using a very large number of clients. -- Sylvain

On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne dha...@gmx.3crowd.com wrote:

This is now my fourth attempt to get the message through; apologies if you see multiple copies. I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!

I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector, at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is RandomPartitioner, and RF=3. nodetool ring shows that actual data is well balanced:

Address     DC           Rack   Status  State   Load     Owns     Token
                                                                  127605887595351923798765477786913079296
10.0.0.57   datacenter1  rack1  Up      Normal  1.59 GB  25.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
10.0.0.55   datacenter1  rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
10.0.0.54   datacenter1  rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296

I also tested against a 1-box test cluster with RF=1. Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions.
High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so the performance is bounded by the one box owning the row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks. GC doesn't seem to be a factor: logs show GCs completing in 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.

I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time. Maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.

Here's the tpstats output from all 4 boxes.

Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):

4 $ ./nodetool -h localhost tpstats
Pool Name              Active  Pending  Completed
ReadStage              0       0        0
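One way to act on the "stagger your writes" advice above, pending CASSANDRA-2889, is a simple client-side rate cap. A sketch: the 2000/sec budget is an arbitrary illustration, and doIncrement is a hypothetical stand-in for the real client call:

    import java.util.concurrent.TimeUnit;

    // Sketch of client-side staggering: cap increments/sec so the
    // cluster's ReplicateOnWrite reads can keep up.
    public class ThrottledIncrements {
        public static void main(String[] args) throws InterruptedException {
            int targetPerSec = 2000; // assumed budget; tune empirically
            long intervalNanos = TimeUnit.SECONDS.toNanos(1) / targetPerSec;
            long next = System.nanoTime();
            for (int i = 0; i < 1000000; i++) {
                doIncrement(i);
                next += intervalNanos;
                long sleep = next - System.nanoTime();
                if (sleep > 0) TimeUnit.NANOSECONDS.sleep(sleep);
            }
        }

        private static void doIncrement(int i) {
            // placeholder for the actual counter increment
        }
    }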
Re: ReplicateOnWrite issues
On Jul 12, 2011, at 3:02 PM, Sylvain Lebresne wrote:

On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne dha...@gmx.3crowd.com wrote:

Well, I was using a large number of clients: I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down. To clarify a few other things: when I say "inserts" I mean increments, as this test only involves counters. When I say a row is "full", I mean I've written all of the data into it that belongs in it from our application's perspective. Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.

Oh yes, you're using supercolumns. That could help magnify the problem a little bit, given that to read any subcolumn of a supercolumn, we have to read the whole supercolumn. It depends how many counters per supercolumn we're talking about.

Does this imply that incrementing 50 counters in one supercolumn in one row in one CF requires 50 reads of the supercolumn, i.e. one read of the entire supercolumn for each increment?

I'm curious: is it necessary for a read to happen for replication when using RF=1?

No, it's not necessary with RF=1 ... but we still do it :(. OK, it's basically a bug; we'll fix that for 0.8.2.

I am currently running a test against a single-node cluster with RF=1, CL.ONE, and 4 connections/concurrent increments, and graphing various data points. If something interesting happens, I'll send a graph. When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting,

Yes, this is what I mean.

or do you mean I should stagger the load amongst many CFs/rows at once? Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.

On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

When you do a counter increment at CL.ONE, the write is acknowledged as soon as the first replica getting the write has pushed the increment into its memtable. However, there is a read happening for the replication to the other replicas (this is necessary to the counter design). What is happening is that your nodes are not able to do those reads fast enough; or, another way to see it, you are writing faster than your cluster can handle. I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do something about that, but for now you'll have to stagger your writes. I'm more surprised that tasks on the replicate-on-write stage would stack up at QUORUM, unless you're using a very large number of clients. -- Sylvain

On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne dha...@gmx.3crowd.com wrote:

This is now my fourth attempt to get the message through; apologies if you see multiple copies. I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!

I have a 4-host test cluster running 0.8.1 that I'm writing counters into with hector, at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is RandomPartitioner, and RF=3.
nodetool ring shows that actual data is well balanced:

Address     DC           Rack   Status  State   Load     Owns     Token
                                                                  127605887595351923798765477786913079296
10.0.0.57   datacenter1  rack1  Up      Normal  1.59 GB  25.00%   0
10.0.0.56   datacenter1  rack1  Up      Normal  1.39 GB  25.00%   42535295865117307932921825928971026432
10.0.0.55   datacenter1  rack1  Up      Normal  1.24 GB  25.00%   85070591730234615865843651857942052864
10.0.0.54   datacenter1  rack1  Up      Normal  1.34 GB  25.00%   127605887595351923798765477786913079296

I also tested against a 1-box test cluster with RF=1. Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so the performance is bounded by the one box owning the row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks. GC doesn't seem to be a factor. Logs show GCs completing in 1s on all 4 boxes in the test cluster
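To put a rough number on the supercolumn amplification discussed in this exchange: assuming the supercolumn already holds all 50 counters and each increment triggers one whole-supercolumn read on the replicate-on-write path (per the explanation above), the back-of-envelope cost is:

    // Back-of-envelope read amplification, under the stated assumptions.
    public class SupercolumnAmplification {
        public static void main(String[] args) {
            int counters = 50;    // subcolumns already in the supercolumn
            int increments = 50;  // one increment per counter
            System.out.println((long) counters * increments
                    + " subcolumn reads for 50 increments"); // 2500
        }
    }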
possible 'coming back to life' bug with counters
Please take a look at this thread over in the hector-users mailing list: http://groups.google.com/group/hector-users/browse_thread/thread/99835159b9ea1766

It looks as if deleted columns are coming back to life when they shouldn't be. I don't want to open a bug if one already exists that I simply missed when I scanned the list of open bugs. I'm using hector 0.8 against the cassandra 0.8 release. I can give you whatever logs or files you'd like.
Re: RTG/MRTG/Cricket replacement using Cassandra?
Have you looked at graphite? It would be very cool to see graphite using cassandra as a backend, and then to have cacti feeding data into cassandra.

On May 20, 2011, at 4:48 PM, Edward Capriolo wrote:

The first love of my open source life was cacti. I am going to discuss with them porting some of the system to cassandra.

On Friday, May 20, 2011, yangyangyyy tedd...@gmail.com wrote:

Hi Ryan, thanks for the link. I read the slides; could you please provide some more details on how the temporal aggregation is implemented? Do you use time+granularity as the key, or as column names? Thanks, Yang

In reply to this post by Aaron Turner:

We have a solution for time series data on cassandra at Twitter that we'd like to open source, but it requires 0.8/trunk, so we're not going to release it until that's stable. See http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011 -ryan

On Thu, Mar 31, 2011 at 3:56 PM, Aaron Turner [hidden email] wrote:

I've been looking at replacing our PostgreSQL backend for RTG (an SNMP-based polling and graphing solution for network traffic/ports) with something using Cassandra, in order to solve our scalability and redundancy requirements. Based on a lot of what I've read, Cassandra is an ideal data store for this time series data; in fact, Eric Evans, in his presentation on the Cassandra home page, suggests that this kind of use case is perfect for Cassandra. So this got me wondering if someone had already come up with a CF model for this kind of data, including daily/weekly/monthly/yearly rollups. Perhaps there's even an open source project or two implementing this sort of thing? I've found flewton (https://github.com/flewton/flewton), which is possibly relevant, but my Java skills are pretty non-existent so I'm having a hard time figuring it out. Thanks, Aaron

-- Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin
"carpe diem quam minimum credula postero"
Re: RTG/MRTG/Cricket replacement using Cassandra?
I know cloudkick is doing something like this, and we're developing our own in-house method, but it would be nice for there to be a generally available package that does this. Lately I've been wishing that someone would take graphite (written in python) and put its frontend on top of cassandra instead of its custom whisper databases. The counter support in 0.8 should help with keeping the rollups up to date.

On Mar 31, 2011, at 3:56 PM, Aaron Turner wrote:

I've been looking at replacing our PostgreSQL backend for RTG (an SNMP-based polling and graphing solution for network traffic/ports) with something using Cassandra, in order to solve our scalability and redundancy requirements. Based on a lot of what I've read, Cassandra is an ideal data store for this time series data; in fact, Eric Evans, in his presentation on the Cassandra home page, suggests that this kind of use case is perfect for Cassandra. So this got me wondering if someone had already come up with a CF model for this kind of data, including daily/weekly/monthly/yearly rollups. Perhaps there's even an open source project or two implementing this sort of thing? I've found flewton (https://github.com/flewton/flewton), which is possibly relevant, but my Java skills are pretty non-existent so I'm having a hard time figuring it out. Thanks, Aaron

-- Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin
"carpe diem quam minimum credula postero"
questions about cassandra-1072/1546
Can we get an update? After reading through the comments on 1072, it looks like this is getting close to finished, but it's hard for someone not knee-deep in the project to tell. I'm primarily interested in the timeline you foresee for getting increment support into trunk for 0.7, and in some documentation around how counters will be supported from the user's perspective: chiefly, what a Column and SuperColumn will look like with counters, and what the thrift API will be. Some documentation about the remaining issues and concerns we should be aware of when using counters would be good too, since it looks like there were some in the comments. Again, as someone not knee-deep in the project, it's hard to tell how severe they are, or how and when they would apply in general use.