It's definitely counters I'm using, and some of the rows I'm inserting are long-ish, if 1.3 MB counts as long. (A minimal cli illustration of what my "inserts" actually are is at the bottom of this message.)

Maybe it would help if I said I was using counter super columns. I'm also writing to only a handful of rows at a time, until each is full. It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, may not be optimized for this kind of load. It also looks like I can turn off replicate_on_write for these CFs, as long as I stick to CL.ONE, to get around this in the short term, but I'd like to know what actually happens when replicate_on_write is true versus false so I can make a more informed choice.
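Concretely, the workaround I have in mind looks something like this from the 0.8 cassandra-cli (the keyspace and CF names are placeholders, and the comment reflects my unverified understanding of the attribute, so please correct me if I have it backwards):

    $ ./cassandra-cli -h localhost -p 9160

    use MyKeyspace;
    -- my (unverified) understanding: with this off, an add is only applied by
    -- the replica that coordinates it, which is why reading anything back at
    -- a level above CL.ONE would no longer be safe
    update column family MyCounterCF with replicate_on_write = false;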
On Jul 12, 2011, at 12:20 AM, Yang wrote:

> interesting,
>
> first just to make sure: since replicateOnWrite is for counters, you
> are using counters (you use the word "insert" instead of
> "add/increment"), right?
>
> if you are using counters, supposedly the leader runs
> replicateOnWrite; the fact that all your adds find the one box as
> leader is probably something worth investigating.
>
> finally, you could try debugging into those replicateOnWrite tasks and
> see what they are doing exactly; maybe there is something immediately
> wrong
>
>
> On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
>> This is now my fourth attempt to get this message through. Apologies if
>> you see multiple copies.
>>
>> I've tried to give as much relevant data as I can think of, but please
>> let me know if you need any other info. I spent the day getting jmxtrans
>> to talk to statsd with the cassandra JMX data, so I can provide graphs
>> of just about anything you want. I can also re-run the tests to
>> reproduce the problem if need be. As always, thanks for taking a look!
>>
>> I have a 4-host test cluster running 0.8.1 that I'm writing counters
>> into with hector at CL.ONE or CL.QUORUM (the test was run with each, and
>> both gave the same results). The partitioner is the RandomPartitioner,
>> and RF=3. nodetool ring shows that the actual data is well balanced:
>>
>> Address    DC           Rack   Status  State   Load     Owns    Token
>>                                                                 127605887595351923798765477786913079296
>> 10.0.0.57  datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
>> 10.0.0.56  datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>> 10.0.0.55  datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>> 10.0.0.54  datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>
>> I also tested against a 1-box test cluster with RF=1.
>>
>> Both show the same results: high performance for a while, and then
>> ReplicateOnWrite backs way up (I've seen it as high as a million
>> pending), the cassandra process becomes unresponsive, and the hector
>> client starts throwing exceptions. High performance in this case
>> equates to about 5-20k inserts/sec on both the 4-box and the 1-box
>> cluster (I'm inserting one long row at a time, so throughput is bounded
>> by the one box that owns the row). The cassandra process never seems to
>> recover, even if left alone for 24 hours; it just keeps chewing through
>> those ReplicateOnWrite pending tasks.
>>
>> GC doesn't seem to be a factor: logs show GCs completing in < 1s on all
>> 4 boxes in the test cluster, as well as on the 1 standalone server.
>>
>> I'm also seeing highly erratic performance from the cluster in general:
>> inserts/sec usually start at around 3k, sometimes creep up towards 6k,
>> then drop back down below 1k for an extended period; maybe 1% of the
>> time they'll spike up to 8k or even 15k, but that never lasts more than
>> 30 seconds.
>>
>> Here's the tpstats output from all 4 boxes.
>>
>> The slowest/most problematic/highest-CPU-usage one (this output is also
>> representative of the one-box cluster about 20 minutes into my insert
>> run):
>>
>> 4 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending    Completed
>> ReadStage                      0         0            0
>> RequestResponseStage           0         0     59095840
>> MutationStage                  0         0     31182942
>> ReadRepairStage                0         0            0
>> ReplicateOnWriteStage         32    928042     28246716
>> GossipStage                    1         1       449464
>> AntiEntropyStage               0         0            0
>> MigrationStage                 0         0            4
>> MemtablePostFlusher            0         0          445
>> StreamStage                    0         0            0
>> FlushWriter                    0         0          445
>> FILEUTILS-DELETE-POOL          0         0           38
>> MiscStage                      0         0            0
>> FlushSorter                    0         0            0
>> InternalResponseStage          0         0            4
>> HintedHandoff                  0         0            0
>>
>> The one with the second-highest CPU usage:
>>
>> 3 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending    Completed
>> ReadStage                      0         0            0
>> RequestResponseStage           0         0     27413910
>> MutationStage                  0         0     27523094
>> ReadRepairStage                0         0            0
>> ReplicateOnWriteStage          0         0     11111971
>> GossipStage                    0         0       335422
>> AntiEntropyStage               0         0            0
>> MigrationStage                 0         0            4
>> MemtablePostFlusher            0         0           66
>> StreamStage                    0         0            0
>> FlushWriter                    0         0           66
>> FILEUTILS-DELETE-POOL          0         0           31
>> MiscStage                      0         0            0
>> FlushSorter                    0         0            0
>> InternalResponseStage          0         0            4
>> HintedHandoff                  0         0            0
>>
>> The other two, which see minimal CPU usage:
>>
>> 2 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending    Completed
>> ReadStage                      0         0            0
>> RequestResponseStage           0         0      9213999
>> MutationStage                  0         0     33233114
>> ReadRepairStage                0         0            0
>> ReplicateOnWriteStage          0         0            0
>> GossipStage                    0         0       336016
>> AntiEntropyStage               0         0            0
>> MigrationStage                 0         0            4
>> MemtablePostFlusher            0         0           77
>> StreamStage                    0         0            0
>> FlushWriter                    0         0           77
>> FILEUTILS-DELETE-POOL          0         0           40
>> MiscStage                      0         0            0
>> FlushSorter                    0         0            0
>> InternalResponseStage          0         0            4
>> HintedHandoff                  0         0            0
>>
>> 1 $ ./nodetool -h localhost tpstats
>> Pool Name                 Active   Pending    Completed
>> ReadStage                      0         0            0
>> RequestResponseStage           0         0     13796875
>> MutationStage                  0         0     36116662
>> ReadRepairStage                0         0            0
>> ReplicateOnWriteStage          0         0            0
>> GossipStage                    0         0       336496
>> AntiEntropyStage               0         0            0
>> MigrationStage                 0         0            4
>> MemtablePostFlusher            0         0           91
>> StreamStage                    0         0            0
>> FlushWriter                    0         0           91
>> FILEUTILS-DELETE-POOL          0         0           43
>> MiscStage                      0         0            0
>> FlushSorter                    0         0            0
>> InternalResponseStage          0         0            4
>> HintedHandoff                  0         0            0
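PS, to Yang's first question: yes, these are genuinely counter adds rather than regular column inserts; "insert" was just sloppy wording on my part. In cassandra-cli terms, each write the test makes boils down to something like the following (all names are made up, and I'm going from memory on the super column form of incr, so treat it as a sketch):

    use MyKeyspace;
    -- one add against a counter super column; the test fires these as fast
    -- as it can at a small set of hot rows until each row is "full"
    incr MyCounterCF['row-0001']['super-1']['counter-1'] by 1;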