$ ./nodetool -h localhost ring
Address    DC           Rack   Status  State   Load      Owns    Token
                                                                 136112946768375385385349842972707284580
10.0.0.57  datacenter1  rack1  Up      Normal  8.31 GB   20.00%  0
10.0.0.56  datacenter1  rack1  Up      Normal  13.7 GB   20.00%  34028236692093846346337460743176821145
10.0.0.55  datacenter1  rack1  Up      Normal  13.87 GB  20.00%  68056473384187692692674921486353642290
10.0.0.54  datacenter1  rack1  Up      Normal  8.03 GB   20.00%  102084710076281539039012382229530463435
10.0.0.72  datacenter1  rack1  Up      Normal  1.77 GB   20.00%  136112946768375385385349842972707284580
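(Side note: the tokens above are already evenly spaced at i * (2^127 / 5), which is why every node owns exactly 20% of the ring. A minimal sketch, assuming RandomPartitioner's [0, 2^127) token space, that reproduces them:)

import java.math.BigInteger;

public class InitialTokens {
    public static void main(String[] args) {
        int nodes = 5;
        // RandomPartitioner's token space is [0, 2^127); divide it evenly.
        BigInteger step = BigInteger.valueOf(2).pow(127)
                                    .divide(BigInteger.valueOf(nodes));
        for (int i = 0; i < nodes; i++) {
            // Prints 0, 34028236692093846346337460743176821145, ... --
            // exactly the tokens in the ring output above.
            System.out.println(step.multiply(BigInteger.valueOf(i)));
        }
    }
}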
This is a brand new cluster that we brought up and started loading data into a few days ago. It uses RandomPartitioner, RF=3 on everything, and we do QUORUM writes. All keyspaces and column families hold counter supercolumns. All keys are moderately sized ASCII strings with good variation between them, all supercolumn names are longs, and all column names are ASCII strings. No decrements are done, no rows or columns are deleted, and read load is almost nonexistent. Column values do get overwritten, since they are counters being incremented; this is expected to happen quite a bit. Not all rows are the same length.

Insert latency from my Hector client box to the cluster averages 70-200 ms, which is really high. Inserts/sec from Hector's perspective peaks at 750/sec and consistently drops down to (and stays at) 120/sec. This is not due to compactions, based on the output of nodetool compactionstats. I wiped the cluster this afternoon and started from scratch, and I'm seeing the same distribution on a smaller scale, with the same latencies.

Inserts

Going by statistics from Cassandra via JMX, all hosts are completing about the same number of MutationStage tasks per second. However, one host consistently has pending MutationStage and ReplicateOnWriteStage (ROW) tasks, ranging from 50/30 to 211/42 respectively throughout the day. Now, I know that ReplicateOnWrite can be very slow if you have large supercolumns, but I believe I don't; I'm working on proving that at the moment, pending a couple of code pushes.

This same box typically runs CPU at around 600-700%, and it's all user-space CPU, not I/O wait. We monitor these boxes closely, and we've tweaked a few things to rule them out (enabling mmap'd I/O, disabling swap, mounting ext4 with noatime), none of which has made any difference. If I kill Cassandra on that one box, the load moves to the box before it in the ring: mutations and ROWs back up there, and its CPU jumps to 600%. That rules out bad hardware on the original box.

Heap usage sits at 600 MB-2 GB, and the heap size is 4 GB on all 5 boxes. CPU usage and mutations/ROWs are not affected by Hector client connections: if I remove this single host from the Hector configuration and confirm there are 0 connections from my client to it, I still see high mutations, ROWs, and CPU usage. If I increase the number of client connections in the Hector pool, performance does not change.

concurrent_writes is set to 48 and concurrent_reads to 32; each box has 8 cores. The memtable flush thresholds are 28 MB or 131k operations. Our memtables flush every 3 minutes (based on graphs, and this aligns exactly with 131k / (mutations/sec each box is doing)). The commitlog and data directories are on the same disk, but our disks seem bored. The key cache is enabled and I see an almost perfect 100% hit rate; the row cache is disabled.

My questions are:

1. Is it normal to see load spread out this unevenly when using RandomPartitioner? How do I fix it? Do I need to assign token ranges manually even with RandomPartitioner?
2. Is there a way to see the total row count assigned to each box?
3. Why is this one host running at 600% CPU while the rest are sitting at 0%?
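(In case anyone wants to reproduce the pending-task numbers I'm quoting: they come from the stage MBeans over JMX. A rough sketch along these lines reads them remotely; the host list is mine, and the port is an assumption (7199 is the default on recent Cassandra, older builds used 8080), so adjust for your setup.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StagePending {
    public static void main(String[] args) throws Exception {
        String[] hosts = {"10.0.0.57", "10.0.0.56", "10.0.0.55",
                          "10.0.0.54", "10.0.0.72"};
        for (String host : hosts) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // Cassandra exposes its request stages as thread-pool MBeans.
                for (String stage : new String[]{"MutationStage",
                                                 "ReplicateOnWriteStage"}) {
                    ObjectName name = new ObjectName(
                            "org.apache.cassandra.request:type=" + stage);
                    System.out.println(host + " " + stage + " pending="
                            + mbs.getAttribute(name, "PendingTasks"));
                }
            } finally {
                jmxc.close();
            }
        }
    }
}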
For reference, here's cfstats taken from the host with the high CPU usage:

Keyspace: STATS_TEST
    Read Count: 18744838
    Read Latency: 2.568355930309987 ms.
    Write Count: 18744845
    Write Latency: 0.020453476835898085 ms.
    Pending Tasks: 0
        Column Family: rollup1h
        SSTable count: 4
        Space used (live): 194724367
        Space used (total): 260574143
        Number of Keys (estimate): 11904
        Memtable Columns Count: 34708
        Memtable Data Size: 27280700
        Memtable Switch Count: 67
        Read Count: 9255646
        Read Latency: 2.498 ms.
        Write Count: 9255658
        Write Latency: 0.021 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 91254
        Key cache hit rate: 0.9950598390225411
        Row cache: disabled
        Compacted row minimum size: 150
        Compacted row maximum size: 52066354
        Compacted row mean size: 17404

        Column Family: rollup5m
        SSTable count: 4
        Space used (live): 296161119
        Space used (total): 402687415
        Number of Keys (estimate): 10496
        Memtable Columns Count: 34742
        Memtable Data Size: 34607575
        Memtable Switch Count: 67
        Read Count: 9255681
        Read Latency: 2.700 ms.
        Write Count: 9255687
        Write Latency: 0.020 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 88629
        Key cache hit rate: 0.9956045403129263
        Row cache: disabled
        Compacted row minimum size: 150
        Compacted row maximum size: 129557750
        Compacted row mean size: 25562
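(One thing I plan to check, tying into the ReplicateOnWrite suspicion above: the compacted row maximum sizes here (~52 MB and ~130 MB) show that a few rows are enormous. As far as I understand, RandomPartitioner places a row at abs(MD5(key)) in [0, 2^127), so I can hash a suspected hot key and compare the result against the ring tokens to see whether those big rows all land on the busy replica set. A sketch, with made-up key names:)

import java.math.BigInteger;
import java.security.MessageDigest;

public class KeyToToken {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Hypothetical keys -- substitute the rows you suspect are hot.
        String[] keys = {"metric:app1:2011-06-20", "metric:app2:2011-06-20"};
        for (String key : keys) {
            // RandomPartitioner's token for a key: abs(MD5(key)) as a BigInteger.
            // The owning node is the first ring token >= this value (wrapping),
            // plus the next RF-1 nodes as replicas.
            BigInteger token =
                    new BigInteger(md5.digest(key.getBytes("UTF-8"))).abs();
            System.out.println(key + " -> " + token);
        }
    }
}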