Hello, I am evaluating Cassandra and I'm running into some strange I/O behavior that I can't explain. I'd like some help or ideas to troubleshoot it.
I am running a 1-node cluster with a keyspace consisting of two column families, one of which has dozens of supercolumns, each containing dozens of columns. All in all, this is a couple gigabytes of data, 12GB on the hard drive. The hardware is pretty good: 16GB memory + RAID-0 SSD drives with LVM and an i5 processor (4 cores).

Here are the stats (nodetool cfstats):

Keyspace: xxxxxxxxxxxxxxxxxxx
    Read Count: 460754852
    Read Latency: 1.108205793092766 ms.
    Write Count: 30620665
    Write Latency: 0.01411020877567486 ms.
    Pending Tasks: 0

        Column Family: xxxxxxxxxxxxxxxxxxxxxxxxxx
        SSTable count: 5
        Space used (live): 548700725
        Space used (total): 548700725
        Memtable Columns Count: 0
        Memtable Data Size: 0
        Memtable Switch Count: 11
        Read Count: 2891192
        Read Latency: NaN ms.
        Write Count: 3157547
        Write Latency: NaN ms.
        Pending Tasks: 0
        Key cache capacity: 367396
        Key cache size: 367396
        Key cache hit rate: NaN
        Row cache capacity: 112683
        Row cache size: 112683
        Row cache hit rate: NaN
        Compacted row minimum size: 125
        Compacted row maximum size: 924
        Compacted row mean size: 172

        Column Family: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
        SSTable count: 7
        Space used (live): 8707538781
        Space used (total): 8707538781
        Memtable Columns Count: 0
        Memtable Data Size: 0
        Memtable Switch Count: 30
        Read Count: 457863660
        Read Latency: 2.381 ms.
        Write Count: 27463118
        Write Latency: NaN ms.
        Pending Tasks: 0
        Key cache capacity: 4518387
        Key cache size: 4518387
        Key cache hit rate: 0.9247881700850826
        Row cache capacity: 1349682
        Row cache size: 1349682
        Row cache hit rate: 0.39400533823415573
        Compacted row minimum size: 125
        Compacted row maximum size: 6866
        Compacted row mean size: 165

My app makes a bunch of requests using a MultigetSuperSliceQuery for a set of keys, typically a couple dozen at most. Each query also selects a subset of the supercolumns. I am running at most 8 requests in parallel.

Two days ago, I ran a 1.5 hour process that basically read every key. The server had no IOwaits and everything was humming along.
However, right at the end of the process, there was a huge spike in I/O. I didn't think much of it.

Today, after two days of inactivity, any query I run pushes the SSD drives to 80% utilization, even though I'm running the same query over and over (no cache??)

Any ideas on how to troubleshoot this, or better, how to solve it?

Thanks,
Philippe
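
P.S. A quick back-of-the-envelope check of the figures above, as a minimal sketch rather than a diagnosis. It assumes that a read missing the row cache falls through to the SSTables, that a key cache hit only saves the index lookup, and that the reported hit rates are representative of the whole workload (they may only reflect recent activity). The names cf_x and cf_y are placeholders for the two masked column families.

```python
# Figures copied from the stats above; cf_x / cf_y are placeholder names
# for the two masked column families.
live_bytes = {
    "cf_x": 548_700_725,
    "cf_y": 8_707_538_781,
}
row_cache_hit_rate = 0.39400533823415573  # cf_y
key_cache_hit_rate = 0.9247881700850826   # cf_y

# Total live SSTable data, in GiB, to compare against the 16GB of memory.
total_live_gib = sum(live_bytes.values()) / 2**30

# Assumption: every row cache miss has to be served from the SSTables
# (from the OS page cache if the data is resident, from the SSDs otherwise).
sstable_fraction = 1.0 - row_cache_hit_rate

# Assumption: of those reads, only key cache misses also need an index lookup.
index_lookup_fraction = sstable_fraction * (1.0 - key_cache_hit_rate)

print(f"live data on disk:        {total_live_gib:.2f} GiB")
print(f"reads reaching SSTables:  {sstable_fraction:.1%}")
print(f"reads needing index scan: {index_lookup_fraction:.1%}")
```

If those assumptions hold, a bit over half of the reads on the large column family depend on the SSTables being in the OS page cache to avoid touching the drives.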