I've been testing Cassandra 3.11 (we currently run 3.7) and have occasionally observed very high I/O utilization, which sometimes flatlines at 100% for an extended period. My use case is pretty simple, and I am currently testing only part of it on the new version, so I'm looking for advice on what might be going wrong.
Use Case: I am using Cassandra as basically a large "set". My table schema is incredibly simple, just a primary key. Records are all written with the same TTL (7 days). The only queries are inserting a key (which we expect to happen only once) and checking whether that key exists in the table.

In my 3.7 cluster I am using DateTieredCompaction and running on c3.4xlarge (x30) in AWS. I've been experimenting with i3.4xlarge and also wanted to try TimeWindowCompaction to see if we could get better performance when adding machines to the cluster. That was always a really painful experience in 3.7 with DateTieredCompaction, and the docs say TimeWindowCompaction is ideal for my use case.

Right now I am running a new cluster with 3.11 and TimeWindowCompaction alongside the old cluster and doing writes to both. Reads go only to the old cluster while I go through this preliminary testing, so the 3.11 cluster receives between 90K and 150K writes/second and no reads.

This morning the cluster was at 100% I/O utilization for about 30 minutes and eventually recovered from this state. At that time it was receiving only ~100K writes/second. I don't see anything interesting in the logs that indicates what is going on, and I don't think a sudden compaction is the issue, since I have limits on compaction throughput. Staying on 3.7 would be a major bummer, so I'm looking for advice.
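For a rough sense of scale, here is a minimal back-of-the-envelope sketch using only numbers from this post: the 7-day TTL, the 4-hour compaction window from the table schema, the x20 node count, and the 90K to 150K writes/second range. The helper names are hypothetical, the window count is approximate, and the per-node figure ignores the replication factor, which is not stated here, so actual replica writes per node would be higher.

```python
# Back-of-the-envelope numbers only; helper names are hypothetical.
TTL_HOURS = 7 * 24      # all records are written with a 7-day TTL
WINDOW_HOURS = 4        # compaction_window_size = 4, unit = HOURS
NODES = 20              # i3.4xlarge (x20)


def live_windows(ttl_hours, window_hours):
    """Approximate number of time windows holding unexpired data."""
    # Ceiling division, so a partially filled window still counts.
    return -(-ttl_hours // window_hours)


def writes_per_node(cluster_writes_per_sec, nodes):
    """Client writes per node, assuming an even spread and ignoring replication."""
    return cluster_writes_per_sec / nodes


print("time windows with live data:", live_windows(TTL_HOURS, WINDOW_HOURS))
for rate in (90_000, 100_000, 150_000):
    print(rate, "writes/s cluster-wide ->", writes_per_node(rate, NODES), "per node")
```

Under these assumptions that comes to about 42 windows with live data, and roughly 4,500 to 7,500 client writes/second per node before replication.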
Some information that might be useful:

compaction throughput - 16MB/s
concurrent compactors - 4
machine type - i3.4xlarge (x20)
disk - RAID0 across 2 NVMe SSDs

Table schema looks like this:

CREATE TABLE prod_dedupe.event_hashes (
    app int,
    hash_value blob,
    PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = 'For deduping'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS', 'max_threshold': '64', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '4', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 3600
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

Thanks,
Kurt