I've been testing Cassandra 3.11 (we currently run 3.7) and have
occasionally observed very high I/O utilization, sometimes flatlining at
100% for an extended period. My use case is pretty simple, and I'm
currently only testing part of it on the new version, so I'm looking for
advice on what might be going wrong.

Use Case: I am using Cassandra as basically a large "set". My table schema
is incredibly simple: just a primary key. Records are all written with the
same TTL (7 days). The only queries are inserting a key (which we expect to
happen only once) and checking whether that key exists in the table. In my
3.7 cluster I am using DateTieredCompaction and running on c3.4xlarge (x30)
in AWS. I've been experimenting with i3.4xlarge and also wanted to try
TimeWindowCompaction to see if we could get better performance when adding
machines to the cluster. That was always a really painful experience in 3.7
with DateTieredCompaction, and the docs say TimeWindowCompaction is ideal
for my use case.
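
For concreteness, the two statements look roughly like the sketch below. The app id and hash value are made-up placeholders, and I'm assuming the TTL is passed on each write, since the table's default_time_to_live is 0.

```cql
-- Write: record that a hash has been seen. 604800 seconds = 7 days.
INSERT INTO prod_dedupe.event_hashes (app, hash_value)
VALUES (1, 0xdeadbeef)
USING TTL 604800;

-- Read: check whether the hash is already present.
-- Zero rows back means the key has not been seen (or has expired).
SELECT app FROM prod_dedupe.event_hashes
WHERE app = 1 AND hash_value = 0xdeadbeef;
```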

Right now I am running a new cluster with 3.11 and TimeWindowCompaction
alongside the old cluster, and writing to both. Reads go only to the old
cluster while I go through this preliminary testing, so the 3.11 cluster
receives between 90K and 150K writes/second and no reads. This morning the
cluster sat at 100% I/O utilization for about 30 minutes and eventually
recovered. At that time it was only receiving ~100K writes/second. I don't
see anything interesting in the logs that indicates what is going on, and I
don't think a sudden compaction is the issue, since I have limits on
compaction throughput.

Staying on 3.7 would be a major bummer, so I'm looking for advice.

Some information that might be useful:

compaction throughput - 16MB/s
concurrent compactors - 4
machine type - i3.4xlarge (x20)
disk - RAID0 across 2 NVMe SSDs

Table Schema looks like this:

CREATE TABLE prod_dedupe.event_hashes (
    app int,
    hash_value blob,
    PRIMARY KEY ((app, hash_value))
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = 'For deduping'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS', 'max_threshold': '64', 'min_threshold':
    AND compression = {'chunk_length_in_kb': '4', 'class': '
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 3600
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';


