So I looked at the code, and the compaction selection seems to concentrate on the newer buckets and treat them as the "most interesting" candidates for compaction.
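To make that concrete, this is roughly how I read the window selection (a simplified sketch in plain Java, not the actual TimeWindowCompactionStrategy code; the class/method names, the count map, and the thresholds are all made up for illustration):

import java.util.Map;
import java.util.NavigableMap;

// Simplified model of the selection behavior as I understand it, plus the
// count-stabilizing alternative I wish for below. Not Cassandra code.
public class TwcsSelectionSketch {

    // Windows keyed by their start time (epoch millis), value = sstable count.
    // Windows are walked newest-first and the first eligible one wins, so a
    // busy current window keeps getting picked and older, fragmented windows
    // get starved while new data keeps flushing.
    static Long pickWindowNewestFirst(NavigableMap<Long, Integer> sstablesPerWindow,
                                      long currentWindowStart,
                                      int minThreshold) {
        for (Map.Entry<Long, Integer> w : sstablesPerWindow.descendingMap().entrySet()) {
            boolean isCurrent = w.getKey() == currentWindowStart;
            int count = w.getValue();
            if ((isCurrent && count >= minThreshold) || (!isCurrent && count >= 2)) {
                return w.getKey();
            }
        }
        return null; // nothing eligible
    }

    // The kind of thing I mean below by "stabilize the sstable count across
    // buckets": prefer whichever window is most fragmented once it crosses a
    // configurable threshold, regardless of how old it is.
    static Long pickMostFragmentedWindow(NavigableMap<Long, Integer> sstablesPerWindow,
                                         int fragmentationThreshold) {
        Long worst = null;
        int worstCount = fragmentationThreshold;
        for (Map.Entry<Long, Integer> w : sstablesPerWindow.entrySet()) {
            if (w.getValue() > worstCount) {
                worst = w.getKey();
                worstCount = w.getValue();
            }
        }
        return worst; // null means no window is fragmented enough
    }
}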
Our problem is that we have a huge number of fragmented sstables in buckets
that are a few days old (not yet expired; our expiration is 7 days), so the
sstable selection algorithm doesn't find those particularly "interesting".
Perhaps we should have something that tries to stabilize the sstable count
across buckets, maybe with some configurable thresholds for deciding what to
prioritize.

So even though we opened the floodgates on compaction throughput and
compactors on the nodes with elevated sstable counts, they are still
basically working on the newer/incoming data. We will probably wait out the
7 days and hope all those fragmented sstables then get nuked. We could use
JMX black magic to force merging (rough sketch at the bottom, after the
quoted message), but TWCS identifies each sstable's bucket from its metadata
(the max data timestamp), and I'm not sure if manually forcing compaction
would disrupt that bucketing.

We will vertically scale the AWS instances if we need to in the short run.
We have stabilized the sstable counts on the nodes that had elevated levels,
and we shall see if things return to normal in three or four more days when
the fragments expire.

On Tue, Jul 9, 2019 at 11:12 AM Carl Mueller <[email protected]> wrote:

> The existing 15 node cluster had about 450-500 GB/node, most in one TWCS
> table. Data is applied with a 7-day TTL. Our cluster couldn't be expanded
> due to a bit of political foot dragging, and new load of about 2x-3x
> started up around the time we started expanding.
>
> About 500 sstables per node, with one outlier of 16,000 files (Data.db
> files, to be clear).
>
> The 16,000 Data.db sstable files grew from 500 steadily over a week.
> Probably compaction fell behind, exacerbated by the growing load, but the
> sstable count growth appears to have started before the heaviest load
> increases.
>
> We attempted to expand, figuring the cluster was under duress. The first
> addition still had 150,000 files / 25,000 Data.db files, and about 500 GB.
>
> Three other nodes have started to gain in number of files as well.
>
> Our last attempted expand filled a 2 terabyte disk and we ended up with
> over 100,000 Data.db sstable files and 600,000 files overall, and it
> hadn't finished. We killed that node.
>
> Wide rows do not appear to be a problem.
>
> We are vertically scaling our nodes to bigger hardware, unthrottling
> compaction, and doubling compactors on the nodes that are starting to
> inflate numbers of sstables; that appears to be helping.
>
> But the overstreaming is still a mystery.
>
> Table compaction settings:
>
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'compaction_window_unit': 'HOURS',
>         'compaction_window_size': '4', 'class':
>         'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy'}
>     AND compression = {'sstable_compression':
>         'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 0
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
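Re the JMX route mentioned above: what I have in mind is CompactionManager's
user-defined compaction. A minimal sketch of invoking it from a standalone
Java client, assuming the default JMX port 7199, no auth, and the
single-string signature (older versions also took a keyspace argument; check
the CompactionManagerMBean for your version), and assuming you only hand it
Data.db files that all belong to one TWCS window:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Usage (hypothetical filenames):
//   java ForceUserDefinedCompaction 10.0.0.5 "ks-tbl-ka-101-Data.db,ks-tbl-ka-102-Data.db"
public class ForceUserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        String host = args[0];       // node with the fragmented bucket
        String dataFiles = args[1];  // comma-separated Data.db files, all from the SAME window

        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // Merging only within one window should keep the result in that
            // window, since the bucket is derived from the sstables' max
            // timestamp -- but that's exactly the part I'd verify on a test
            // cluster before touching production.
            mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                    new Object[]{dataFiles}, new String[]{"java.lang.String"});
        }
    }
}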
