Hi,

We are using Cassandra 2.0.3 with STCS for all CFs. We recently faced an issue where the sstable count of certain CFs grew into the THOUSANDS. We realized that every week, when "repair -pr" ran on each node, it created 50+ tiny sstables of around 1 KB each. These sstables were never picked up by minor compactions, so the sstable count kept increasing with every repair.

Our root cause analysis is as under: we were writing to 5 CFs simultaneously, and one CF had 3 secondary indexes. memtable_flush_writers was left at its default of 1, which created a bottleneck for writes across all CFs and led to DROPPED mutations. As a result, many vnode ranges were inconsistent by the time of each weekly repair, and while repairing the inconsistent data, every "repair -pr" created 50+ tiny sstables on each node.
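For reference, this is roughly how we confirmed the dropped mutations and the growing sstable counts, along with the flush-writer change we made (a sketch from our notes, so please verify against your own setup):

    # Dropped MUTATION messages and blocked/pending FlushWriter tasks on each node
    nodetool tpstats | grep -iE 'mutation|flushwriter'

    # Per-CF sstable counts: look for the "SSTable count" lines under the affected CFs
    nodetool cfstats

    # cassandra.yaml on every node (rolling restart needed); the default was 1
    memtable_flush_writers: 3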
Why were the tiny sstables created during repair never compacted? 2.0.3 has a known issue (https://issues.apache.org/jira/browse/CASSANDRA-6483) where cold sstables are not compacted even if you don't specify cold_reads_to_omit or set it to zero. We think this prevented the tiny sstables from participating in minor compactions. On top of that, we had long GC pauses that led to nodes being marked down.

Our fix:
1. Increased memtable_flush_writers to 3 so that mutations are no longer dropped and the data stays consistent. With most of the data consistent, a routine repair should not create tiny sstables to repair vnode ranges.
2. Executed a major compaction on all CFs whose sstable count was in the thousands, to bring the situation under control.
3. Throttled compaction, reduced total_memtable_space_in_mb and tuned the JVM to prevent long GC pauses.

Queries:
1. After increasing memtable_flush_writers, throttling compaction and tuning the JVM, the tiny sstables that were previously never compacted during repairs started participating in compactions, and the sstable count of several CFs dropped considerably after repair (even though not all of the tiny sstables were compacted). As per our RCA, these tiny sstables were not being compacted because of COLDNESS. What led to their compaction now, and how did our changes affect minor compactions? Is the data read during repair counted as a read, so that these sstables become hot and eligible for compaction? Or is there a gap in our root cause analysis?
2. We expected the CQL compaction subproperty tombstone_threshold to help us after the major compactions: even if we are left with one huge sstable, once its tombstone ratio crosses the 20% threshold it should be compacted on its own and the tombstones dropped after gc_grace_seconds, even though there are no similar-sized sstables as STCS normally requires (and STCS never guarantees that reads go through a single sstable anyway). But in our initial testing, the single huge sstable is not compacted even after we delete all rows in it and gc_grace_seconds has passed. Why does tombstone_threshold behave like that, and how should we deal with this side effect of major compaction (the one huge sstable)? The exact settings we are testing are in the PS below.

Thanks
Anuj Wadehra
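PS: For completeness, a sketch of the compaction settings we are experimenting with for query 2 (the table name is illustrative; 0.2 is just the documented default threshold made explicit, and cold_reads_to_omit is set to zero to take coldness out of the picture):

    ALTER TABLE my_keyspace.my_cf
      WITH compaction = { 'class': 'SizeTieredCompactionStrategy',
                          'tombstone_threshold': '0.2',
                          'cold_reads_to_omit': '0.0' };

The idea is that with cold_reads_to_omit at 0.0 the coldness check should no longer exclude the sstable, leaving only the tombstone_threshold / gc_grace_seconds conditions to decide whether a single-sstable tombstone compaction fires.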