> We're running with 128G memory and 30G heap size. Maybe it's good idea
> to increase the commitlog_total_space. On the other hand, even with 8G
> commitlog_total_space, replaying CL after restart takes more than 5
> minutes.
>
> In our case, the actual problem is it's causing lots of read repair
> timeouts as the repair mutations are dropped. Which causes Cassandra JVM
> hang or sometimes crash.
>
Do you have a mix of a small number of very heavily written tables and a larger number of tables with fewer writes? One thing I've had success with when waitingOnSegmentAllocation spiked is setting memtable_flush_period_in_ms on the less busy tables (obviously not all to the same value, so you don't trigger a flush storm). A commit log segment can only be recycled once every table with data in it has flushed, so flushing the quiet tables on a timer seems to keep the block-and-tackle CL rotation cleaner, with fewer tables left to flush.
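
For what it's worth, here is a rough sketch of how the staggering could be scripted. It only prints the ALTER TABLE statements so you can review them before running them through cqlsh. The keyspace and table names, the one-hour base period, and the five-minute step are all made up for illustration; pick values that fit your write pattern.

```python
def staggered_flush_statements(tables, base_ms=3_600_000, step_ms=300_000):
    """Build one ALTER TABLE statement per table, each with a distinct
    memtable_flush_period_in_ms so the periodic flushes are not all
    configured with the same interval.

    base_ms and step_ms are illustrative defaults, not recommendations.
    """
    statements = []
    for i, table in enumerate(tables):
        # Each table gets the base period plus its own offset.
        period_ms = base_ms + i * step_ms
        statements.append(
            f"ALTER TABLE {table} WITH memtable_flush_period_in_ms = {period_ms};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical low-write tables; replace with your own.
    quiet_tables = ["ks.audit_log", "ks.user_prefs", "ks.feature_flags"]
    for stmt in staggered_flush_statements(quiet_tables):
        print(stmt)
```

I'd leave the heavily written tables alone, since they already flush often enough on their own from memtable pressure.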