Many people here have trouble with repair, so I would like to share my
experience regarding the backport of CASSANDRA-12580 "Fix merkle tree size
calculation" (thanks Paulo!) in our C* 2.1.16. I was expecting some minor
improvements but the results are impressive on some tables.
Because of a slow VPN between our EU and US AWS DCs, the massive drop in
overstreaming is a big win for us. On top of that, before the backport I used
to see many RepairExceptions, and their number increased with each repair.
With this fix the graph shows only one exception on one node, so we can say
it's negligible. Such exceptions are not critical because Cassandra-reaper
retries, but each retry is a waste of time.
I run a repair on tables set by set (some sets of tables being more critical,
The most impressive result so far for a set is:
* Before: 23 days (days, not hours)
* With CASSANDRA-12580: 16 hours (yes, hours!)
The improvement is not always dramatic (e.g. 8 hours instead of 39 hours on
another set) but still significant and valuable.
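To put the two results on the same scale, here is a small sketch that converts the durations quoted above into speedup factors. The helper name `speedup` is purely illustrative, and it assumes the "23 days" figure means 23 full 24-hour days.

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times faster the repair finished after the fix."""
    return before_hours / after_hours


# Durations taken from the figures quoted in this message.
best_set = speedup(23 * 24, 16)   # 23 days (assumed 24 h each) vs 16 hours
other_set = speedup(39, 8)        # 39 hours vs 8 hours

print(f"best set:  {best_set:.1f}x faster")   # 34.5x
print(f"other set: {other_set:.1f}x faster")  # 4.9x
```

So the best case is roughly a 35x reduction in repair time, and the less dramatic one is still close to 5x.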
Moreover, considering that:
* repair is a mandatory operation in many use cases
* Paulo already made the patch for 2.1
* C* 2.1 is widely used (the most used?)
I think this bugfix is critical - from an Ops point of view - and should land
in 2.1.17 to be available to people who don't deploy from source.