> I’ve confirmed that the inconsistency issues disappeared after repair
> finished.
> Has anything changed with repair in 3.11.1? One difference I noticed is
> that the validation step during repair could bring down the node on large
> tables, which never happened in 3.10. I had to throttle validation requests
> to let it complete. Also, I switched back to -pr repairs instead of
> incremental repair, which is a resource killer and often hangs on the first
> node being repaired.

When you switched back to non-incremental repair, did you set `repairedAt`
on all sstables (on all nodes) back to zero (the unrepaired state)?
This should have been done with `sstablerepairedset --is-unrepaired …`
while each node is stopped.
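
A minimal Python sketch of that reset, to be run on each node in turn with
the node stopped. It assumes the default data directory layout
(`/var/lib/cassandra/data/<keyspace>/<table>-<id>/...-Data.db`); the helper
names and the path are hypothetical. It collects the sstable data files into
a list file and builds the `sstablerepairedset` command line. The
`--really-set` and `-f` flags are as I remember them from the 3.x tool, so
check its usage output on your version before running it for real.

```python
import subprocess
from pathlib import Path

# Hypothetical default location; adjust to your data_file_directories setting.
DATA_DIR = Path("/var/lib/cassandra/data")


def find_data_files(data_dir: Path, keyspace: str) -> list[Path]:
    """Collect the *-Data.db component of every live sstable in one keyspace.

    The pattern only goes one directory deep below the keyspace, so files in
    the snapshots/ and backups/ subdirectories are not picked up.
    """
    return sorted((data_dir / keyspace).glob("*/*-Data.db"))


def write_sstable_list(files: list[Path], list_path: Path) -> None:
    """Write one sstable path per line, the format read by the -f option."""
    list_path.write_text("".join(f"{f}\n" for f in files))


def build_unrepair_command(list_path: Path) -> list[str]:
    """Build the command that marks every listed sstable as unrepaired."""
    # Flags assumed from the Cassandra 3.x sstablerepairedset usage text.
    return ["sstablerepairedset", "--really-set", "--is-unrepaired",
            "-f", str(list_path)]


def reset_repaired_at(keyspace: str, list_path: Path,
                      dry_run: bool = True) -> list[str]:
    """Print (dry run) or execute the reset. Only do this on a stopped node."""
    files = find_data_files(DATA_DIR, keyspace)
    write_sstable_list(files, list_path)
    cmd = build_unrepair_command(list_path)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

Defaulting to a dry run lets you eyeball the file list before anything is
touched.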

> To address the inconsistency issue, I could write at ALL and read at ONE,
> giving up availability, and stop running repair. Any comments on that?

You lose availability doing this, and given the number of reads you're
doing, I would not recommend it.
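
To put numbers on that trade-off, here is a small Python sketch of the usual
replica-overlap arithmetic, using a hypothetical replication factor of 3. It
is a simplified model: it ignores hints, read repair and DC-local levels such
as LOCAL_QUORUM, and the function names are made up for illustration.

```python
from dataclasses import dataclass


def quorum(rf: int) -> int:
    """Replicas needed for QUORUM: a strict majority of the replication factor."""
    return rf // 2 + 1


@dataclass
class ClTradeoff:
    overlapping: bool      # every read replica set intersects every write set
    write_tolerates: int   # replicas that can be down while writes still succeed
    read_tolerates: int    # replicas that can be down while reads still succeed


def tradeoff(rf: int, write_cl: int, read_cl: int) -> ClTradeoff:
    """Compare a write/read consistency-level pair for one replication factor."""
    return ClTradeoff(
        overlapping=write_cl + read_cl > rf,
        write_tolerates=rf - write_cl,
        read_tolerates=rf - read_cl,
    )


RF = 3  # hypothetical replication factor
print("ALL/ONE:      ", tradeoff(RF, write_cl=RF, read_cl=1))
print("QUORUM/QUORUM:", tradeoff(RF, quorum(RF), quorum(RF)))
```

With RF=3 both pairs overlap, but ALL/ONE fails every write as soon as a
single replica is down, while QUORUM/QUORUM tolerates one replica down for
both reads and writes.
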
You could consider a fallback strategy that first tries CL.ALL and falls
back to CL.QUORUM. But this is a hack: it could overload your cluster, and
it won't help if the inconsistencies are correlated with dropped messages
or flapping nodes.
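
A rough, driver-agnostic Python sketch of that fallback. `execute` stands in
for whatever call your client uses to run a query at a given consistency
level, and `UnavailableError` is a hypothetical placeholder for your driver's
"not enough replicas alive" exception. Note that every fallback is a second
request, which is exactly the extra load that makes this a hack.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class UnavailableError(Exception):
    """Hypothetical stand-in for the driver's unavailable-replicas error."""


def read_with_fallback(execute: Callable[[str, str], T], query: str,
                       levels: Tuple[str, ...] = ("ALL", "QUORUM")
                       ) -> Tuple[T, str]:
    """Try each consistency level in order; return the result and level used.

    Only UnavailableError triggers a fallback; any other error propagates.
    """
    last_error: Optional[UnavailableError] = None
    for level in levels:
        try:
            return execute(query, level), level
        except UnavailableError as err:
            last_error = err  # too few replicas for this level, try weaker one
    if last_error is None:
        raise ValueError("no consistency levels given")
    raise last_error
```

Returning the level actually used lets you count how often reads are
degraded to QUORUM, which is worth alerting on.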

I'd also be prepared to upgrade to 3.11.3 when it is released.


Mick Semb Wever

The Last Pickle
Apache Cassandra Consulting
