Caution: with the method you described, the amount of data streamed at the end by the full repair is not just the amount of data written between stopping the first node and stopping the last node; it also depends on the table size, the number of partitions written, their distribution in the ring, and the 'repair_session_space' value. If the table is large, the writes touch a large number of partitions scattered across the token ring, and 'repair_session_space' is small, you may end up with very expensive over-streaming.
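
To get a feel for the magnitude, here is a rough back-of-the-envelope sketch in Python. The per-leaf memory cost, replica count and partition count below are made-up illustrative values, not numbers from this thread; the point is only that the comparison granularity is a whole Merkle tree leaf, so one mismatched partition can drag many untouched ones into the stream:

    # Rough illustration of why a small repair_session_space can cause
    # over-streaming; all concrete numbers are made-up assumptions.
    repair_session_space = 128 * 1024 * 1024  # bytes available for Merkle trees
    bytes_per_leaf = 50                       # assumed per-leaf memory cost
    replicas = 3                              # one tree per replica is held

    # The trees for a session must fit into repair_session_space, which
    # caps the total number of leaves (the comparison granularity):
    max_leaves = repair_session_space // (bytes_per_leaf * replicas)

    partitions = 2_000_000_000                # hypothetical large table

    # A mismatch in a leaf streams that leaf's whole token range; with
    # writes scattered evenly over the ring, each mismatched leaf drags
    # along roughly:
    print(f"~{partitions // max_leaves} partitions per mismatched leaf")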

On 07/02/2024 12:33, Sebastian Marsching wrote:
Full repair running for an entire week sounds excessively long. Even if you've got 1 TB of data per node, 1 week means the repair speed is less than 2 MB/s, which is very slow. Perhaps you should focus on finding the bottleneck of the full repair speed and work on that instead.

We store about 3–3.5 TB per node on spinning disks (time-series data), so I don’t think the slow repair speed is too surprising.
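
For context, a quick check of the throughput implied by both figures (the 1 TB quoted above and our ~3.5 TB per node):

    # Implied sustained repair throughput for the data sizes mentioned above.
    week = 7 * 24 * 3600                  # seconds in one week
    for tb in (1.0, 3.5):                 # 1 TB (quoted) vs. ~3.5 TB per node
        print(f"{tb} TB/week ~= {tb * 1e12 / week / 1e6:.1f} MB/s")
    # -> 1.0 TB/week ~= 1.7 MB/s, 3.5 TB/week ~= 5.8 MB/s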

Not disabling auto-compaction may result in repaired SSTables getting compacted together with unrepaired SSTables before the repair state is set on them, which leads to a mismatch in the repaired data between nodes, and potentially very expensive over-streaming in a future full repair. You should follow the documented and tested steps and not improvise or get creative if you value your data and time.

There is a different method that we successfully used on three clusters, but I agree that anti-entropy repair is a tricky business and one should be cautious about trying less-tested methods.

Due to the long time a full repair takes (see my earlier explanation), disabling auto-compaction while running the full repair wasn’t an option for us. It was previously suggested that one could run the repair node by node instead of for the whole cluster, but I don’t think that will work: marking the SSTables on only a single node as repaired would lead to massive over-streaming when running the full repair for the next node that shares data with the first one.

So, I want to describe the method that we used, just in case someone is in the same situation:

Going around the ring, we temporarily stopped each node and marked all of its SSTables as repaired. Then we immediately ran a full repair, so that any inconsistencies in data that was now marked as repaired, but had never actually been repaired, would be fixed.
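
In commands, a minimal sketch of that per-node step, using the standard sstablerepairedset and nodetool tools. The data directory, service name and glob pattern are assumptions about a typical installation; adapt them, and test on a disposable cluster first:

    import glob
    import subprocess

    DATA_DIR = "/var/lib/cassandra/data"  # assumed default data directory

    def mark_node_repaired() -> None:
        """Run on one node at a time, going around the ring."""
        # sstablerepairedset may only be run while the node is stopped.
        subprocess.run(["systemctl", "stop", "cassandra"], check=True)
        sstables = glob.glob(f"{DATA_DIR}/*/*/*-Data.db")
        subprocess.run(
            ["sstablerepairedset", "--really-set", "--is-repaired", *sstables],
            check=True,
        )
        subprocess.run(["systemctl", "start", "cassandra"], check=True)

    # After every node has been marked, immediately run one full repair
    # (e.g. subprocess.run(["nodetool", "repair", "--full"], check=True))
    # so that data marked repaired without actually having been repaired
    # gets fixed.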

Using this approach, the amount of over-streaming is minimal (at least for clusters that are not too large, where the rolling restart can be done within an hour or so), because the only difference between the “unrepaired” SSTables on the different nodes will be the data that was written between stopping the first node and stopping the last node.

Any inconsistencies that might exist in the SSTables that were marked as repaired should be caught in the full repair, so I do not think it is too dangerous either. However, I agree that for clusters where a full repair is quick (e.g. finishes in a few hours), using the well-tested and frequently used approach is probably better.
