Caution: with the method you described, the amount of data streamed at
the end by the full repair is not simply the amount of data written
between stopping the first node and stopping the last node. It also
depends on the table size, the number of partitions written, their
distribution in the ring and the 'repair_session_space' value, because
repair compares Merkle trees whose total size is capped by
'repair_session_space', so a single mismatching partition causes the
entire token range covered by its tree leaf to be streamed. If the
table is large, the writes touch a large number of partitions scattered
across the token ring, and 'repair_session_space' is small, you may end
up with very expensive over-streaming.
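For illustration, a rough sketch of checking and raising that setting,
assuming a Cassandra 4.1-style cassandra.yaml at the default package
path (in 4.0 the setting is called repair_session_space_in_mb):

    # Path is an assumption; adjust for your install.
    grep repair_session_space /etc/cassandra/cassandra.yaml
    # A larger value (set in cassandra.yaml, takes effect after a node
    # restart) gives finer-grained Merkle trees and therefore less
    # over-streaming, at the cost of more heap per repair session, e.g.:
    #     repair_session_space: 512MiB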
On 07/02/2024 12:33, Sebastian Marsching wrote:
Full repair running for an entire week sounds excessively long. Even
if you've got 1 TB of data per node, 1 week means the repair speed is
less than 2 MB/s (1 TB over 7 days is roughly 1.7 MB/s), which is very
slow. Perhaps you should focus on finding the bottleneck of the full
repair speed and work on that instead.
We store about 3–3.5 TB per node on spinning disks (time-series data),
so I don’t think the slow repair speed is too surprising.
Not disabling auto-compaction may result in repaired SSTables getting
compacted together with unrepaired SSTables before the repaired state
is set on them, which leads to a mismatch in the repaired data between
nodes, and potentially very expensive over-streaming in a future full
repair. You should follow the documented and tested steps and not
improvise or get creative if you value your data and time.
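For reference, the documented per-node steps look roughly like this
sketch (the keyspace name, data paths and service name are
placeholders; check the migration procedure for your Cassandra
version):

    nodetool disableautocompaction my_keyspace   # keep repaired/unrepaired SSTables apart
    nodetool repair --full my_keyspace           # full repair while auto-compaction is off
    nodetool drain && systemctl stop cassandra   # flush memtables and stop the node
    sstablerepairedset --really-set --is-repaired \
        /var/lib/cassandra/data/my_keyspace/*/*-Data.db
    systemctl start cassandra                    # auto-compaction is re-enabled on restart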
There is a different method that we successfully used on three
clusters, but I agree that anti-entropy repair is a tricky business
and one should be cautious about trying less tested methods.
Due to the long time a full repair takes for us (see my earlier
explanation), disabling auto-compaction while running the full repair
wasn’t an option. It was previously suggested that one could run the
repair per node instead of for the full cluster, but I don’t think
this will work, because marking the SSTables as repaired on only a
single node would lead to massive over-streaming when running the full
repair for the next node that shares data with the first one.
So, I want to describe the method that we used, just in case someone
is in the same situation:

Going around the ring, we temporarily stopped each node and marked all
of its SSTables as repaired. Then we immediately ran a full repair, so
that any inconsistencies in data that was now marked as repaired, but
had not actually been repaired, were fixed. (A sketch of these steps
follows below.)
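A minimal sketch of the procedure, assuming a systemd-managed install
and default data paths (both are assumptions; adjust for your setup):

    # On each node in turn, going around the ring:
    nodetool drain && systemctl stop cassandra   # flush memtables and stop the node
    find /var/lib/cassandra/data/my_keyspace -name '*-Data.db' -print0 \
        | xargs -0 sstablerepairedset --really-set --is-repaired
    systemctl start cassandra                    # rejoin the ring
    # After the whole ring has been walked, run one full repair:
    nodetool repair --full my_keyspace

Note that the ordering is reversed compared to the documented
procedure: the SSTables are marked as repaired first, and the
immediate full repair afterwards fixes anything that was wrongly
marked.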
Using this approach, the amount of over-streaming is minimal (at
least for not too large clusters, where the rolling restart can be
done in less than an hour or so), because the only difference between
the “unrepaired” SSTables on the different nodes will be the data that
was written between stopping the first node and stopping the last node.
Any inconsistencies that might exist in the SSTables that were marked
as repaired should be caught in the full repair, so I do not think it
is too dangerous either. However, I agree that for clusters where a
full repair is quick (e.g. finishes in a few hours), using the
well-tested and frequently used approach is probably better.