The over-streaming is only problematic for the repaired SSTables, but it can be triggered by inconsistencies within the unrepaired SSTables during an incremental repair session. This is because although an incremental repair will only compare the unrepaired SSTables, but it will stream both the unrepaired and repaired SSTables for the inconsistent token ranges. Keep in mind that the source SSTables for streaming is selected based on the token ranges, not the repaired/unrepaired state.

Base on the above, I'm unsure running an incremental repair before a full repair can fully avoid the over-streaming issue.

On 07/02/2024 22:41, Sebastian Marsching wrote:
Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.

Am 07.02.2024 um 20:22 schrieb Bowen Song via user <user@cassandra.apache.org>:

Unfortunately repair doesn't compare each partition individually. Instead, it 
groups multiple partitions together and calculate a hash of them, stores the 
hash in a leaf of a merkle tree, and then compares the merkle trees between 
replicas during a repair session. If any one of the partitions covered by a 
leaf is inconsistent between replicas, the hash values in these leaves will be 
different, and all partitions covered by the same leaf will need to be streamed 
in full.

Knowing that, and also know that your approach can create a lots of 
inconsistencies in the repaired SSTables because some unrepaired SSTables are 
being marked as repaired on one node but not on another, you would then 
understand why over-streaming can happen. The over-streaming is only 
problematic for the repaired SSTables, because they are much bigger than the 
unrepaired.


On 07/02/2024 17:00, Sebastian Marsching wrote:
Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.
Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking an data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.

Reply via email to