Re: Switching to Incremental Repair

2024-02-15 Thread Chris Lohfink
I would recommend adding something to C* to be able to flip the repaired
state on all sstables quickly (with default OSS you can turn nodes off one at
a time and use sstablerepairedset). It's a life saver to be able to revert
back to non-IR if the migration goes south. The same approach can be used to
quickly switch sstables into IR, with more caveats. Probably worth a Jira to
add a faster solution
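
For reference, a rough per-node sketch of the stock OSS procedure Chris
describes (drain and stop a node, flip the flag with sstablerepairedset,
start it again). The service name, data directory, and keyspace/table names
are made-up placeholders; treat it as an outline to adapt and test on a
non-production cluster first, not a drop-in script:

import glob
import subprocess

DATA_DIR = "/var/lib/cassandra/data"                # placeholder data directory
KEYSPACE, TABLE_GLOB = "my_keyspace", "my_table-*"  # placeholder names

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Flush memtables and stop the node cleanly (one node at a time).
run(["nodetool", "drain"])
run(["sudo", "systemctl", "stop", "cassandra"])     # placeholder service name

# 2. Mark every Data.db file of the table as unrepaired (revert to non-IR).
sstables = glob.glob(f"{DATA_DIR}/{KEYSPACE}/{TABLE_GLOB}/*-Data.db")
run(["sstablerepairedset", "--really-set", "--is-unrepaired", *sstables])

# 3. Bring the node back before moving on to the next one.
run(["sudo", "systemctl", "start", "cassandra"])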

On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys  wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size as the repaired
> dataset will not get compacted with the unrepaired dataset. Similar to
> Sebastian, we have nodes where the disk usage is multiple TiBs so
> significant growth can be quite dangerous in our case. Would the only safe
> choice be to mark all SSTables as unrepaired before stopping regular
> incremental repair?
>
> Thanks,
> Kristijonas
>
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but it
>> can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, it
>> will stream both the unrepaired and repaired SSTables for the
>> inconsistent token ranges. Keep in mind that the source SSTables for
>> streaming are selected based on the token ranges, not the
>> repaired/unrepaired state.
>>
>> Based on the above, I'm unsure whether running an incremental repair
>> before a full repair can fully avoid the over-streaming issue.
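
To make that last point concrete, here is a toy sketch (not Cassandra code;
the class, names, and numbers are invented for illustration) of selecting
streaming sources purely by token-range overlap, ignoring the repaired flag:

from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    min_token: int
    max_token: int
    repaired: bool

def streaming_sources(sstables, mismatched_ranges):
    """Select every SSTable overlapping a mismatched token range."""
    out = []
    for s in sstables:
        for lo, hi in mismatched_ranges:
            if s.min_token <= hi and s.max_token >= lo:  # ranges overlap
                out.append(s)
                break
    return out

sstables = [
    SSTable("big-repaired", 0, 1000, repaired=True),
    SSTable("small-unrepaired", 200, 300, repaired=False),
]
# An inconsistency found while comparing only unrepaired data...
mismatched = [(250, 260)]
# ...still selects the large repaired SSTable as a streaming source too.
print([s.name for s in streaming_sources(sstables, mismatched)])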
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> right? So, when running an incremental repair before the full repair, the
>> problem that “some unrepaired SSTables are being marked as repaired on one
>> node but not on another” should not exist any longer. Now this data should
>> be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> data should be included on all nodes when calculating the Merkle trees and
>> no overstreaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first* after
>> marking SSTables as repaired and only running the full repair *after* that
>> is critical. I have to admit that previously I wasn’t fully aware of how
>> critical this step is.
>> >
>> >> On 07.02.2024 at 20:22, Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>> >>
>> >> Unfortunately repair doesn't compare each partition individually.
>> Instead, it groups multiple partitions together and calculates a hash of
>> them, stores the hash in a leaf of a Merkle tree, and then compares the
>> Merkle trees between replicas during a repair session. If any one of the
>> partitions covered by a leaf is inconsistent between replicas, the hash
>> values in these leaves will be different, and all partitions covered by the
>> same leaf will need to be streamed in full.
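
A toy model of the leaf-granularity behaviour described above (invented
structures, not Cassandra internals), showing that a single inconsistent
partition drags every partition in its leaf into the stream:

import hashlib

def leaf_map(partitions, num_leaves):
    """Group sorted partition keys into contiguous buckets (toy 'leaves')."""
    keys = sorted(partitions)
    size = max(1, -(-len(keys) // num_leaves))      # ceiling division
    return [keys[i:i + size] for i in range(0, len(keys), size)]

def leaf_hash(partitions, keys):
    """One hash per leaf, covering all partitions in that leaf."""
    return hashlib.md5(repr([(k, partitions[k]) for k in keys]).encode()).hexdigest()

def partitions_to_stream(local, remote, num_leaves=4):
    """All partitions sharing a leaf with any mismatched partition."""
    out = []
    for keys in leaf_map(local, num_leaves):
        if leaf_hash(local, keys) != leaf_hash(remote, keys):
            out.extend(keys)        # the whole leaf streams, not just the bad key
    return out

replica_a = {f"p{i:02d}": "v" for i in range(16)}
replica_b = dict(replica_a, p03="different")        # one inconsistent partition
# Every partition in p03's leaf is streamed, not just p03.
print(partitions_to_stream(replica_a, replica_b))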
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot of
>> inconsistencies in the repaired SSTables because some unrepaired SSTables
>> are being marked as repaired on one node but not on another, you would then
>> understand why over-streaming can happen. The over-streaming is only
>> problematic for the repaired SSTables, because they are much bigger than
>> the unrepaired ones.
>> >>
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>  Caution, using the method you described, the amount of data streamed
>> at the end with the full repair is not the amount of data written between
>> stopping the first node and the last node, but depends on the table size,
>> the number of partitions written, their distribution in the ring and the
>> 'repair_session_space' value. If the table is large, the writes touch a
>> large number of partitions scattered across the token ring, and the value
>> of 'repair_session_space' is small, you may end up with a very expensive
>> over-streaming.
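
A rough back-of-envelope sketch of that effect (all figures below are made
up, and the real leaf count per range is governed by repair_session_space
and Cassandra's internal Merkle tree depth cap, which this does not model
exactly):

partitions_in_range = 50_000_000   # partitions covered by one repair session
avg_partition_bytes = 4 * 1024     # average partition size on disk
leaves_per_range    = 2 ** 15      # fewer leaves when repair_session_space is small
dirty_partitions    = 10_000       # scattered partitions written since last repair

partitions_per_leaf = partitions_in_range / leaves_per_range
# Worst case: every dirty partition lands in its own leaf, and each
# mismatched leaf is streamed in full.
mismatched_leaves = min(dirty_partitions, leaves_per_range)
streamed_bytes = mismatched_leaves * partitions_per_leaf * avg_partition_bytes
print(f"~{streamed_bytes / 1024**3:.0f} GiB streamed to repair "
      f"~{dirty_partitions * avg_partition_bytes / 1024**2:.0f} MiB of new data")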
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> tested it on a test cluster before applying it on the production clusters),
>> but it is good to know that this might not always be the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> 4.x. I would appreciate if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also using
>> incremental repair pretty much works like on a cluster that is not using
>> incremental repair at all, the only difference being that the set of
>> repaired and unrepaired data is repaired separately, so the Merkle trees
>> that are calculated for repaired and unrepaired data are completely
>> separate.

Re: Switching to Incremental Repair

2024-02-15 Thread Bowen Song via user
The gc_grace_seconds setting, which defaults to 10 days, is the maximum
safe interval between repairs. How much data gets written during that
period of time? Will your nodes run out of disk space because of the new
data written during that time? If so, it sounds like your nodes are
dangerously close to running out of disk space, and you should address
that issue first before even considering upgrading Cassandra.
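
A trivial headroom check in the spirit of this advice (the write rate and
free-space figures are made up; substitute your own measurements):

gc_grace_days = 10        # default gc_grace_seconds expressed in days
write_gib_day = 150       # average GiB of new data per node per day
free_gib      = 2_000     # free disk space per node today

needed_gib = gc_grace_days * write_gib_day
print(f"need ~{needed_gib} GiB of headroom, have {free_gib} GiB:",
      "ok" if needed_gib < free_gib else "NOT ENOUGH")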


On 15/02/2024 18:49, Kristijonas Zalys wrote:

Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental 
repair on a cluster (e.g.: during a Cassandra major version upgrade)? 
My understanding is that if we simply stop running incremental repair, 
the cluster's nodes can, in the worst case, double in disk size as the 
repaired dataset will not get compacted with the unrepaired dataset. 
Similar to Sebastian, we have nodes where the disk usage is multiple 
TiBs so significant growth can be quite dangerous in our case. Would 
the only safe choice be to mark all SSTables as unrepaired before 
stopping regular incremental repair?


Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user 
 wrote:


The over-streaming is only problematic for the repaired SSTables, but it
can be triggered by inconsistencies within the unrepaired SSTables
during an incremental repair session. This is because although an
incremental repair will only compare the unrepaired SSTables, it
will stream both the unrepaired and repaired SSTables for the
inconsistent token ranges. Keep in mind that the source SSTables for
streaming are selected based on the token ranges, not the
repaired/unrepaired state.

Based on the above, I'm unsure whether running an incremental repair
before a full repair can fully avoid the over-streaming issue.

On 07/02/2024 22:41, Sebastian Marsching wrote:
> Thank you very much for your explanation.
>
> Streaming happens on the token range level, not the SSTable
level, right? So, when running an incremental repair before the
full repair, the problem that “some unrepaired SSTables are being
marked as repaired on one node but not on another” should not
exist any longer. Now this data should be marked as repaired on
all nodes.
>
> Thus, when repairing the SSTables that are marked as repaired,
this data should be included on all nodes when calculating the
Merkle trees and no overstreaming should happen.
>
> Of course, this means that running an incremental repair *first*
after marking SSTables as repaired and only running the full
repair *after* that is critical. I have to admit that previously I
wasn’t fully aware of how critical this step is.
>
>> On 07.02.2024 at 20:22, Bowen Song via user wrote:
>>
>> Unfortunately repair doesn't compare each partition
individually. Instead, it groups multiple partitions together and
calculates a hash of them, stores the hash in a leaf of a Merkle
tree, and then compares the Merkle trees between replicas during a
repair session. If any one of the partitions covered by a leaf is
inconsistent between replicas, the hash values in these leaves
will be different, and all partitions covered by the same leaf
will need to be streamed in full.
>>
>> Knowing that, and also knowing that your approach can create a
lot of inconsistencies in the repaired SSTables because some
unrepaired SSTables are being marked as repaired on one node but
not on another, you would then understand why over-streaming can
happen. The over-streaming is only problematic for the repaired
SSTables, because they are much bigger than the unrepaired ones.
>>
>>
>> On 07/02/2024 17:00, Sebastian Marsching wrote:
 Caution, using the method you described, the amount of data
streamed at the end with the full repair is not the amount of data
written between stopping the first node and the last node, but
depends on the table size, the number of partitions written, their
distribution in the ring and the 'repair_session_space' value. If
the table is large, the writes touch a large number of partitions
scattered across the token ring, and the value of
'repair_session_space' is small, you may end up with a very
expensive over-streaming.
>>> Thanks for the warning. In our case it worked well (obviously
we tested it on a test cluster before applying it on the
production clusters), but it is good to know that this might not
always be the case.
>>>
>>> Maybe I misunderstand how full and incremental repairs work in
C* 4.x. I would appreciate if you could clarify this for me.
>>>
>>> So far, I assumed that a full repair on a cluster that is also
using incremental repair pretty much works like on a cluster that
is not using incremental repair at all, the only difference being
that the set of repaired and unrepaired data is repaired
separately, so the Merkle trees that are calculated for repaired
and unrepaired data are completely separate.

Re: Switching to Incremental Repair

2024-02-15 Thread Kristijonas Zalys
Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental
repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
understanding is that if we simply stop running incremental repair, the
cluster's nodes can, in the worst case, double in disk size as the repaired
dataset will not get compacted with the unrepaired dataset. Similar to
Sebastian, we have nodes where the disk usage is multiple TiBs so
significant growth can be quite dangerous in our case. Would the only safe
choice be to mark all SSTables as unrepaired before stopping regular
incremental repair?

Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The over-streaming is only problematic for the repaired SSTables, but it
> can be triggered by inconsistencies within the unrepaired SSTables
> during an incremental repair session. This is because although an
> incremental repair will only compare the unrepaired SSTables, it
> will stream both the unrepaired and repaired SSTables for the
> inconsistent token ranges. Keep in mind that the source SSTables for
> streaming are selected based on the token ranges, not the
> repaired/unrepaired state.
>
> Based on the above, I'm unsure whether running an incremental repair
> before a full repair can fully avoid the over-streaming issue.
>
> On 07/02/2024 22:41, Sebastian Marsching wrote:
> > Thank you very much for your explanation.
> >
> > Streaming happens on the token range level, not the SSTable level,
> right? So, when running an incremental repair before the full repair, the
> problem that “some unrepaired SSTables are being marked as repaired on one
> node but not on another” should not exist any longer. Now this data should
> be marked as repaired on all nodes.
> >
> > Thus, when repairing the SSTables that are marked as repaired, this data
> should be included on all nodes when calculating the Merkle trees and no
> overstreaming should happen.
> >
> > Of course, this means that running an incremental repair *first* after
> marking SSTables as repaired and only running the full repair *after* that
> is critical. I have to admit that previously I wasn’t fully aware of how
> critical this step is.
> >
> >> On 07.02.2024 at 20:22, Bowen Song via user <
> user@cassandra.apache.org> wrote:
> >>
> >> Unfortunately repair doesn't compare each partition individually.
> Instead, it groups multiple partitions together and calculates a hash of
> them, stores the hash in a leaf of a Merkle tree, and then compares the
> Merkle trees between replicas during a repair session. If any one of the
> partitions covered by a leaf is inconsistent between replicas, the hash
> values in these leaves will be different, and all partitions covered by the
> same leaf will need to be streamed in full.
> >>
> >> Knowing that, and also knowing that your approach can create a lot of
> inconsistencies in the repaired SSTables because some unrepaired SSTables
> are being marked as repaired on one node but not on another, you would then
> understand why over-streaming can happen. The over-streaming is only
> problematic for the repaired SSTables, because they are much bigger than
> the unrepaired ones.
> >>
> >>
> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>  Caution, using the method you described, the amount of data streamed
> at the end with the full repair is not the amount of data written between
> stopping the first node and the last node, but depends on the table size,
> the number of partitions written, their distribution in the ring and the
> 'repair_session_space' value. If the table is large, the writes touch a
> large number of partitions scattered across the token ring, and the value
> of 'repair_session_space' is small, you may end up with a very expensive
> over-streaming.
> >>> Thanks for the warning. In our case it worked well (obviously we
> tested it on a test cluster before applying it on the production clusters),
> but it is good to know that this might not always be the case.
> >>>
> >>> Maybe I misunderstand how full and incremental repairs work in C* 4.x.
> I would appreciate if you could clarify this for me.
> >>>
> >>> So far, I assumed that a full repair on a cluster that is also using
> incremental repair pretty much works like on a cluster that is not using
> incremental repair at all, the only difference being that the set of
> repaired and unrepaired data is repaired separately, so the Merkle trees
> that are calculated for repaired and unrepaired data are completely
> separate.
> >>>
> >>> I also assumed that incremental repair only looks at unrepaired data,
> which is why it is so fast.
> >>>
> >>> Is either of these two assumptions wrong?
> >>>
> >>> If not, I do not quite understand how a lot of overstreaming might
> happen, as long as (I forgot to mention this step in my original e-mail) I
> run an incremental repair directly after restarting the nodes