Re: Switching to Incremental Repair

2024-02-15 Thread Chris Lohfink
I would recommend adding something to C* to be able to flip the repaired
state on all sstables quickly (with default OSS you can turn nodes off one at
a time and use sstablerepairedset). It's a life saver to be able to revert
to non-IR if the migration goes south. The same approach can be used to
quickly switch into IR sstables, with more caveats. Probably worth a jira to
add a faster solution.
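
For illustration, here is a rough sketch of that sstablerepairedset workflow in
Python. It assumes the node is already stopped, that the offline
sstablerepairedset tool is on the PATH with the --really-set/--is-unrepaired
flags described in the Cassandra docs (verify the exact flags against your
installed version), and the data directory and keyspace name are placeholders:

    #!/usr/bin/env python3
    # Flip the repaired flag on every SSTable of one (already stopped) node.
    # Run as the cassandra user so file ownership is preserved.
    import pathlib
    import subprocess

    DATA_DIR = pathlib.Path("/var/lib/cassandra/data")  # adjust to your layout
    KEYSPACE = "my_keyspace"                             # hypothetical keyspace

    # Collect the Data.db components; sstablerepairedset accepts sstable paths
    # as arguments (or a file list via -f, depending on the version).
    sstables = [str(p) for p in (DATA_DIR / KEYSPACE).glob("*/*-Data.db")]

    if sstables:
        # Use --is-repaired instead to flip the state the other way.
        subprocess.run(
            ["sstablerepairedset", "--really-set", "--is-unrepaired", *sstables],
            check=True,
        )
    print(f"Updated {len(sstables)} sstables; start the node again afterwards.")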

On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys  wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size as the repaired
> dataset will not get compacted with the unrepaired dataset. Similar to
> Sebastian, we have nodes where the disk usage is multiple TiBs so
> significant growth can be quite dangerous in our case. Would the only safe
> choice be to mark all SSTables as unrepaired before stopping regular
> incremental repair?
>
> Thanks,
> Kristijonas
>
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but it
>> can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, it
>> will stream both the unrepaired and repaired SSTables for the
>> inconsistent token ranges. Keep in mind that the source SSTables for
>> streaming are selected based on the token ranges, not the
>> repaired/unrepaired state.
>>
>> Based on the above, I'm unsure whether running an incremental repair before a
>> full repair can fully avoid the over-streaming issue.
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> right? So, when running an incremental repair before the full repair, the
>> problem that “some unrepaired SSTables are being marked as repaired on one
>> node but not on another” should not exist any longer. Now this data should
>> be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> data should be included on all nodes when calculating the Merkle trees and
>> no overstreaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first* after
>> marking SSTables as repaired and only running the full repair *after* that
>> is critical. I have to admit that previously I wasn’t fully aware of how
>> critical this step is.
>> >
>> >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user <
>> user@cassandra.apache.org>:
>> >>
>> >> Unfortunately repair doesn't compare each partition individually.
>> Instead, it groups multiple partitions together and calculates a hash of
>> them, stores the hash in a leaf of a merkle tree, and then compares the
>> merkle trees between replicas during a repair session. If any one of the
>> partitions covered by a leaf is inconsistent between replicas, the hash
>> values in these leaves will be different, and all partitions covered by the
>> same leaf will need to be streamed in full.
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot of
>> inconsistencies in the repaired SSTables because some unrepaired SSTables
>> are being marked as repaired on one node but not on another, you would then
>> understand why over-streaming can happen. The over-streaming is only
>> problematic for the repaired SSTables, because they are much bigger than
>> the unrepaired.
>> >>
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>  Caution, using the method you described, the amount of data streamed
>> at the end with the full repair is not the amount of data written between
>> stopping the first node and the last node, but depends on the table size,
>> the number of partitions written, their distribution in the ring and the
>> 'repair_session_space' value. If the table is large, the writes touch a
>> large number of partitions scattered across the token ring, and the value
>> of 'repair_session_space' is small, you may end up with a very expensive
>> over-streaming.
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> tested it on a test cluster before applying it on the production clusters),
>> but it is good to know that this might not always be the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> 4.x. I would appreciate if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also using
>> incremental repair pretty much works like on a cluster that is not using
>> incremental repair at all, the only difference being that the set 

Re: Switching to Incremental Repair

2024-02-15 Thread Bowen Song via user
The gc_grace_seconds value, which defaults to 10 days, is the maximum safe 
interval between repairs. How much data gets written during that period 
of time? Will your nodes run out of disk space because of the new data 
written during that time? If so, it sounds like your nodes are 
dangerously close to running out of disk space, and you should address 
that issue first before even considering upgrading Cassandra.
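
As a back-of-envelope check (illustrative numbers only, not measurements from
this thread), the question boils down to whether the data written during one
gc_grace_seconds interval fits in the remaining disk space:

    # Rough estimate: data written during gc_grace_seconds vs. free disk space.
    gc_grace_seconds = 10 * 24 * 3600      # default: 10 days
    write_rate_mib_per_s = 0.5             # hypothetical per-node ingest rate
    free_disk_gib = 500                    # hypothetical free space per node

    written_gib = write_rate_mib_per_s * gc_grace_seconds / 1024
    print(f"~{written_gib:.0f} GiB written per node during gc_grace, "
          f"{free_disk_gib} GiB free")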


On 15/02/2024 18:49, Kristijonas Zalys wrote:

Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental 
repair on a cluster (e.g.: during a Cassandra major version upgrade)? 
My understanding is that if we simply stop running incremental repair, 
the cluster's nodes can, in the worst case, double in disk size as the 
repaired dataset will not get compacted with the unrepaired dataset. 
Similar to Sebastian, we have nodes where the disk usage is multiple 
TiBs so significant growth can be quite dangerous in our case. Would 
the only safe choice be to mark all SSTables as unrepaired before 
stopping regular incremental repair?


Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user 
 wrote:


The over-streaming is only problematic for the repaired SSTables,
but it
can be triggered by inconsistencies within the unrepaired SSTables
during an incremental repair session. This is because although an
incremental repair will only compare the unrepaired SSTables, it
will stream both the unrepaired and repaired SSTables for the
inconsistent token ranges. Keep in mind that the source SSTables for
streaming are selected based on the token ranges, not the
repaired/unrepaired state.

Based on the above, I'm unsure whether running an incremental repair before a
full repair can fully avoid the over-streaming issue.

On 07/02/2024 22:41, Sebastian Marsching wrote:
> Thank you very much for your explanation.
>
> Streaming happens on the token range level, not the SSTable
level, right? So, when running an incremental repair before the
full repair, the problem that “some unrepaired SSTables are being
marked as repaired on one node but not on another” should not
exist any longer. Now this data should be marked as repaired on
all nodes.
>
> Thus, when repairing the SSTables that are marked as repaired,
this data should be included on all nodes when calculating the
Merkle trees and no overstreaming should happen.
>
> Of course, this means that running an incremental repair *first*
after marking SSTables as repaired and only running the full
repair *after* that is critical. I have to admit that previously I
wasn’t fully aware of how critical this step is.
>
>> Am 07.02.2024 um 20:22 schrieb Bowen Song via user
:
>>
>> Unfortunately repair doesn't compare each partition
individually. Instead, it groups multiple partitions together and
calculates a hash of them, stores the hash in a leaf of a merkle
tree, and then compares the merkle trees between replicas during a
repair session. If any one of the partitions covered by a leaf is
inconsistent between replicas, the hash values in these leaves
will be different, and all partitions covered by the same leaf
will need to be streamed in full.
>>
>> Knowing that, and also knowing that your approach can create a
lot of inconsistencies in the repaired SSTables because some
unrepaired SSTables are being marked as repaired on one node but
not on another, you would then understand why over-streaming can
happen. The over-streaming is only problematic for the repaired
SSTables, because they are much bigger than the unrepaired.
>>
>>
>> On 07/02/2024 17:00, Sebastian Marsching wrote:
 Caution, using the method you described, the amount of data
streamed at the end with the full repair is not the amount of data
written between stopping the first node and the last node, but
depends on the table size, the number of partitions written, their
distribution in the ring and the 'repair_session_space' value. If
the table is large, the writes touch a large number of partitions
scattered across the token ring, and the value of
'repair_session_space' is small, you may end up with a very
expensive over-streaming.
>>> Thanks for the warning. In our case it worked well (obviously
we tested it on a test cluster before applying it on the
production clusters), but it is good to know that this might not
always be the case.
>>>
>>> Maybe I misunderstand how full and incremental repairs work in
C* 4.x. I would appreciate if you could clarify this for me.
>>>
>>> So far, I assumed that a full repair on a cluster that is also
using incremental repair pretty much works like on a cluster that
is not using incremental repair at all, the only difference 

Re: Switching to Incremental Repair

2024-02-15 Thread Kristijonas Zalys
Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental
repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
understanding is that if we simply stop running incremental repair, the
cluster's nodes can, in the worst case, double in disk size as the repaired
dataset will not get compacted with the unrepaired dataset. Similar to
Sebastian, we have nodes where the disk usage is multiple TiBs so
significant growth can be quite dangerous in our case. Would the only safe
choice be to mark all SSTables as unrepaired before stopping regular
incremental repair?

Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The over-streaming is only problematic for the repaired SSTables, but it
> can be triggered by inconsistencies within the unrepaired SSTables
> during an incremental repair session. This is because although an
> incremental repair will only compare the unrepaired SSTables, it
> will stream both the unrepaired and repaired SSTables for the
> inconsistent token ranges. Keep in mind that the source SSTables for
> streaming are selected based on the token ranges, not the
> repaired/unrepaired state.
>
> Based on the above, I'm unsure whether running an incremental repair before a
> full repair can fully avoid the over-streaming issue.
>
> On 07/02/2024 22:41, Sebastian Marsching wrote:
> > Thank you very much for your explanation.
> >
> > Streaming happens on the token range level, not the SSTable level,
> right? So, when running an incremental repair before the full repair, the
> problem that “some unrepaired SSTables are being marked as repaired on one
> node but not on another” should not exist any longer. Now this data should
> be marked as repaired on all nodes.
> >
> > Thus, when repairing the SSTables that are marked as repaired, this data
> should be included on all nodes when calculating the Merkle trees and no
> overstreaming should happen.
> >
> > Of course, this means that running an incremental repair *first* after
> marking SSTables as repaired and only running the full repair *after* that
> is critical. I have to admit that previously I wasn’t fully aware of how
> critical this step is.
> >
> >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user <
> user@cassandra.apache.org>:
> >>
> >> Unfortunately repair doesn't compare each partition individually.
> Instead, it groups multiple partitions together and calculates a hash of
> them, stores the hash in a leaf of a merkle tree, and then compares the
> merkle trees between replicas during a repair session. If any one of the
> partitions covered by a leaf is inconsistent between replicas, the hash
> values in these leaves will be different, and all partitions covered by the
> same leaf will need to be streamed in full.
> >>
> >> Knowing that, and also knowing that your approach can create a lot of
> inconsistencies in the repaired SSTables because some unrepaired SSTables
> are being marked as repaired on one node but not on another, you would then
> understand why over-streaming can happen. The over-streaming is only
> problematic for the repaired SSTables, because they are much bigger than
> the unrepaired.
> >>
> >>
> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>  Caution, using the method you described, the amount of data streamed
> at the end with the full repair is not the amount of data written between
> stopping the first node and the last node, but depends on the table size,
> the number of partitions written, their distribution in the ring and the
> 'repair_session_space' value. If the table is large, the writes touch a
> large number of partitions scattered across the token ring, and the value
> of 'repair_session_space' is small, you may end up with a very expensive
> over-streaming.
> >>> Thanks for the warning. In our case it worked well (obviously we
> tested it on a test cluster before applying it on the production clusters),
> but it is good to know that this might not always be the case.
> >>>
> >>> Maybe I misunderstand how full and incremental repairs work in C* 4.x.
> I would appreciate if you could clarify this for me.
> >>>
> >>> So far, I assumed that a full repair on a cluster that is also using
> incremental repair pretty much works like on a cluster that is not using
> incremental repair at all, the only difference being that the set of
> repaired and unrepaired data is repaired separately, so the Merkle trees
> that are calculated for repaired and unrepaired data are completely
> separate.
> >>>
> >>> I also assumed that incremental repair only looks at unrepaired data,
> which is why it is so fast.
> >>>
> >>> Is either of these two assumptions wrong?
> >>>
> >>> If not, I do not quite understand how a lot of overstreaming might
> happen, as long as (I forgot to mention this step in my original e-mail) I
> run an incremental repair directly after restarting the nodes 

Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
The over-streaming is only problematic for the repaired SSTables, but it 
can be triggered by inconsistencies within the unrepaired SSTables 
during an incremental repair session. This is because although an 
incremental repair will only compare the unrepaired SSTables, it 
will stream both the unrepaired and repaired SSTables for the 
inconsistent token ranges. Keep in mind that the source SSTables for 
streaming are selected based on the token ranges, not the 
repaired/unrepaired state.


Based on the above, I'm unsure whether running an incremental repair before a 
full repair can fully avoid the over-streaming issue.


On 07/02/2024 22:41, Sebastian Marsching wrote:

Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.


Am 07.02.2024 um 20:22 schrieb Bowen Song via user :

Unfortunately repair doesn't compare each partition individually. Instead, it 
groups multiple partitions together and calculates a hash of them, stores the 
hash in a leaf of a merkle tree, and then compares the merkle trees between 
replicas during a repair session. If any one of the partitions covered by a 
leaf is inconsistent between replicas, the hash values in these leaves will be 
different, and all partitions covered by the same leaf will need to be streamed 
in full.

Knowing that, and also knowing that your approach can create a lot of 
inconsistencies in the repaired SSTables because some unrepaired SSTables are 
being marked as repaired on one node but not on another, you would then 
understand why over-streaming can happen. The over-streaming is only 
problematic for the repaired SSTables, because they are much bigger than the 
unrepaired.


On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
and unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking any data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.



Re: Switching to Incremental Repair

2024-02-07 Thread Sebastian Marsching

Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.

> Am 07.02.2024 um 20:22 schrieb Bowen Song via user 
> :
>
> Unfortunately repair doesn't compare each partition individually. Instead, it 
> groups multiple partitions together and calculates a hash of them, stores the 
> hash in a leaf of a merkle tree, and then compares the merkle trees between 
> replicas during a repair session. If any one of the partitions covered by a 
> leaf is inconsistent between replicas, the hash values in these leaves will 
> be different, and all partitions covered by the same leaf will need to be 
> streamed in full.
>
> Knowing that, and also knowing that your approach can create a lot of 
> inconsistencies in the repaired SSTables because some unrepaired SSTables are 
> being marked as repaired on one node but not on another, you would then 
> understand why over-streaming can happen. The over-streaming is only 
> problematic for the repaired SSTables, because they are much bigger than the 
> unrepaired.
>
>
> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>> Caution, using the method you described, the amount of data streamed at the 
>>> end with the full repair is not the amount of data written between stopping 
>>> the first node and the last node, but depends on the table size, the number 
>>> of partitions written, their distribution in the ring and the 
>>> 'repair_session_space' value. If the table is large, the writes touch a 
>>> large number of partitions scattered across the token ring, and the value 
>>> of 'repair_session_space' is small, you may end up with a very expensive 
>>> over-streaming.
>> Thanks for the warning. In our case it worked well (obviously we tested it 
>> on a test cluster before applying it on the production clusters), but it is 
>> good to know that this might not always be the case.
>>
>> Maybe I misunderstand how full and incremental repairs work in C* 4.x. I 
>> would appreciate if you could clarify this for me.
>>
>> So far, I assumed that a full repair on a cluster that is also using 
>> incremental repair pretty much works like on a cluster that is not using 
>> incremental repair at all, the only difference being that the set of 
>> repaired and unrepaired data is repaired separately, so the Merkle trees 
>> that are calculated for repaired and unrepaired data are completely separate.
>>
>> I also assumed that incremental repair only looks at unrepaired data, which 
>> is why it is so fast.
>>
>> Is either of these two assumptions wrong?
>>
>> If not, I do not quite understand how a lot of overstreaming might happen, 
>> as long as (I forgot to mention this step in my original e-mail) I run an 
>> incremental repair directly after restarting the nodes and marking all data 
>> as repaired.
>>
>> I understand that significant overstreaming might happen during this first 
>> repair (in the worst case streaming all the unrepaired data that a node 
>> stores), but due to the short amount of time between starting to mark data 
>> as repaired and running the incremental repair, the whole set of unrepaired 
>> data should be rather small, so this overstreaming should not cause any 
>> issues.
>>
>> From this point on, the unrepaired data on the different nodes should be in 
>> sync and discrepancies in the repaired data during the full repair should 
>> not be bigger than they had been if I had run a full repair without marking 
>> any data as repaired.
>>
>> I would really appreciate if you could point out the hole in this reasoning. 
>> Maybe I have a fundamentally wrong understanding of the repair process, and 
>> if I do I would like to correct this.
>>
>





Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Unfortunately repair doesn't compare each partition individually. 
Instead, it groups multiple partitions together and calculates a hash of 
them, stores the hash in a leaf of a merkle tree, and then compares the 
merkle trees between replicas during a repair session. If any one of the 
partitions covered by a leaf is inconsistent between replicas, the hash 
values in these leaves will be different, and all partitions covered by 
the same leaf will need to be streamed in full.
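
A toy illustration of that point (this is not Cassandra's actual code, just a
sketch of the idea): each Merkle leaf stores a single hash over all the
partitions it covers, so one mismatched partition makes the whole leaf, and
therefore all of its partitions, get streamed.

    import hashlib

    def leaf_hash(partitions: dict) -> str:
        # One hash over every partition covered by this leaf.
        h = hashlib.sha256()
        for key in sorted(partitions):
            h.update(key.encode())
            h.update(partitions[key])
        return h.hexdigest()

    replica_a = {"p1": b"v1", "p2": b"v2", "p3": b"v3"}
    replica_b = {"p1": b"v1", "p2": b"DIFFERENT", "p3": b"v3"}

    if leaf_hash(replica_a) != leaf_hash(replica_b):
        # Only p2 differs, yet p1, p2 and p3 are all streamed for this leaf.
        print("leaf mismatch -> stream every partition under this leaf")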


Knowing that, and also knowing that your approach can create a lot of 
inconsistencies in the repaired SSTables because some unrepaired 
SSTables are being marked as repaired on one node but not on another, 
you would then understand why over-streaming can happen. The 
over-streaming is only problematic for the repaired SSTables, because 
they are much bigger than the unrepaired.



On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
and unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking any data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.



Re: Switching to Incremental Repair

2024-02-07 Thread Sebastian Marsching

> Caution, using the method you described, the amount of data streamed at the 
> end with the full repair is not the amount of data written between stopping 
> the first node and the last node, but depends on the table size, the number 
> of partitions written, their distribution in the ring and the 
> 'repair_session_space' value. If the table is large, the writes touch a large 
> number of partitions scattered across the token ring, and the value of 
> 'repair_session_space' is small, you may end up with a very expensive 
> over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
and unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking any data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.





Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Caution, using the method you described, the amount of data streamed at 
the end with the full repair is not the amount of data written between 
stopping the first node and the last node, but depends on the table 
size, the number of partitions written, their distribution in the ring 
and the 'repair_session_space' value. If the table is large, the writes 
touch a large number of partitions scattered across the token ring, and 
the value of 'repair_session_space' is small, you may end up with a very 
expensive over-streaming.


On 07/02/2024 12:33, Sebastian Marsching wrote:
Full repair running for an entire week sounds excessively long. Even 
if you've got 1 TB of data per node, 1 week means the repair speed is 
less than 2 MB/s, which is very slow. Perhaps you should focus on 
finding the bottleneck of the full repair speed and work on that instead.


We store about 3–3.5 TB per node on spinning disks (time-series data), 
so I don’t think it is too surprising.


Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state 
is set on them, which leads to mismatch in the repaired data between 
nodes, and potentially very expensive over-streaming in a future full 
repair. You should follow the documented and tested steps and not 
improvise or get creative if you value your data and time.


There is a different method that we successfully used on three 
clusters, but I agree that anti-entropy repair is a tricky business 
and one should be cautious with trying less tested methods.


Due to the long time for a full repair (see my earlier explanation), 
disabling anticompaction while running the full repair wasn’t an 
option for us. It was previously suggested that one could run the 
repair per node instead of the full cluster, but I don’t think that 
this will work, because only marking the SSTables on a single node as 
repaired would lead to massive overstreaming when running the full 
repair for the next node that shares data with the first one.


So, I want to describe the method that we used, just in case someone 
is in the same situation:


Going around the ring, we temporarily stopped each node and marked all 
of its SSTables as repaired. Then we immediately ran a full repair, so 
that any inconsistencies in the data that was now marked as repaired 
but not actually repaired were fixed.


Using this approach, the amount of over-streaming is minimal (at 
least for not too large clusters, where the rolling restart can be 
done in less than an hour or so), because the only difference between 
the “unrepaired” SSTables on the different nodes will be the data that 
was written before stopping the first node and stopping the last node.


Any inconsistencies that might exist in the SSTables that were marked 
as repaired should be caught in the full repair, so I do not think it 
is too dangerous either. However, I agree that for clusters where a 
full repair is quick (e.g. finishes in a few hours), using the 
well-tested and frequently used approach is probably better.


Re: Switching to Incremental Repair

2024-02-07 Thread Sebastian Marsching

> That's a feature we need to implement in Reaper. I think disallowing the 
> start of the new incremental repair would be easier to manage than pausing 
> the full repair that's already running. It's also what I think I'd expect as 
> a user.
>
> I'll create an issue to track this.

Thank you, Alexander, that’s great!

I was considering the other approach (pausing the full repair in order to be 
able to start the incremental repair) because this is what I have been doing 
manually in the past few days. Due to full repairs taking a lot of time for us 
(see my other e-mail), I didn’t want too much unrepaired data to accumulate 
over time.

However, I guess that this is a niche use case, and in most cases inhibiting 
the incremental repair is the correct and expected approach, so I wouldn’t 
expect such a feature in Cassandra Reaper.

For our use case, I am considering abandoning the scheduling feature of Reaper 
and instead writing a simple script that schedules repairs through Reaper’s 
API. This will also give us an easier way of staggering different repair jobs 
instead of having to rely on choosing the start time correctly in order to get 
the desired effect. Doing all this in a custom script is probably much, much 
easier than trying to implement it as a generic, user-configurable feature in 
Reaper.
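
As a hedged sketch of what such a script could look like: the endpoint names
and parameters below are assumptions based on Reaper's REST API documentation
and must be checked against the Reaper version in use; the host, cluster and
keyspace names are hypothetical.

    import requests

    REAPER = "http://reaper.example.com:8080"   # hypothetical Reaper host

    def start_repair(keyspace: str, incremental: bool) -> str:
        # Create a repair run, then set it to RUNNING.
        resp = requests.post(f"{REAPER}/repair_run", params={
            "clusterName": "prod-cluster",
            "keyspace": keyspace,
            "owner": "ops-script",
            "incrementalRepair": str(incremental).lower(),
        })
        resp.raise_for_status()
        run_id = resp.json()["id"]
        requests.put(f"{REAPER}/repair_run/{run_id}/state/RUNNING").raise_for_status()
        return str(run_id)

    # The wrapper's staggering logic would decide when to call this, e.g. only
    # start the daily incremental when no full repair run is still in progress.
    print(start_repair("my_keyspace", incremental=True))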





Re: Switching to Incremental Repair

2024-02-07 Thread Sebastian Marsching
> Full repair running for an entire week sounds excessively long. Even if 
> you've got 1 TB of data per node, 1 week means the repair speed is less than 
> 2 MB/s, which is very slow. Perhaps you should focus on finding the bottleneck 
> of the full repair speed and work on that instead.

We store about 3–3.5 TB per node on spinning disks (time-series data), so I 
don’t think it is too surprising.
> Not disabling auto-compaction may result in repaired SSTables getting 
> compacted together with unrepaired SSTables before the repair state is set on 
> them, which leads to mismatch in the repaired data between nodes, and 
> potentially very expensive over-streaming in a future full repair. You should 
> follow the documented and tested steps and not improvise or get creative 
> if you value your data and time.
> 
There is a different method that we successfully used on three clusters, but I 
agree that anti-entropy repair is a tricky business and one should be cautious 
with trying less tested methods.

Due to the long time for a full repair (see my earlier explanation), disabling 
anticompaction while running the full repair wasn’t an option for us. It was 
previously suggested that one could run the repair per node instead of the full 
cluster, but I don’t think that this will work, because only marking the 
SSTables on a single node as repaired would lead to massive overstreaming when 
running the full repair for the next node that shares data with the first one.

So, I want to describe the method that we used, just in case someone is in the 
same situation:

Going around the ring, we temporarily stopped each node and marked all of its 
SSTables as repaired. Then we immediately ran a full repair, so that any 
inconsistencies in the data that was now marked as repaired but not actually 
repaired were fixed.
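
A rough orchestration sketch of that procedure, under these assumptions:
passwordless SSH to every node, systemd-managed Cassandra, and a hypothetical
helper script (mark_repaired.py) on each host that runs sstablerepairedset
over the node's SSTables while it is down. This is not a documented migration
path, so test it on a non-production cluster first.

    import subprocess

    NODES = ["cass-01", "cass-02", "cass-03"]   # hypothetical host names

    def ssh(host: str, command: str) -> None:
        subprocess.run(["ssh", host, command], check=True)

    # Walk the ring: stop each node, mark its SSTables repaired, start it again.
    for node in NODES:
        ssh(node, "sudo systemctl stop cassandra")
        ssh(node, "sudo -u cassandra python3 /opt/scripts/mark_repaired.py")
        ssh(node, "sudo systemctl start cassandra")

    # Immediately afterwards, run a full repair so that any inconsistencies now
    # hidden in the "repaired" set are actually fixed.
    ssh(NODES[0], "nodetool repair --full")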

Using this approach, the amount of over-streaming is minimal (at least for 
not too large clusters, where the rolling restart can be done in less than an 
hour or so), because the only difference between the “unrepaired” SSTables on 
the different nodes will be the data that was written before stopping the first 
node and stopping the last node.

Any inconsistencies that might exist in the SSTables that were marked as 
repaired should be caught in the full repair, so I do not think it is too 
dangerous either. However, I agree that for clusters where a full repair is 
quick (e.g. finishes in a few hours), using the well-tested and frequently used 
approach is probably better.



Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Just one more thing. Make sure you run 'nodetool repair -full' instead 
of just 'nodetool repair'. That's because the command's default was 
changed in Cassandra 2.x. The default was full repair before that 
change; the new default is incremental repair.


On 07/02/2024 10:28, Bowen Song via user wrote:


Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state is 
set on them, which leads to mismatch in the repaired data between 
nodes, and potentially very expensive over-streaming in a future full 
repair. You should follow the documented and tested steps and not 
improvise or get creative if you value your data and time.


On 06/02/2024 23:55, Kristijonas Zalys wrote:


Hi folks,


Thank you all for your insight, this has been very helpful.


I was going through the migration process here 
and 
I’m not entirely sure why disabling autocompaction on the node is 
required? Could anyone clarify what would be the side effects of not 
disabling autocompaction and starting with step 2 of the migration?



Thanks,

Kristijonas



On Sun, Feb 4, 2024 at 12:18 AM Alexander DEJANOVSKI 
 wrote:


Hi Sebastian,

That's a feature we need to implement in Reaper. I think
disallowing the start of the new incremental repair would be
easier to manage than pausing the full repair that's already
running. It's also what I think I'd expect as a user.

I'll create an issue to track this.

Le sam. 3 févr. 2024, 16:19, Sebastian Marsching
 a écrit :

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to
take care of that for you. You will still need monitoring and
alerting to ensure the repairs are run successfully, but fixing
a stuck or failed repair is not very time sensitive, you can
usually leave it till Monday morning if it happens at Friday
night.


Does anyone know how such a schedule can be created in
Cassandra Reaper?

I recently learned the hard way that running both a full and
an incremental repair for the same keyspace and table in
parallel is not a good idea (it caused a very unpleasant
overload situation on one of our clusters).

At the moment, we have one schedule for the full repairs
(every 90 days) and another schedule for the incremental
repairs (daily). But as full repairs take much longer than a
day (about a week, in our case), the two schedules collide.
So, Cassandra Reaper starts an incremental repair while the
full repair is still in process.

Does anyone know how to avoid this? Optimally, the full
repair would be paused (no new segments started) for the
duration of the incremental repair. The second best option
would be inhibiting the incremental repair while a full
repair is in progress.

Best regards,
Sebastian


Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state is 
set on them, which leads to mismatch in the repaired data between nodes, 
and potentially very expensive over-streaming in a future full repair. 
You should follow the documented and tested steps and not improvise or 
get creative if you value your data and time.
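
To make the compaction point concrete, here is a minimal per-node sketch of
where disabling auto-compaction fits, assuming the usual nodetool commands;
the full step ordering should be taken from the documented migration
procedure rather than from this sketch.

    import subprocess

    def nodetool(*args: str) -> None:
        subprocess.run(["nodetool", *args], check=True)

    # Keep repaired and unrepaired SSTables from being compacted together
    # while the repaired state is being set.
    nodetool("disableautocompaction")
    # ... run the migration's repair / marking steps here ...
    nodetool("enableautocompaction")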


On 06/02/2024 23:55, Kristijonas Zalys wrote:


Hi folks,


Thank you all for your insight, this has been very helpful.


I was going through the migration process here 
and 
I’m not entirely sure why disabling autocompaction on the node is 
required? Could anyone clarify what would be the side effects of not 
disabling autocompaction and starting with step 2 of the migration?



Thanks,

Kristijonas



On Sun, Feb 4, 2024 at 12:18 AM Alexander DEJANOVSKI 
 wrote:


Hi Sebastian,

That's a feature we need to implement in Reaper. I think
disallowing the start of the new incremental repair would be
easier to manage than pausing the full repair that's already
running. It's also what I think I'd expect as a user.

I'll create an issue to track this.

Le sam. 3 févr. 2024, 16:19, Sebastian Marsching
 a écrit :

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to
take care of that for you. You will still need monitoring and
alerting to ensure the repairs are run successfully, but fixing
a stuck or failed repair is not very time sensitive, you can
usually leave it till Monday morning if it happens at Friday
night.


Does anyone know how such a schedule can be created in
Cassandra Reaper?

I recently learned the hard way that running both a full and
an incremental repair for the same keyspace and table in
parallel is not a good idea (it caused a very unpleasant
overload situation on one of our clusters).

At the moment, we have one schedule for the full repairs
(every 90 days) and another schedule for the incremental
repairs (daily). But as full repairs take much longer than a
day (about a week, in our case), the two schedules collide.
So, Cassandra Reaper starts an incremental repair while the
full repair is still in process.

Does anyone know how to avoid this? Optimally, the full repair
would be paused (no new segments started) for the duration of
the incremental repair. The second best option would be
inhibiting the incremental repair while a full repair is in
progress.

Best regards,
Sebastian


Re: Switching to Incremental Repair

2024-02-06 Thread Kristijonas Zalys
Hi folks,

Thank you all for your insight, this has been very helpful.

I was going through the migration process here

and I’m not entirely sure why disabling autocompaction on the node is
required? Could anyone clarify what would be the side effects of not
disabling autocompaction and starting with step 2 of the migration?

Thanks,

Kristijonas


On Sun, Feb 4, 2024 at 12:18 AM Alexander DEJANOVSKI 
wrote:

> Hi Sebastian,
>
> That's a feature we need to implement in Reaper. I think disallowing the
> start of the new incremental repair would be easier to manage than pausing
> the full repair that's already running. It's also what I think I'd expect
> as a user.
>
> I'll create an issue to track this.
>
> Le sam. 3 févr. 2024, 16:19, Sebastian Marsching 
> a écrit :
>
>> Hi,
>>
>> 2. use an orchestration tool, such as Cassandra Reaper, to take care of
>> that for you. You will still need monitoring and alerting to ensure the repairs
>> are run successfully, but fixing a stuck or failed repair is not very time
>> sensitive, you can usually leave it till Monday morning if it happens at
>> Friday night.
>>
>> Does anyone know how such a schedule can be created in Cassandra Reaper?
>>
>> I recently learned the hard way that running both a full and an
>> incremental repair for the same keyspace and table in parallel is not a
>> good idea (it caused a very unpleasant overload situation on one of our
>> clusters).
>>
>> At the moment, we have one schedule for the full repairs (every 90 days)
>> and another schedule for the incremental repairs (daily). But as full
>> repairs take much longer than a day (about a week, in our case), the two
>> schedules collide. So, Cassandra Reaper starts an incremental repair while
>> the full repair is still in process.
>>
>> Does anyone know how to avoid this? Optimally, the full repair would be
>> paused (no new segments started) for the duration of the incremental
>> repair. The second best option would be inhibiting the incremental repair
>> while a full repair is in progress.
>>
>> Best regards,
>> Sebastian
>>
>>


Re: Switching to Incremental Repair

2024-02-04 Thread Alexander DEJANOVSKI
Hi Sebastian,

That's a feature we need to implement in Reaper. I think disallowing the
start of the new incremental repair would be easier to manage than pausing
the full repair that's already running. It's also what I think I'd expect
as a user.

I'll create an issue to track this.

Le sam. 3 févr. 2024, 16:19, Sebastian Marsching 
a écrit :

> Hi,
>
> 2. use an orchestration tool, such as Cassandra Reaper, to take care of
> that for you. You will still need monitoring and alerting to ensure the repairs
> are run successfully, but fixing a stuck or failed repair is not very time
> sensitive, you can usually leave it till Monday morning if it happens at
> Friday night.
>
> Does anyone know how such a schedule can be created in Cassandra Reaper?
>
> I recently learned the hard way that running both a full and an
> incremental repair for the same keyspace and table in parallel is not a
> good idea (it caused a very unpleasant overload situation on one of our
> clusters).
>
> At the moment, we have one schedule for the full repairs (every 90 days)
> and another schedule for the incremental repairs (daily). But as full
> repairs take much longer than a day (about a week, in our case), the two
> schedules collide. So, Cassandra Reaper starts an incremental repair while
> the full repair is still in process.
>
> Does anyone know how to avoid this? Optimally, the full repair would be
> paused (no new segments started) for the duration of the incremental
> repair. The second best option would be inhibiting the incremental repair
> while a full repair is in progress.
>
> Best regards,
> Sebastian
>
>


Re: Switching to Incremental Repair

2024-02-03 Thread Bowen Song via user
Full repair running for an entire week sounds excessively long. Even if 
you've got 1 TB of data per node, 1 week means the repair speed is less 
than 2 MB/s, which is very slow. Perhaps you should focus on finding the 
bottleneck of the full repair speed and work on that instead.
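
For reference, the arithmetic behind that figure:

    # 1 TiB repaired in one week works out to roughly 1.7 MiB/s.
    bytes_per_tib = 1024 ** 4
    seconds_per_week = 7 * 24 * 3600
    print(f"{bytes_per_tib / seconds_per_week / 1024**2:.1f} MiB/s")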



On 03/02/2024 16:18, Sebastian Marsching wrote:

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to take care 
of that for you. You will still need monitoring and alerting to ensure the 
repairs are run successfully, but fixing a stuck or failed repair is 
not very time sensitive, you can usually leave it till Monday morning 
if it happens at Friday night.



Does anyone know how such a schedule can be created in Cassandra Reaper?

I recently learned the hard way that running both a full and an 
incremental repair for the same keyspace and table in parallel is not 
a good idea (it caused a very unpleasant overload situation on one of 
our clusters).


At the moment, we have one schedule for the full repairs (every 90 
days) and another schedule for the incremental repairs (daily). But as 
full repairs take much longer than a day (about a week, in our case), 
the two schedules collide. So, Cassandra Reaper starts an incremental 
repair while the full repair is still in process.


Does anyone know how to avoid this? Optimally, the full repair would 
be paused (no new segments started) for the duration of the 
incremental repair. The second best option would be inhibiting the 
incremental repair while a full repair is in progress.


Best regards,
Sebastian



Re: Switching to Incremental Repair

2024-02-03 Thread Sebastian Marsching
Hi,
> 2. use an orchestration tool, such as Cassandra Reaper, to take care of that 
> for you. You will still need monitoring and alerting to ensure the repairs are run 
> successfully, but fixing a stuck or failed repair is not very time sensitive, 
> you can usually leave it till Monday morning if it happens at Friday night.
> 
Does anyone know how such a schedule can be created in Cassandra Reaper?

I recently learned the hard way that running both a full and an incremental 
repair for the same keyspace and table in parallel is not a good idea (it 
caused a very unpleasant overload situation on one of our clusters).

At the moment, we have one schedule for the full repairs (every 90 days) and 
another schedule for the incremental repairs (daily). But as full repairs take 
much longer than a day (about a week, in our case), the two schedules collide. 
So, Cassandra Reaper starts an incremental repair while the full repair is 
still in process.

Does anyone know how to avoid this? Optimally, the full repair would be paused 
(no new segments started) for the duration of the incremental repair. The 
second best option would be inhibiting the incremental repair while a full 
repair is in progress.

Best regards,
Sebastian





Re: Switching to Incremental Repair

2024-02-03 Thread Bowen Song via user

Hi Kristijonas,

It is not possible to run two repairs, regardless of whether they are 
incremental or full, for the same token range and on the same table 
concurrently. You have two options:


1. create schedules that don't overlap, e.g. run incremental repair 
daily except the 1st of each month, and run full repair on the 1st of 
each month (see the sketch at the end of this message). If you choose 
to do this, make sure you set up a monitoring and alerting system for 
it and have someone respond to the alerts on weekends or public 
holidays. If a repair takes longer than usual and is at risk of 
overlapping with the next repair, timely human intervention is required 
to prevent that - either kill the currently running repair or skip the 
next one.


2. use an orchestration tool, such as Cassandra Reaper, to take care of 
that for you. You will still need monitoring and alerting to ensure the 
repairs are run successfully, but fixing a stuck or failed repair is not 
very time sensitive, you can usually leave it till Monday morning if it 
happens at Friday night.


Personally I would recommend the 2nd option, because getting back to 
your laptop at 10 pm on Friday night after you have had a few beers is 
not fun.
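
For option 1, a small illustrative sketch of the decision logic as one
cron-style wrapper per cluster; the command lines are simplified, and a real
wrapper should also verify that the previous run has actually finished before
starting a new one.

    import datetime
    import subprocess

    today = datetime.date.today()
    if today.day == 1:
        # Monthly full repair on the 1st.
        cmd = ["nodetool", "repair", "--full", "my_keyspace"]
    else:
        # Daily incremental repair on every other day (incremental is the
        # default for 'nodetool repair' on recent versions, per this thread).
        cmd = ["nodetool", "repair", "my_keyspace"]
    subprocess.run(cmd, check=True)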


Cheers,
Bowen

On 03/02/2024 01:59, Kristijonas Zalys wrote:

Hi Bowen,

Thank you for your help!

So given that we would need to run both incremental and full repair 
for a given cluster, is it safe to have both types of repair running 
for the same token ranges at the same time? Would it not create a race 
condition?


Thanks,
Kristijonas

On Fri, Feb 2, 2024 at 3:36 PM Bowen Song via user 
 wrote:


Hi Kristijonas,

To answer your questions:

1. It's still necessary to run full repair on a cluster on which
incremental repair is run periodically. The frequency of full
repair is more of an art than science. Generally speaking, the
less reliable the storage media, the more frequently full repair
should be run. The documentation on this topic is available here



2. Running incremental repair for the first time on an existing
cluster does cause Cassandra to re-compact all SSTables, and can
lead to disk usage spikes. This can be avoided by following the
steps mentioned here




I hope that helps.

Cheers,
Bowen

On 02/02/2024 20:57, Kristijonas Zalys wrote:


Hi folks,


I am working on switching from full to incremental repair in
Cassandra v4.0.6 (soon to be v4.1.3) and I have a few questions.


1.

Is it necessary to run regular full repair on a cluster if I
already run incremental repair? If yes, what frequency would
you recommend for full repair?

2.

Has anyone experienced disk usage spikes while using
incremental repair? I have noticed temporary disk footprint
increases of up to 2x (from ~15 GiB to ~30 GiB) caused by
anti-compaction while testing and am wondering how likely
that is to happen in bigger real world use cases?


Thank you all in advance!

Kristijonas



Re: Switching to Incremental Repair

2024-02-02 Thread manish khandelwal
They (incremental and full repairs) are required to run separately at
different times. You need to identify a schedule, for example, running
incremental repairs every week for 3 weeks and then running a full repair
in the 4th week.

Regards
Manish

On Sat, Feb 3, 2024 at 7:29 AM Kristijonas Zalys  wrote:

> Hi Bowen,
>
> Thank you for your help!
>
> So given that we would need to run both incremental and full repair for a
> given cluster, is it safe to have both types of repair running for the same
> token ranges at the same time? Would it not create a race condition?
>
> Thanks,
> Kristijonas
>
> On Fri, Feb 2, 2024 at 3:36 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Hi Kristijonas,
>>
>> To answer your questions:
>>
>> 1. It's still necessary to run full repair on a cluster on which
>> incremental repair is run periodically. The frequency of full repair is
>> more of an art than science. Generally speaking, the less reliable the
>> storage media, the more frequently full repair should be run. The
>> documentation on this topic is available here
>> 
>>
>> 2. Running incremental repair for the first time on an existing cluster does
>> cause Cassandra to re-compact all SSTables, and can lead to disk usage
>> spikes. This can be avoided by following the steps mentioned here
>> 
>>
>> I hope that helps.
>>
>> Cheers,
>> Bowen
>> On 02/02/2024 20:57, Kristijonas Zalys wrote:
>>
>> Hi folks,
>>
>> I am working on switching from full to incremental repair in Cassandra
>> v4.0.6 (soon to be v4.1.3) and I have a few questions.
>>
>>
>>1.
>>
>>Is it necessary to run regular full repair on a cluster if I already
>>run incremental repair? If yes, what frequency would you recommend for 
>> full
>>repair?
>>2.
>>
>>Has anyone experienced disk usage spikes while using incremental
>>repair? I have noticed temporary disk footprint increases of up to 2x 
>> (from
>>~15 GiB to ~30 GiB) caused by anti-compaction while testing and am
>>wondering how likely that is to happen in bigger real world use cases?
>>
>>
>> Thank you all in advance!
>>
>> Kristijonas
>>
>>


Re: Switching to Incremental Repair

2024-02-02 Thread Kristijonas Zalys
Hi Bowen,

Thank you for your help!

So given that we would need to run both incremental and full repair for a
given cluster, is it safe to have both types of repair running for the same
token ranges at the same time? Would it not create a race condition?

Thanks,
Kristijonas

On Fri, Feb 2, 2024 at 3:36 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Hi Kristijonas,
>
> To answer your questions:
>
> 1. It's still necessary to run full repair on a cluster on which
> incremental repair is run periodically. The frequency of full repair is
> more of an art than science. Generally speaking, the less reliable the
> storage media, the more frequently full repair should be run. The
> documentation on this topic is available here
> 
>
> 2. Running incremental repair for the first time on an existing cluster does
> cause Cassandra to re-compact all SSTables, and can lead to disk usage
> spikes. This can be avoided by following the steps mentioned here
> 
>
> I hope that helps.
>
> Cheers,
> Bowen
> On 02/02/2024 20:57, Kristijonas Zalys wrote:
>
> Hi folks,
>
> I am working on switching from full to incremental repair in Cassandra
> v4.0.6 (soon to be v4.1.3) and I have a few questions.
>
>
>1.
>
>Is it necessary to run regular full repair on a cluster if I already
>run incremental repair? If yes, what frequency would you recommend for full
>repair?
>2.
>
>Has anyone experienced disk usage spikes while using incremental
>repair? I have noticed temporary disk footprint increases of up to 2x (from
>~15 GiB to ~30 GiB) caused by anti-compaction while testing and am
>wondering how likely that is to happen in bigger real world use cases?
>
>
> Thank you all in advance!
>
> Kristijonas
>
>


Re: Switching to Incremental Repair

2024-02-02 Thread Bowen Song via user

Hi Kristijonas,

To answer your questions:

1. It's still necessary to run full repair on a cluster on which 
incremental repair is run periodically. The frequency of full repair is 
more of an art than science. Generally speaking, the less reliable the 
storage media, the more frequently full repair should be run. The 
documentation on this topic is available here 



2. Running incremental repair for the first time on an existing cluster does 
cause Cassandra to re-compact all SSTables, and can lead to disk usage 
spikes. This can be avoided by following the steps mentioned here 
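
As a rough headroom estimate for that spike (toy numbers, not a measurement):
while an SSTable is being anti-compacted into repaired and unrepaired parts,
the old file and the new files coexist on disk, so the worst case is roughly
twice the data undergoing anti-compaction at once.

    node_data_gib = 15          # hypothetical per-node data size
    concurrent_fraction = 1.0   # worst case: everything anti-compacted at once
    peak_gib = node_data_gib * (1 + concurrent_fraction)
    print(f"plan for up to ~{peak_gib:.0f} GiB at peak")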
 



I hope that helps.

Cheers,
Bowen

On 02/02/2024 20:57, Kristijonas Zalys wrote:


Hi folks,


I am working on switching from full to incremental repair in Cassandra 
v4.0.6 (soon to be v4.1.3) and I have a few questions.



1.

Is it necessary to run regular full repair on a cluster if I
already run incremental repair? If yes, what frequency would you
recommend for full repair?

2.

Has anyone experienced disk usage spikes while using incremental
repair? I have noticed temporary disk footprint increases of up to
2x (from ~15 GiB to ~30 GiB) caused by anti-compaction while
testing and am wondering how likely that is to happen in bigger
real world use cases?


Thank you all in advance!

Kristijonas



Switching to Incremental Repair

2024-02-02 Thread Kristijonas Zalys
Hi folks,

I am working on switching from full to incremental repair in Cassandra
v4.0.6 (soon to be v4.1.3) and I have a few questions.


   1.

   Is it necessary to run regular full repair on a cluster if I already run
   incremental repair? If yes, what frequency would you recommend for full
   repair?
   2.

   Has anyone experienced disk usage spikes while using incremental repair?
   I have noticed temporary disk footprint increases of up to 2x (from ~15 GiB
   to ~30 GiB) caused by anti-compaction while testing and am wondering how
   likely that is to happen in bigger real world use cases?


Thank you all in advance!

Kristijonas