Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-16 Thread Caleb Rackliffe
The patches for this issue have gotten a +1 from Mick, and that meets our
strict two-committer rule, but I'm posting them here in case anyone else
wants to take a look:

4.0: https://github.com/apache/cassandra/pull/3526
4.1: https://github.com/apache/cassandra/pull/3539
5.0: https://github.com/apache/cassandra/pull/3544

On Sun, Sep 15, 2024 at 10:21 AM Brandon Williams  wrote:

> +1 for declaring things like this in the vote/release emails.
>
> Kind Regards,
> Brandon
>
> On Sun, Sep 15, 2024 at 10:17 AM Michael Shuler 
> wrote:
> >
> > I've stated this same opinion before: a fix was documented in NEWS.txt and
> > users still had a bad day with a behavior change. Perhaps this kind of
> > inclusion needs more exposure. This is exactly the sort of change that may
> > *also* be best stated clearly in the vote/release emails, a la:
> >
> > 
> > ==
> > This release contains a relatively major change you should be aware of,
> > with xyz fix, new default, and with the behavior change, here are the
> > new defaults, how to disable, etc...
> > ==
> > kthxbai.
> >
> > Just do everything possible to convey what operators need to know,
> > wherever we can, if we include the fix. A link to NEWS.txt is insufficient
> > in this sort of case.
> >
> > Warm regards,
> > Michael
> >
> > On 9/12/24 13:16, Abe Ratnofsky wrote:
> > > Expressing another vote in favor of rejection-by-default. If a user
> > > doesn't want to lose sleep over data loss while on-call, they can read
> > > NEWS.txt and disable rejection.
> > >
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-15 Thread Brandon Williams
+1 for declaring things like this in the vote/release emails.

Kind Regards,
Brandon

On Sun, Sep 15, 2024 at 10:17 AM Michael Shuler  wrote:
>
> I've stated this same opinion before: a fix was documented in NEWS.txt and
> users still had a bad day with a behavior change. Perhaps this kind of
> inclusion needs more exposure. This is exactly the sort of change that may
> *also* be best stated clearly in the vote/release emails, a la:
>
> 
> ==
> This release contains a relatively major change you should be aware of,
> with xyz fix, new default, and with the behavior change, here are the
> new defaults, how to disable, etc...
> ==
> kthxbai.
>
> Just do everything possible to convey what operators need to know,
> wherever we can, if we include the fix. A link to NEWS.txt is insufficient
> in this sort of case.
>
> Warm regards,
> Michael
>
> On 9/12/24 13:16, Abe Ratnofsky wrote:
> > Expressing another vote in favor of rejection-by-default. If a user doesn't 
> > want to lose sleep over data loss while on-call, they can read NEWS.txt and 
> > disable rejection.
> >


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-15 Thread Michael Shuler
I've stated this same opinion before: a fix was documented in NEWS.txt and 
users still had a bad day with a behavior change. Perhaps this kind of 
inclusion needs more exposure. This is exactly the sort of change that may 
*also* be best stated clearly in the vote/release emails, a la:



==
This release contains a relatively major change you should be aware of, 
with xyz fix, new default, and with the behavior change, here are the 
new defaults, how to disable, etc...

==
kthxbai.

Just do everything possible to convey what operators need to know, 
wherever we can, if we include the fix. A link to NEWS.txt is insufficient 
in this sort of case.


Warm regards,
Michael

On 9/12/24 13:16, Abe Ratnofsky wrote:

Expressing another vote in favor of rejection-by-default. If a user doesn't 
want to lose sleep over data loss while on-call, they can read NEWS.txt and 
disable rejection.



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Benedict
It’s worth noting though that a very large engineering effort called “Transactional Cluster Metadata”, which properly addresses these problems, is already wrapping up, but that will be landing in 5.1 and won’t be suitable for back-porting.

On 13 Sep 2024, at 21:32, Caleb Rackliffe  wrote:

I'd encourage you to start a new DISCUSS thread around that.

On Fri, Sep 13, 2024 at 2:38 PM Jaydeep Chovatia  wrote:

Rejecting/logging the traffic is a significant step forward, but that does not solve the real problem. It still degrades the workload and requires manual operator involvement.

How about we also enhance Cassandra to automatically detect and fix the token ownership mismatch between StorageService and Gossip cache? More details in this ticket: https://issues.apache.org/jira/browse/CASSANDRA-18758

Jaydeep

On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe  wrote:

Until we release TCM, it will continue to be possible for nodes to have a divergent view of the ring, and this means operations can still be sent to the wrong nodes. For example, writes may be sent to nodes that do not and never will own that data, and this opens us up to rather devious silent data loss problems.

As some of you may have seen, there is a patch available for 4.0, 4.1, and 5.0 in CASSANDRA-13704 that provides a set of guardrails in the meantime for out-of-range operations. Essentially, there are two new YAML options that control whether or not to log warnings and/or reject operations that shouldn't have arrived at a receiving node.

Given that simply logging and recording metrics isn't that invasive, the question we need to answer here is whether we should reject out-of-range operations by default, even in these patch releases. (5.0 has just barely been released, so I'm not sure if that really qualifies, but I digress.) The position I'd like to take is that this is essentially a matter of correctness, and we should enable rejection by default. (Keep in mind that both new options are settable at runtime via JMX.) There is precedent for doing something similar to this in CASSANDRA-12126.

The one consequence of that we might discuss here is that if gossip is behind in notifying a node with a pending range, local rejection as it receives writes for that range may cause a small issue of availability. However, this shouldn't happen in a healthy cluster, and even if it does, we're simply translating a silent potential data loss bug into a transient but necessary availability gap with reasonable logging and visibility.
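[Editor's note] The ring-divergence failure mode described above can be sketched conceptually. This is not Cassandra code; the tokens, node names, and the dict-based ring model below are invented purely for illustration of how a stale view misroutes a write.

```python
# Conceptual sketch (illustrative only): a coordinator with a stale view
# of the token ring routes a write to a node that no longer owns the token.

def owner(ring, token):
    """Map a token to its owning node: the node whose range end-token is
    the first one >= the token, wrapping around past the largest token."""
    for end_token in sorted(ring):
        if token <= end_token:
            return ring[end_token]
    return ring[min(ring)]  # wrap-around to the first range

# Hypothetical cluster: the range ending at token 100 has moved from node
# "A" to node "D", but the coordinator's gossip-derived view is stale.
coordinator_view = {100: "A", 200: "B", 300: "C"}
actual_ring      = {100: "D", 200: "B", 300: "C"}

token = 42
routed_to  = owner(coordinator_view, token)  # "A" -- no longer the owner
real_owner = owner(actual_ring, token)       # "D"

# Without rejection, "A" silently accepts data it does not own: the
# "devious silent data loss" scenario from the message above.
assert routed_to != real_owner
```

The patch under discussion adds the check on the *receiving* side, so a node with a correct view can refuse (or at least log) such misrouted operations.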




Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Caleb Rackliffe
I'd encourage you to start a new DISCUSS thread around that.

On Fri, Sep 13, 2024 at 2:38 PM Jaydeep Chovatia 
wrote:

>
> Rejecting/logging the traffic is a significant step forward, but that does
> not solve the real problem. It still degrades the workload and requires
> manual operator involvement.
>
> How about we also enhance Cassandra to automatically detect and fix the
> token ownership mismatch between StorageService and Gossip cache? More
> details in this ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-18758
>
> Jaydeep
>
> On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe 
> wrote:
>
>> Until we release TCM, it will continue to be possible for nodes to have a
>> divergent view of the ring, and this means operations can still be sent to
>> the wrong nodes. For example, writes may be sent to nodes that do not and
>> never will own that data, and this opens us up to rather devious silent
>> data loss problems.
>>
>> As some of you may have seen, there is a patch available for 4.0, 4.1,
>> and 5.0 in CASSANDRA-13704
>>  that provides a
>> set of guardrails in the meantime for out-of-range operations. Essentially,
>> there are two new YAML options that control whether or not to log warnings
>> and/or reject operations that shouldn't have arrived at a receiving node.
>>
>> Given that simply logging and recording metrics isn't that invasive, the
>> question we need to answer here is whether we should reject out-of-range
>> operations by default, even in these patch releases. (5.0 has just barely
>> been released, so I'm not sure if that really qualifies, but I digress.)
>> The position I'd like to take is that this is essentially a matter of
>> correctness, and we should *enable rejection by default*. (Keep in mind
>> that both new options are settable at runtime via JMX.) There is precedent
>> for doing something similar to this in CASSANDRA-12126
>> .
>>
>> The one consequence of that we might discuss here is that if gossip is
>> behind in notifying a node with a pending range, local rejection as it
>> receives writes for that range may cause a small issue of availability.
>> However, this shouldn't happen in a healthy cluster, and even if it does,
>> we're simply translating a silent potential data loss bug into a transient
>> but necessary availability gap with reasonable logging and visibility.
>>
>
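[Editor's note] The availability trade-off in the quoted message can be sketched with a toy quorum model. Everything here (the ack-counting model, RF=3, the response labels) is an illustrative assumption, not the actual coordinator logic.

```python
# Hedged sketch: with rejection enabled, a misrouted write fails loudly at
# the coordinator instead of silently landing on a non-owner node.

def write_at_quorum(replica_responses, replication_factor=3):
    """Return 'SUCCESS' if a majority of replicas ack the write,
    else 'UNAVAILABLE' (a visible, retriable failure for the client)."""
    quorum = replication_factor // 2 + 1
    acks = sum(1 for r in replica_responses if r == "ACK")
    return "SUCCESS" if acks >= quorum else "UNAVAILABLE"

# Healthy cluster: all replicas own the range and ack.
assert write_at_quorum(["ACK", "ACK", "ACK"]) == "SUCCESS"

# Divergent ring: two replicas with a correct view reject the misrouted
# write, so the client sees a transient failure instead of a silent
# success whose data would later need to be found and repaired.
assert write_at_quorum(["ACK", "REJECT", "REJECT"]) == "UNAVAILABLE"

# Only when multiple nodes share the coordinator's wrong view (the
# split-brain case where a quorum agrees on incorrect state) can the
# write still appear successful.
assert write_at_quorum(["ACK", "ACK", "REJECT"]) == "SUCCESS"
```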


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Jaydeep Chovatia
Rejecting/logging the traffic is a significant step forward, but that does
not solve the real problem. It still degrades the workload and requires
manual operator involvement.

How about we also enhance Cassandra to automatically detect and fix the
token ownership mismatch between StorageService and Gossip cache? More
details in this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-18758

Jaydeep

On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe 
wrote:

> Until we release TCM, it will continue to be possible for nodes to have a
> divergent view of the ring, and this means operations can still be sent to
> the wrong nodes. For example, writes may be sent to nodes that do not and
> never will own that data, and this opens us up to rather devious silent
> data loss problems.
>
> As some of you may have seen, there is a patch available for 4.0, 4.1, and
> 5.0 in CASSANDRA-13704
>  that provides a
> set of guardrails in the meantime for out-of-range operations. Essentially,
> there are two new YAML options that control whether or not to log warnings
> and/or reject operations that shouldn't have arrived at a receiving node.
>
> Given that simply logging and recording metrics isn't that invasive, the
> question we need to answer here is whether we should reject out-of-range
> operations by default, even in these patch releases. (5.0 has just barely
> been released, so I'm not sure if that really qualifies, but I digress.)
> The position I'd like to take is that this is essentially a matter of
> correctness, and we should *enable rejection by default*. (Keep in mind
> that both new options are settable at runtime via JMX.) There is precedent
> for doing something similar to this in CASSANDRA-12126
> .
>
> The one consequence of that we might discuss here is that if gossip is
> behind in notifying a node with a pending range, local rejection as it
> receives writes for that range may cause a small issue of availability.
> However, this shouldn't happen in a healthy cluster, and even if it does,
> we're simply translating a silent potential data loss bug into a transient
> but necessary availability gap with reasonable logging and visibility.
>
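[Editor's note] The receiving-node guardrail described above can be sketched as follows. The flag names are hypothetical stand-ins for the two new YAML options (the patch's real option names may differ), and the range check is a deliberate simplification.

```python
# Illustrative sketch of the guardrail decision, NOT the actual patch:
# a node applies a mutation only if it owns the token; otherwise it logs
# a warning and/or rejects, per two (hypothetically named) config flags.

log_out_of_range_operations = True     # hypothetical stand-in flag
reject_out_of_range_operations = True  # hypothetical stand-in flag

def handle_mutation(token, owned_ranges, warnings):
    """owned_ranges: list of (lo, hi] token ranges this node replicates."""
    in_range = any(lo < token <= hi for lo, hi in owned_ranges)
    if in_range:
        return "APPLIED"
    if log_out_of_range_operations:
        warnings.append(f"out-of-range token {token}")
    if reject_out_of_range_operations:
        return "REJECTED"
    return "APPLIED"  # legacy behavior: silently accept misplaced data

warnings = []
assert handle_mutation(50, [(0, 100)], warnings) == "APPLIED"
assert handle_mutation(150, [(0, 100)], warnings) == "REJECTED"
assert warnings == ["out-of-range token 150"]
```

The point of the proposed default is the last branch: with rejection off, an out-of-range write still succeeds silently, which is exactly the behavior the thread argues should no longer be the default.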


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Caleb Rackliffe
If it makes anyone feel better, 2,600 of the 3,600 lines of this patch are
tests (and the rest is minor refactoring of the verb handlers).

Anyway, glad to see a ton of participation here. I'll get back into
implementation space today, and start dealing with review feedback as it
comes in...

P.S. I want to call out/thank Sam Tunnicliffe, whose unpublished work is
the basis for the patch I've put together.

On Fri, Sep 13, 2024 at 8:52 AM Brandon Williams  wrote:

> On Thu, Sep 12, 2024 at 8:34 PM Josh McKenzie 
> wrote:
> > I'm not advocating for us having a rigid principled stance where we
> reject all nuance and don't discuss things. I'm advocating for us
> coalescing on a shared default stance of correctness unless otherwise
> excepted. We know we're a diverse group, we're all different people with
> different histories / values / opinions / cultures, and I think that's what
> makes this community as effective as it is.
> >
> > But I don't think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it's been around, or how
> frequently some of us on the project have observed some given phenomenon.
> My gut tells me we'd all be in a better place if we all started from 0 on a
> discussion like this as "Ok, data loss is unacceptable. Unless otherwise
> warranted, we should do all we can to fix this on all supported branches as
> our default response".
>
> I think this absolutely makes sense in situations where things are
> clear cut and just need to be corrected, like an off-by-one error.  We
> discover the problem, we fix it.
>
> However, the current situation is not that.  You say age is not
> relevant, and though that may be true for the age of the bug (which is
> before my time!), I do think it's relevant that we've known about this
> (documented by the ticket) for at least 7 years, but only now have
> decided to address it.  Furthermore, this isn't a straight correction,
> from the very beginning a possible tradeoff with availability in some
> circumstances was mentioned.  We are talking about changing behavior
> on top of a 200k delta in ~4k lines of code across a significant
> number of files, and doing so in the 13th minor release of a 4 year
> old major, for a problem we have known about long enough to have put
> this in all current major releases.  I don't think painting the
> picture with the "data loss" brush alone tells us everything we need
> to know to make that kind of commitment to what is now our old stable
> branch.
>
> That said, I want to thank Scott and others who provided helpful
> context here.  I still think it is in the realm of possibility that
> changing behavior can cause a problem for an existing user, and they
> will be right to be angry if that happens, but that is an easier pill
> to swallow if we are possibly preventing data loss that is easier to
> encounter than previously thought.  I support doing this in all the
> current branches.
>
> Kind Regards,
> Brandon
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Brandon Williams
On Thu, Sep 12, 2024 at 8:34 PM Josh McKenzie  wrote:
> I'm not advocating for us having a rigid principled stance where we reject 
> all nuance and don't discuss things. I'm advocating for us coalescing on a 
> shared default stance of correctness unless otherwise excepted. We know we're 
> a diverse group, we're all different people with different histories / values 
> / opinions / cultures, and I think that's what makes this community as 
> effective as it is.
>
> But I don't think it's healthy for us to repeatedly re-litigate whether data 
> loss is acceptable based on how long it's been around, or how frequently some 
> of us on the project have observed some given phenomenon. My gut tells me 
> we'd all be in a better place if we all started from 0 on a discussion like 
> this as "Ok, data loss is unacceptable. Unless otherwise warranted, we should 
> do all we can to fix this on all supported branches as our default response".

I think this absolutely makes sense in situations where things are
clear cut and just need to be corrected, like an off-by-one error.  We
discover the problem, we fix it.

However, the current situation is not that.  You say age is not
relevant, and though that may be true for the age of the bug (which is
before my time!), I do think it's relevant that we've known about this
(documented by the ticket) for at least 7 years, but only now have
decided to address it.  Furthermore, this isn't a straight correction,
from the very beginning a possible tradeoff with availability in some
circumstances was mentioned.  We are talking about changing behavior
on top of a 200k delta in ~4k lines of code across a significant
number of files, and doing so in the 13th minor release of a 4 year
old major, for a problem we have known about long enough to have put
this in all current major releases.  I don't think painting the
picture with the "data loss" brush alone tells us everything we need
to know to make that kind of commitment to what is now our old stable
branch.

That said, I want to thank Scott and others who provided helpful
context here.  I still think it is in the realm of possibility that
changing behavior can cause a problem for an existing user, and they
will be right to be angry if that happens, but that is an easier pill
to swallow if we are possibly preventing data loss that is easier to
encounter than previously thought.  I support doing this in all the
current branches.

Kind Regards,
Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Mick Semb Wever
reply below.



> Mick - this patch doesn't fix things 100%. It can't. BUT - it does take us
> from "In all cases where this occurs you will silently lose data" to "in
> some cases where this occurs you will have a rejected write, in others
> you'll have coordinator level logging, and in the worst split-brain case
> you can still have data loss (split-brain ownership where the quorum agrees
> on incorrect state)".
>


I agree with you Josh (and everyone else that's expressed the same
sentiment).  I have no veto against what the defaults are and am happy to
go with any consensus.  My nit-pickings and explorations I can move to the
ticket.


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Josh McKenzie
I think it's worth exploring where the disconnect is just a *bit* more, though 
I agree with you, Benedict, that there appears to be a clear consensus, so the 
thread has evolved more into talking about our principles than what to do in 
this specific scenario.

Mick - this patch doesn't fix things 100%. It can't. BUT - it does take us from 
"In all cases where this occurs you will silently lose data" to "in some cases 
where this occurs you will have a rejected write, in others you'll have 
coordinator level logging, and in the worst split-brain case you can still have 
data loss (split-brain ownership where the quorum agrees on incorrect state)".

Whether we're 100% fixing data loss, or 10%, at the end of the day my position 
is that we should fix all the data loss we can (this statement is unique to 
gossip's architectural limitations) *even if that means loss of availability in 
edge-cases where it is not strictly required*. I think where we might be 
misaligned:
> It can prevent data mislocation in some scenarios but it offers no guarantees 
> about that, and can also degrade availability unnecessarily. 
It *can't* prevent mis-location in all scenarios. That's a shortcoming of 
Gossip's architecture.

If the availability degradation is part and parcel to the mitigation, and the 
mitigation approach is the *best worst option* in an eventually consistent 
topology sync environment, *it's not unnecessary*. It's the price we pay for 
mitigating a flaw in the architecture as best we can.

A comparable: if we have 1% data loss, and the only way we can mitigate that is 
to have a 1% hit in availability instead, the former is clearly worse for our 
users than the latter. Recovery from data loss is multiple orders of magnitude 
harder than retrying a query on availability degradation.

On Fri, Sep 13, 2024, at 8:15 AM, Benedict wrote:
> 
> I think everyone has made their case, the costs and benefits are fairly well 
> understood, and there appears to be a strong quorum that favours an approach 
> that prioritises avoiding data loss. 
> 
> So, I propose we either recognise that this is the clear position of the 
> community, or move to a vote to formalise it one way or the other.
> 
> 
>> On 13 Sep 2024, at 13:02, Alex Petrov  wrote:
>> 
>> I agree with folks saying that we absolutely need to reject misplaced 
>> writes. It may not preclude the coordinator making a local write, or making 
>> a write to a local replica, but even reducing the probability of a misplaced 
>> write shown as success to the client is a substantial win. 
>> 
>> On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:
>>> replies below (to Scott, Josh and Jeremiah).
>>> 
>>> tl;dr all my four points remain undisputed when the patch is applied. This 
>>> is a messy situation, but there's no denying the value of rejecting writes 
>>> in various known popular scenarios. Point (1) remains important to 
>>> highlight IMHO.
>>> 
>>> 
>>> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas  
>>> wrote:
 Since that time, I’ve received several urgent messages from major users of 
 Apache Cassandra and even customers of Cassandra ecosystem vendors asking 
 about this bug. Some were able to verify the presence of lost data in 
 SSTables on nodes where it didn’t belong, demonstrate empty read responses 
 for data that is known proof-positive to exist (think content-addressable 
 stores), or reproduce this behavior in a local cluster after forcing 
 disagreement.
 
>>> 
>>> 
>>> 
>>> Having been privy to the background of those "urgent messages" I can say 
>>> the information you received wasn't correct (or complete). 
>>> 
>>> My challenge on this thread is about understanding where this might 
>>> unexpectedly bite users, which should be part of our due diligence when 
>>> applying such patches to stable branches.   I ask you to run through my 
>>> four points, which AFAIK still stand true.  
>>> 
>>> 
 But I **don't** think it's healthy for us to repeatedly re-litigate 
 whether data loss is acceptable based on how long it's been around, or how 
 frequently some of us on the project have observed some given phenomenon. 
>>> 
>>> 
>>> Josh, that's true, but talking to these things helps open up the 
>>> discussion, see my point above wrt being aware of second-hand evidence that 
>>> was inaccurate.
>>> 
>>> 
 The severity and frequency of this issue combined with the business risk 
 to Apache Cassandra users changed my mind about fixing it in earlier 
 branches despite TCM having been merged to fix it for good on trunk.
 
>>> 
>>> 
>>> That shouldn't prevent us from investigating known edge-cases, collateral 
>>> damage, and unexpected behavioural changes in patch versions.   
>>> 
>>>  
>>> 
>>>  
> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan  
> wrote:
>> 1. Rejecting writes does not prevent data loss in this situation.  It 
>> only reduces it.  The investigation and remediation of p

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Benedict
I think everyone has made their case, the costs and benefits are fairly well understood, and there appears to be a strong quorum that favours an approach that prioritises avoiding data loss.

So, I propose we either recognise that this is the clear position of the community, or move to a vote to formalise it one way or the other.

On 13 Sep 2024, at 13:02, Alex Petrov  wrote:

I agree with folks saying that we absolutely need to reject misplaced writes. It may not preclude the coordinator making a local write, or making a write to a local replica, but even reducing the probability of a misplaced write shown as success to the client is a substantial win.

On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:

replies below (to Scott, Josh and Jeremiah).

tl;dr all my four points remain undisputed when the patch is applied. This is a messy situation, but there's no denying the value of rejecting writes in various known popular scenarios. Point (1) remains important to highlight IMHO.

On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas  wrote:

Since that time, I’ve received several urgent messages from major users of Apache Cassandra and even customers of Cassandra ecosystem vendors asking about this bug. Some were able to verify the presence of lost data in SSTables on nodes where it didn’t belong, demonstrate empty read responses for data that is known proof-positive to exist (think content-addressable stores), or reproduce this behavior in a local cluster after forcing disagreement.

Having been privy to the background of those "urgent messages" I can say the information you received wasn't correct (or complete).

My challenge on this thread is about understanding where this might unexpectedly bite users, which should be part of our due diligence when applying such patches to stable branches. I ask you to run through my four points, which AFAIK still stand true.

But I don't think it's healthy for us to repeatedly re-litigate whether data loss is acceptable based on how long it's been around, or how frequently some of us on the project have observed some given phenomenon.

Josh, that's true, but talking to these things helps open up the discussion; see my point above wrt being aware of second-hand evidence that was inaccurate.

The severity and frequency of this issue combined with the business risk to Apache Cassandra users changed my mind about fixing it in earlier branches despite TCM having been merged to fix it for good on trunk.

That shouldn't prevent us from investigating known edge-cases, collateral damage, and unexpected behavioural changes in patch versions.

On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan  wrote:

1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possible mislocated data is still required.

All nodes which reject a write prevent mislocated data. There is still the possibility of some node having the same wrong view of the ring as the coordinator (including if they are the same node) accepting data. Unless there are multiple nodes with the same wrong view of the ring, data loss is prevented for CL > ONE.

(1) stands true, for all CLs. I think this is pretty important here.

With write rejection enabled, we can tell people it may have prevented a lot of data mislocation and is of great benefit and safety, but there's no guarantee that it's prevented all data mislocation. If an operator encounters writes rejected in this manner they must still go investigate a possible data loss situation. We are aware of our own situations where we have been hit by this, and they come in a number of variants, but we can't speak to every situation users will find themselves in. We're making a trade-off here of reduced availability against more forceful alerting and an alleviation of data mislocation.

2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.

Without this patch no one is aware of any issues at all. Maybe you are referring to a situation where the patch is applied, but the default behavior is to still accept the “bad” data? In that case yes, turning on rejection makes it “louder” in that your queries can fail if too many nodes are wrong.

(2) stands true. Rejecting is a louder alert, but it is not complete; see next point. (All four points are made with the patch applied.)

3. Rejecting writes does not capture all places where the problem is occurring. Only logging/metrics fully captures everywhere the problem is occurring.

Not sure what you are saying here.

Rejected writes can be swallowed by a coordinator sending background writes to other nodes when it has already ack'd the response to the client. If the operator wants a complete and accurate overview of out-of-range writes they have to look at the logs/metrics.

(3) stands true.

4. … nodes can be rejecting writes when they are in fact correct hence causing “over-e

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Alex Petrov
I agree with folks saying that we absolutely need to reject misplaced writes. 
It may not preclude the coordinator making a local write, or making a write to 
a local replica, but even reducing the probability of a misplaced write shown 
as success to the client is a substantial win. 

On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:
> replies below (to Scott, Josh and Jeremiah).
> 
> tl;dr all my four points remain undisputed when the patch is applied. This 
> is a messy situation, but there's no denying the value of rejecting writes 
> in various known popular scenarios. Point (1) remains important to highlight 
> IMHO.
> 
> 
> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas  wrote:
>> Since that time, I’ve received several urgent messages from major users of 
>> Apache Cassandra and even customers of Cassandra ecosystem vendors asking 
>> about this bug. Some were able to verify the presence of lost data in 
>> SSTables on nodes where it didn’t belong, demonstrate empty read responses 
>> for data that is known proof-positive to exist (think content-addressable 
>> stores), or reproduce this behavior in a local cluster after forcing 
>> disagreement.
>> 
> 
> 
> 
> Having been privy to the background of those "urgent messages" I can say the 
> information you received wasn't correct (or complete). 
> 
> My challenge on this thread is about understanding where this might 
> unexpectedly bite users, which should be part of our due diligence when 
> applying such patches to stable branches.   I ask you to run through my four 
> points, which AFAIK still stand true.  
> 
> 
>> But I **don't** think it's healthy for us to repeatedly re-litigate whether 
>> data loss is acceptable based on how long it's been around, or how 
>> frequently some of us on the project have observed some given phenomenon. 
> 
> 
> Josh, that's true, but talking to these things helps open up the discussion, 
> see my point above wrt being aware of second-hand evidence that was 
> inaccurate.
> 
> 
>> The severity and frequency of this issue combined with the business risk to 
>> Apache Cassandra users changed my mind about fixing it in earlier branches 
>> despite TCM having been merged to fix it for good on trunk.
>> 
> 
> 
> That shouldn't prevent us from investigating known edge-cases, collateral 
> damage, and unexpected behavioural changes in patch versions.   
> 
>  
> 
>  
>>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan  
>>> wrote:
 1. Rejecting writes does not prevent data loss in this situation.  It only 
 reduces it.  The investigation and remediation of possible mislocated data 
 is still required.
>>> All nodes which reject a write prevent mislocated data.  There is still the 
>>> possibility of some node having the same wrong view of the ring as the 
>>> coordinator (including if they are the same node) accepting data.  Unless 
>>> there are multiple nodes with the same wrong view of the ring, data loss is 
>>> prevented for CL > ONE.
> 
> 
> (1) stands true, for all CLs.  I think this is pretty important here.
> 
> With write rejection enabled, we can tell people it may have prevented a lot 
> of data mislocation and is of great benefit and safety, but there's no 
> guarantee that it's prevented all data mislocation.  If an operator 
> encounters writes rejected in this manner they must still go investigate a 
> possible data loss situation.  
> 
> We are aware of our own situations where we have been hit by this, and they 
> come in a number of variants, but we can't speak to every situation users 
> will find themselves in.   We're making a trade-off here of reduced 
> availability against more forceful alerting and an alleviation of data 
> mislocation.   
> 
> 
> 
>>> 
 2. Rejecting writes is a louder form of alerting for users unaware of the 
 scenario, those not already monitoring logs or metrics.
>>> Without this patch no one is aware of any issues at all.  Maybe you are 
>>> referring to a situation where the patch is applied, but the default 
>>> behavior is to still accept the “bad” data?  In that case yes, turning on 
>>> rejection makes it “louder” in that your queries can fail if too many nodes 
>>> are wrong.
> 
> (2) stands true.Rejecting is a louder alert, but it is not complete, see 
> next point.  (All four points are made with the patch applied.)
> 
>  
>>> 
 3. Rejecting writes does not capture all places where the problem is 
 occurring.  Only logging/metrics fully captures everywhere the problem is 
 occurring.
>>> 
>>> Not sure what you are saying here.
> 
> Rejected writes can be swallowed by a coordinator sending background writes 
> to other nodes when it has already ack'd the response to the client.  If the 
> operator wants a complete and accurate overview of out-of-range writes they 
> have to look at the logs/metrics.
> 
> (3) stands true.
> 
>  
>>> 
 4. … nodes can be rejecting writes when they are in fact correct hence 
 causing “over-eager unava

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Mick Semb Wever
replies below (to Scott, Josh and Jeremiah).

tl;dr all my four points remain undisputed, when the patch is applied.
This is a messy situation, but there is no denying the value of rejecting
writes in various known popular scenarios.  Point (1) remains important to
highlight IMHO.


On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas 
wrote:

> Since that time, I’ve received several urgent messages from major users of
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking
> about this bug. Some were able to verify the presence of lost data in
> SSTables on nodes where it didn’t belong, demonstrate empty read responses
> for data that is known proof-positive to exist (think content-addressable
> stores), or reproduce this behavior in a local cluster after forcing
> disagreement.
>



Having been privy to the background of those "urgent messages" I can say
the information you received wasn't correct (or complete).

My challenge on this thread is about understanding where this might
unexpectedly bite users, which should be part of our due diligence when
applying such patches to stable branches.   I ask you to run through my
four points, which AFAIK still stand true.


But I *don't* think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it's been around, or how
> frequently some of us on the project have observed some given phenomenon.


Josh, that's true, but talking to these things helps open up the
discussion, see my point above wrt being aware of second-hand evidence that
was inaccurate.


The severity and frequency of this issue combined with the business risk to
> Apache Cassandra users changed my mind about fixing it in earlier branches
> despite TCM having been merged to fix it for good on trunk.
>


That shouldn't prevent us from investigating known edge-cases, collateral
damage, and unexpected behavioural changes in patch versions.





> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan 
> wrote:
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
>> reduces it.  The investigation and remediation of possible mislocated data
>> is still required.
>>
>
> All nodes which reject a write prevent mislocated data.  There is still
> the possibility of some node having the same wrong view of the ring as the
> coordinator (including if they are the same node) accepting data.  Unless
> there are multiple nodes with the same wrong view of the ring, data loss is
> prevented for CL > ONE.
>
>

(1) stands true, for all CLs.  I think this is pretty important here.

With writes rejection enabled, we can tell people it may have prevented a
lot of data mislocation and is of great benefit and safety, but there's no
guarantee that it's prevented all data mislocation.  If an operator
encounters writes rejected in this manner they must still go investigate a
possible data loss situation.

We are aware of our own situations where we have been hit by this, and they
come in a number of variants, but we can't speak to every situation users
will find themselves in.   We're making a trade-off here of reduced
availability against more forceful alerting and an alleviation of data
mislocation.


2. Rejecting writes is a louder form of alerting for users unaware of the
>> scenario, those not already monitoring logs or metrics.
>>
>
> Without this patch no one is aware of any issues at all.  Maybe you are
> referring to a situation where the patch is applied, but the default
> behavior is to still accept the “bad” data?  In that case yes, turning on
> rejection makes it “louder” in that your queries can fail if too many nodes
> are wrong.
>
>
(2) stands true.  Rejecting is a louder alert, but it is not complete; see
next point.  (All four points are made with the patch applied.)



> 3. Rejecting writes does not capture all places where the problem is
>> occurring.  Only logging/metrics fully captures everywhere the problem is
>> occurring.
>>
>
> Not sure what you are saying here.
>
>
Rejected writes can be swallowed by a coordinator sending background writes
to other nodes when it has already ack'd the response to the client.  If
the operator wants a complete and accurate overview of out-of-range writes
they have to look at the logs/metrics.

(3) stands true.
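A toy sketch of this swallowing effect (illustrative Python with invented names, not Cassandra code): the client sees success as soon as a quorum accepts, so a minority rejection surfaces only in the logs/metrics.

```python
# Toy model of point (3): a replica with a stale ring view rejects an
# out-of-range write, but the coordinator still acks the client once a
# quorum of replicas accepted.  The rejection is visible only in a log.
# All names here are illustrative; this is not Cassandra's API.

rejections_logged = []

def replica_accepts(name, has_stale_view):
    """A stale replica rejects the write and logs it; others accept."""
    if has_stale_view:
        rejections_logged.append(f"{name}: out-of-range write rejected")
        return False
    return True

def quorum_write(replicas, quorum=2):
    # The coordinator sends the mutation to every replica up front...
    acks = sum(replica_accepts(name, stale) for name, stale in replicas)
    # ...and acks the client once a quorum responded; a minority
    # rejection does not fail the request.
    return "ack" if acks >= quorum else "timeout"

# Replicas "a" and "b" accept, so the client sees success -- yet "c"
# rejected, and only the log records that something went wrong.
print(quorum_write([("a", False), ("b", False), ("c", True)]))  # -> ack
print(rejections_logged)  # -> ['c: out-of-range write rejected']
```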



> 4. … nodes can be rejecting writes when they are in fact correct hence
>> causing “over-eager unavailability”.
>>
>
> When would this occur?  I guess when the node with the bad ring
> information is a replica sent data from a coordinator with the correct ring
> state?  There would be no “unavailability” here unless there were multiple
> nodes in such a state.  I also again would not call this over eager,
> because the node with the bad ring state is f’ed up and needs to be fixed.
> So it being considered unavailable doesn’t seem over-eager to me.
>
>
This fails in a quorum write.  And the node need not be f'ed up, just
delayed in its view.

(4) stands true.
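Points (1) and (4) together can be sketched with a toy quorum model (illustrative Python with invented names, not Cassandra code): a single delayed replica rejecting a correctly placed write does not fail a QUORUM request, but two delayed replicas do.

```python
# Toy model of points (1) and (4).  Each replica checks the write's
# token against its *own* view of the ring and rejects the write if it
# believes it does not own the token.  Names are illustrative only.

def owns(ranges, token):
    """True if token falls in any (lo, hi] range (wraparound ignored)."""
    return any(lo < token <= hi for lo, hi in ranges)

def quorum_write(token, replica_views, quorum=2):
    acks = sum(owns(view, token) for view in replica_views.values())
    return acks >= quorum

correct_view = [(0, 100)]    # these replicas really do own (0, 100]
stale_view   = [(100, 200)]  # a delayed view of the ring

# One delayed replica rejects a write that is in fact correctly placed,
# but with RF=3 at QUORUM the write still succeeds: no unavailability.
print(quorum_write(50, {"a": correct_view,
                        "b": correct_view,
                        "c": stale_view}))  # -> True

# With two delayed replicas sharing the wrong view, QUORUM fails: the
# "over-eager unavailability" of point (4) -- loud, but it does alert
# the operator that views of the ring disagree.
print(quorum_write(50, {"a": correct_view,
                        "b": stale_view,
                        "c": stale_view}))  # -> False
```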


Given the fact that a user can r

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-13 Thread Berenguer Blasi
+1 to rejecting on all branches. Yes, fixing bugs and problems changes how 
things used to work, and some users will be surprised. But that's better 
than being surprised by an eventual data loss.


On 13/9/24 3:34, Josh McKenzie wrote:
Even when the fix is only partial, so really it's more about more 
forcefully alerting the operator to the problem via over-eager 
unavailability …?


Sometimes a principled stance can take us away from the important 
details in the discussions.
My understanding of the ticket (having not dug deeply into the code, 
just reviewed the JIRA and this thread), is that this is as effective 
of a solution as we can have in a non-deterministic, non-epoch based, 
non-transactional metadata system. i.e. Gossip. I don't see it as a 
partial fix but I might be misunderstanding.


I'm not advocating for us having a rigid principled stance where we 
reject all nuance and don't discuss things. I'm advocating for us 
coalescing on a shared */default/* stance of correctness unless 
otherwise excepted. We know we're a diverse group, we're all different 
people with different histories / values / opinions / cultures, and I 
think that's what makes this community as effective as it is.


But I /*don't*/ think it's healthy for us to repeatedly re-litigate 
whether data loss is acceptable based on how long it's been around, or 
how frequently some of us on the project have observed some given 
phenomenon. My gut tells me we'd all be in a better place if we all 
started from 0 on a discussion like this as "Ok, data loss is 
unacceptable. Unless otherwise warranted, we should do all we can to 
fix this on all supported branches as our default response".


On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:


Thanks all for discussion on this.


It’s hard to describe the sinking feeling that hit me when it became 
clear to me how common this problem is - and how horribly difficult 
it is to prove one has encountered this bug.



Two years ago, my understanding was that this is an exceptionally 
rare and transient issue unlikely to occur after all the work we put 
into gossip. My view was that gossip had basically been shorn up and 
that Transactional Metadata is the proper fix for this with its epoch 
design (which is true).



Since that time, I’ve received several urgent messages from major 
users of Apache Cassandra and even customers of Cassandra ecosystem 
vendors asking about this bug. Some were able to verify the presence 
of lost data in SSTables on nodes where it didn’t belong, demonstrate 
empty read responses for data that is known proof-positive to exist 
(think content-addressable stores), or reproduce this behavior in a 
local cluster after forcing disagreement.



The severity and frequency of this issue combined with the business 
risk to Apache Cassandra users changed my mind about fixing it in 
earlier branches despite TCM having been merged to fix it for good on 
trunk.



The guards in this patch are extensive: point reads, range reads, 
mutations, repair, incoming / outgoing streams, hints, merkle tree 
requests, and others I’m forgetting. They’re simple guards, and while 
they touch many subsystems, they’re not invasive changes.



There is no reasonable scenario that’s common enough that would 
justify disabling a guard preventing silent data loss by default. I 
appreciate that a prop exists to permit or warn in the presence of 
data loss for anyone who may want that, in the spirit of users being 
in control of their clusters’ behavior.



Very large operators may only see indications the guard took effect 
for a handful of queries per day — but in instances where ownership 
disagreement is prolonged, the patch is an essential guard against 
large-scale unrecoverable data loss and incorrect responses to 
queries. I’ll further take the position that those few queries in 
transient disagreement scenarios would be justification by themselves.



I support merging the patch to all proposed branches and enabling the 
guard by default.



– Scott


On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan 
 wrote:


1. Rejecting writes does not prevent data loss in this situation.  
It only reduces it.  The investigation and remediation of possible 
mislocated data is still required.


All nodes which reject a write prevent mislocated data.  There is 
still the possibility of some node having the same wrong view of the 
ring as the coordinator (including if they are the same node) 
accepting data.  Unless there are multiple nodes with the same wrong 
view of the ring, data loss is prevented for CL > ONE.


2. Rejecting writes is a louder form of alerting for users unaware 
of the scenario, those not already monitoring logs or metrics.


Without this patch no one is aware of any issues at all.  Maybe you 
are referring to a situation where the patch is applied, but the 
default behavior is to still accept the “bad” data?  In that case 
yes, turning on rejection makes it “louder” in that your 

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread guo Maxwell
Sorry for sending an interfering email. Please ignore the one above; my
mistaken send cannot be undone.

For this DISCUSS, I personally +1 enabling rejection by default. We did
something similar to address this issue. Although the default behavior may
change, what we are protecting is the correctness of data storage (even if
only as a preventive measure), and I think that should be the most important
property of a database; other concerns matter less by comparison.


guo Maxwell  wrote on Fri, Sep 13, 2024 at 10:50:

>
> +1, enable rejection by default on all branches. We did something similar
> to address this issue. Although the default behavior may change, what we
> solve is the correctness of data storage, and I think that should be the
> most important thing about a database, so other things may not be so
> important.
>
> Josh McKenzie  wrote on Fri, Sep 13, 2024 at 09:34:
>
>> Even when the fix is only partial, so really it's more about more
>> forcefully alerting the operator to the problem via over-eager
>> unavailability …?
>>
>> Sometimes a principled stance can take us away from the important details
>> in the discussions.
>>
>> My understanding of the ticket (having not dug deeply into the code, just
>> reviewed the JIRA and this thread), is that this is as effective of a
>> solution as we can have in a non-deterministic, non-epoch based,
>> non-transactional metadata system. i.e. Gossip. I don't see it as a
>> partial fix but I might be misunderstanding.
>>
>> I'm not advocating for us having a rigid principled stance where we
>> reject all nuance and don't discuss things. I'm advocating for us
>> coalescing on a shared *default* stance of correctness unless otherwise
>> excepted. We know we're a diverse group, we're all different people with
>> different histories / values / opinions / cultures, and I think that's
>> what makes this community as effective as it is.
>>
>> But I *don't* think it's healthy for us to repeatedly re-litigate whether
>> data loss is acceptable based on how long it's been around, or how
>> frequently some of us on the project have observed some given phenomenon.
>> My gut tells me we'd all be in a better place if we all started from 0 on
>> a discussion like this as "Ok, data loss is unacceptable. Unless
>> otherwise warranted, we should do all we can to fix this on all supported
>> branches as our default response".
>>
>> On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
>>
>> Thanks all for discussion on this.
>>
>> It's hard to describe the sinking feeling that hit me when it became
>> clear to me how common this problem is - and how horribly difficult it is
>> to prove one has encountered this bug.
>>
>> Two years ago, my understanding was that this is an exceptionally rare
>> and transient issue unlikely to occur after all the work we put into
>> gossip. My view was that gossip had basically been shored up and that
>> Transactional Metadata is the proper fix for this with its epoch design
>> (which is true).
>>
>>
>> Since that time, I’ve received several urgent messages from major users
>> of Apache Cassandra and even customers of Cassandra ecosystem vendors
>> asking about this bug. Some were able to verify the presence of lost data
>> in SSTables on nodes where it didn’t belong, demonstrate empty read
>> responses for data that is known proof-positive to exist (think
>> content-addressable stores), or reproduce this behavior in a local cluster
>> after forcing disagreement.
>>
>>
>> The severity and frequency of this issue combined with the business risk
>> to Apache Cassandra users changed my mind about fixing it in earlier
>> branches despite TCM having been merged to fix it for good on trunk.
>>
>>
>> The guards in this patch are extensive: point reads, range reads,
>> mutations, repair, incoming / outgoing streams, hints, merkle tree
>> requests, and others I’m forgetting. They’re simple guards, and while they
>> touch many subsystems, they’re not invasive changes.
>>
>>
>> There is no reasonable scenario that’s common enough that would justify
>> disabling a guard preventing silent data loss by default. I appreciate that
>> a prop exists to permit or warn in the presence of data loss for anyone who
>> may want that, in the spirit of users being in control of their clusters’
>> behavior.
>>
>>
>> Very large operators may only see indications the guard took effect for a
>> handful of queries per day — but in instances where ownership disagreement
>> is prolonged, the patch is an essential guard against large-scale
>> unrecoverable data loss and incorrect responses to queries. I’ll further
>> take the position that those few queries in transient disagreement
>> scenarios would be justification by themselves.
>>
>>
>> I support merging the patch to all proposed branches and enabling the
>> guard by default.
>>
>>
>> – Scott
>>
>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan 
>> wrote:
>>
>> 
>>
>> 1. Rejecting writes does not prevent data loss in this situation.  It
>> only reduces it.  The investigation and remediation of possible mislocated
>> data is still required.
>>
>>
>> All nodes which reject a write prevent mislocated data.  There is still
>> the possibility of some node having the same wrong view of the ring as the
>> coordinator (including if they are the same node) accepting data.  Unless
>> there are multiple nodes with the same wrong view of the ring, data loss is
>> prevented for CL > ONE.
>>
>> 2. Rejecting writes is a louder form of alerting for users unaware of the
>> scenario, those not already monitoring logs or metrics.
>>
>>
>> Without this patch no one is aware of any issues at all.  Maybe you are
>> referring to a situation where the patch is applied, but the default
>> behavior is to still accept the “bad” data?  In that case yes, turning on
>> rejection makes it “louder” in that your queries can fail if too many nodes
>> are wrong.
>>
>> 3. Rejecting writes does not capture all places where the problem is
>> occurring.  Only logging/metrics fully captures everywhere the problem is
>> occurring.
>>
>>
>> Not sure what you are saying here.
>>
>> nodes can be rejecting writes when they are in fact correct hence
>> causing “over-eager unavailability”.
>>
>>
>> When would this occur?  I guess when the node with the bad ring
>> information is a replica sent data from a coordinator with the correct ring
>> state?  There would be no “unavailability” here unless there were multiple
>> nodes in such a state.  I also again would not call this over eager,
>> because the node w

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread guo Maxwell
+1, enable rejection by default on all branches. We did something similar to
address this issue. Although the default behavior may change, what we solve
is the correctness of data storage, and I think that should be the most
important thing about a database, so other things may not be so important.

Josh McKenzie  wrote on Fri, Sep 13, 2024 at 09:34:

> Even when the fix is only partial, so really it's more about more
> forcefully alerting the operator to the problem via over-eager
> unavailability …?
>
> Sometimes a principled stance can take us away from the important details
> in the discussions.
>
> My understanding of the ticket (having not dug deeply into the code, just
> reviewed the JIRA and this thread), is that this is as effective of a
> solution as we can have in a non-deterministic, non-epoch based,
> non-transactional metadata system. i.e. Gossip. I don't see it as a
> partial fix but I might be misunderstanding.
>
> I'm not advocating for us having a rigid principled stance where we reject
> all nuance and don't discuss things. I'm advocating for us coalescing on a
> shared *default* stance of correctness unless otherwise excepted. We know
> we're a diverse group, we're all different people with different histories
> / values / opinions / cultures, and I think that's what makes this
> community as effective as it is.
>
> But I *don't* think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it's been around, or how
> frequently some of us on the project have observed some given phenomenon.
> My gut tells me we'd all be in a better place if we all started from 0 on
> a discussion like this as "Ok, data loss is unacceptable. Unless otherwise
> warranted, we should do all we can to fix this on all supported branches
> as our default response".
>
> On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
>
> Thanks all for discussion on this.
>
> It's hard to describe the sinking feeling that hit me when it became clear
> to me how common this problem is - and how horribly difficult it is to
> prove one has encountered this bug.
>
> Two years ago, my understanding was that this is an exceptionally rare and
> transient issue unlikely to occur after all the work we put into gossip.
> My view was that gossip had basically been shored up and that
> Transactional Metadata is the proper fix for this with its epoch design
> (which is true).
>
>
> Since that time, I’ve received several urgent messages from major users of
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking
> about this bug. Some were able to verify the presence of lost data in
> SSTables on nodes where it didn’t belong, demonstrate empty read responses
> for data that is known proof-positive to exist (think content-addressable
> stores), or reproduce this behavior in a local cluster after forcing
> disagreement.
>
>
> The severity and frequency of this issue combined with the business risk
> to Apache Cassandra users changed my mind about fixing it in earlier
> branches despite TCM having been merged to fix it for good on trunk.
>
>
> The guards in this patch are extensive: point reads, range reads,
> mutations, repair, incoming / outgoing streams, hints, merkle tree
> requests, and others I’m forgetting. They’re simple guards, and while they
> touch many subsystems, they’re not invasive changes.
>
>
> There is no reasonable scenario that’s common enough that would justify
> disabling a guard preventing silent data loss by default. I appreciate that
> a prop exists to permit or warn in the presence of data loss for anyone who
> may want that, in the spirit of users being in control of their clusters’
> behavior.
>
>
> Very large operators may only see indications the guard took effect for a
> handful of queries per day — but in instances where ownership disagreement
> is prolonged, the patch is an essential guard against large-scale
> unrecoverable data loss and incorrect responses to queries. I’ll further
> take the position that those few queries in transient disagreement
> scenarios would be justification by themselves.
>
>
> I support merging the patch to all proposed branches and enabling the
> guard by default.
>
>
> – Scott
>
> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan 
> wrote:
>
> 
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
> reduces it.  The investigation and remediation of possible mislocated data
> is still required.
>
>
> All nodes which reject a write prevent mislocated data.  There is still
> the possibility of some node having the same wrong view of the ring as the
> coordinator (including if they are the same node) accepting data.  Unless
> there are multiple nodes with the same wrong view of the ring, data loss is
> prevented for CL > ONE.
>
> 2. Rejecting writes is a louder form of alerting for users unaware of the
> scenario, those not already monitoring logs or metrics.
>
>
> Without this patch no one is aware of any issues at all.  Maybe you are
> referring to a situation where the patch is applied, but the default
> behavior is to still accept the “bad” data?  In that case yes, turning on
> rejection makes it “louder” in that your queries can fail if too many nodes
> are wrong.
>
> 3. Rejecting writes does not capture all places where the problem is
> occurring.  Only logging/metrics fully captures everywhere the problem is
> occurring.
>
>
> Not sure what you are saying here.
>
> nodes can be rejecting writes when they are in fact correct hence causing
> “over-eager unavailability”.
>
>
> When would this occur?  I guess when the node with the bad ring
> information is a replica sent data from a coordinator with the correct ring
> state?  There would be no “unavailability” here unless there were multiple
> nodes in such a state.  I also again would not call this over eager,
> because the node with the bad ring state is f’ed up and needs to be fixed.
> So it being considered unavailable doesn’t seem over-eager to me.
>
> Given the fact that a user can read NEWS.txt and turn off this rejection
> of writes, I see no reason not to err on the side of “the setting which
> gives better protection even if it is not perfect”.  We should not let the
> want to solve everything prevent incremental improvements, especially when
> we actually do have the solution coming in TCM.
>
> -Jeremiah
>
> On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever  wrote:
>
>
> I'm less concerned with what the defaults are in each branch, and more the
> accuracy of

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
> Even when the fix is only partial, so really it's more about more forcefully 
> alerting the operator to the problem via over-eager unavailability …? 
> 
> Sometimes a principled stance can take us away from the important details in 
> the discussions.
My understanding of the ticket (having not dug deeply into the code, just 
reviewed the JIRA and this thread), is that this is as effective of a solution 
as we can have in a non-deterministic, non-epoch based, non-transactional 
metadata system. i.e. Gossip. I don't see it as a partial fix but I might be 
misunderstanding.

I'm not advocating for us having a rigid principled stance where we reject all 
nuance and don't discuss things. I'm advocating for us coalescing on a shared 
**default** stance of correctness unless otherwise excepted. We know we're a 
diverse group, we're all different people with different histories / values / 
opinions / cultures, and I think that's what makes this community as effective 
as it is.

But I **don't** think it's healthy for us to repeatedly re-litigate whether 
data loss is acceptable based on how long it's been around, or how frequently 
some of us on the project have observed some given phenomenon. My gut tells me 
we'd all be in a better place if we all started from 0 on a discussion like 
this as "Ok, data loss is unacceptable. Unless otherwise warranted, we should 
do all we can to fix this on all supported branches as our default response".

On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
> Thanks all for discussion on this.
> 
> 
> 
> It’s hard to describe the sinking feeling that hit me when it became clear to 
> me how common this problem is - and how horribly difficult it is to prove one 
> has encountered this bug.
> 
> 
> 
> Two years ago, my understanding was that this is an exceptionally rare and 
> transient issue unlikely to occur after all the work we put into gossip. My 
view was that gossip had basically been shored up and that Transactional 
> Metadata is the proper fix for this with its epoch design (which is true).
> 
> 
> 
> Since that time, I’ve received several urgent messages from major users of 
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking 
> about this bug. Some were able to verify the presence of lost data in 
> SSTables on nodes where it didn’t belong, demonstrate empty read responses 
> for data that is known proof-positive to exist (think content-addressable 
> stores), or reproduce this behavior in a local cluster after forcing 
> disagreement.
> 
> 
> 
> The severity and frequency of this issue combined with the business risk to 
> Apache Cassandra users changed my mind about fixing it in earlier branches 
> despite TCM having been merged to fix it for good on trunk.
> 
> 
> 
> The guards in this patch are extensive: point reads, range reads, mutations, 
> repair, incoming / outgoing streams, hints, merkle tree requests, and others 
> I’m forgetting. They’re simple guards, and while they touch many subsystems, 
> they’re not invasive changes.
> 
> 
> 
> There is no reasonable scenario that’s common enough that would justify 
> disabling a guard preventing silent data loss by default. I appreciate that a 
> prop exists to permit or warn in the presence of data loss for anyone who may 
> want that, in the spirit of users being in control of their clusters’ 
> behavior.
> 
> 
> 
> Very large operators may only see indications the guard took effect for a 
> handful of queries per day — but in instances where ownership disagreement is 
> prolonged, the patch is an essential guard against large-scale unrecoverable 
> data loss and incorrect responses to queries. I’ll further take the position 
> that those few queries in transient disagreement scenarios would be 
> justification by themselves.
> 
> 
> 
> I support merging the patch to all proposed branches and enabling the guard 
> by default.
> 
> 
> 
> – Scott
> 
> 
>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan  
>> wrote:
>> 
>>> 1. Rejecting writes does not prevent data loss in this situation.  It only 
>>> reduces it.  The investigation and remediation of possible mislocated data 
>>> is still required.
>> 
>> All nodes which reject a write prevent mislocated data.  There is still the 
>> possibility of some node having the same wrong view of the ring as the 
>> coordinator (including if they are the same node) accepting data.  Unless 
>> there are multiple nodes with the same wrong view of the ring, data loss is 
>> prevented for CL > ONE.
>> 
>>> 2. Rejecting writes is a louder form of alerting for users unaware of the 
>>> scenario, those not already monitoring logs or metrics.
>> 
>> Without this patch no one is aware of any issues at all.  Maybe you are 
>> referring to a situation where the patch is applied, but the default 
>> behavior is to still accept the “bad” data?  In that case yes, turning on 
>> rejection makes it “louder” in that your queries can fail if too many nodes 
>> ar

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread C. Scott Andreas
Thanks all for discussion on this.

It’s hard to describe the sinking feeling that hit me when it became clear to me how common this problem is - and how horribly difficult it is to prove one has encountered this bug.

Two years ago, my understanding was that this is an exceptionally rare and transient issue unlikely to occur after all the work we put into gossip. My view was that gossip had basically been shored up and that Transactional Metadata is the proper fix for this with its epoch design (which is true).

Since that time, I’ve received several urgent messages from major users of Apache Cassandra and even customers of Cassandra ecosystem vendors asking about this bug. Some were able to verify the presence of lost data in SSTables on nodes where it didn’t belong, demonstrate empty read responses for data that is known proof-positive to exist (think content-addressable stores), or reproduce this behavior in a local cluster after forcing disagreement.

The severity and frequency of this issue combined with the business risk to Apache Cassandra users changed my mind about fixing it in earlier branches despite TCM having been merged to fix it for good on trunk.

The guards in this patch are extensive: point reads, range reads, mutations, repair, incoming / outgoing streams, hints, merkle tree requests, and others I’m forgetting. They’re simple guards, and while they touch many subsystems, they’re not invasive changes.

There is no reasonable scenario that’s common enough that would justify disabling a guard preventing silent data loss by default. I appreciate that a prop exists to permit or warn in the presence of data loss for anyone who may want that, in the spirit of users being in control of their clusters’ behavior.

Very large operators may only see indications the guard took effect for a handful of queries per day — but in instances where ownership disagreement is prolonged, the patch is an essential guard against large-scale unrecoverable data loss and incorrect responses to queries. I’ll further take the position that those few queries in transient disagreement scenarios would be justification by themselves.

I support merging the patch to all proposed branches and enabling the guard by default.

– Scott

On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan  wrote:

> 1. Rejecting writes does not prevent data loss in this situation.  It only reduces it.  The investigation and remediation of possible mislocated data is still required.

All nodes which reject a write prevent mislocated data.  There is still the possibility of some node having the same wrong view of the ring as the coordinator (including if they are the same node) accepting data.  Unless there are multiple nodes with the same wrong view of the ring, data loss is prevented for CL > ONE.

> 2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.

Without this patch no one is aware of any issues at all.  Maybe you are referring to a situation where the patch is applied, but the default behavior is to still accept the “bad” data?  In that case yes, turning on rejection makes it “louder” in that your queries can fail if too many nodes are wrong.

> 3. Rejecting writes does not capture all places where the problem is occurring.  Only logging/metrics fully captures everywhere the problem is occurring.

Not sure what you are saying here.

> nodes can be rejecting writes when they are in fact correct hence causing “over-eager unavailability”.

When would this occur?  I guess when the node with the bad ring information is a replica sent data from a coordinator with the correct ring state?  There would be no “unavailability” here unless there were multiple nodes in such a state.  I also again would not call this over eager, because the node with the bad ring state is f’ed up and needs to be fixed.  So it being considered unavailable doesn’t seem over-eager to me.

Given the fact that a user can read NEWS.txt and turn off this rejection of writes, I see no reason not to err on the side of “the setting which gives better protection even if it is not perfect”.  We should not let the want to solve everything prevent incremental improvements, especially when we actually do have the solution coming in TCM.

-Jeremiah

On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever  wrote:

I'm less concerned with what the defaults are in each branch, and more the accuracy of what we say, e.g. in NEWS.txt

This is my understanding so far, and where I hoped to be corrected.

1. Rejecting writes does not prevent data loss in this situation.  It only reduces it.  The investigation and remediation of possible mislocated data is still required.

2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.

3. Rejecting writes does not capture all places where the problem is occurring.  Only logging/metrics fully captures everywhere the problem is occurri

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeremiah Jordan
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
> reduces it.  The investigation and remediation of possible mislocated data
> is still required.
>

All nodes which reject a write prevent mislocated data.  There is still the
possibility of some node having the same wrong view of the ring as the
coordinator (including if they are the same node) accepting data.  Unless
there are multiple nodes with the same wrong view of the ring, data loss is
prevented for CL > ONE.

2. Rejecting writes is a louder form of alerting for users unaware of the
> scenario, those not already monitoring logs or metrics.
>

Without this patch no one is aware of any issues at all.  Maybe you are
referring to a situation where the patch is applied, but the default
behavior is to still accept the “bad” data?  In that case yes, turning on
rejection makes it “louder” in that your queries can fail if too many nodes
are wrong.

3. Rejecting writes does not capture all places where the problem is
> occurring.  Only logging/metrics fully captures everywhere the problem is
> occurring.
>

Not sure what you are saying here.

nodes can be rejecting writes when they are in fact correct hence
causing “over-eager
> unavailability”.
>

When would this occur?  I guess when the node with the bad ring information
is a replica that is sent data from a coordinator with the correct ring
state?  There would be no “unavailability” here unless there were multiple
nodes in such a state.  I also again would not call this over-eager, because
the node with the bad ring state is f’ed up and needs to be fixed.  So being
considered unavailable doesn’t seem over-eager to me.

Given the fact that a user can read NEWS.txt and turn off this rejection of
writes, I see no reason not to err on the side of “the setting which gives
better protection even if it is not perfect”.  We should not let the desire
to solve everything prevent incremental improvements, especially when we
actually do have the solution coming in TCM.

-Jeremiah

On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever  wrote:

>
> I'm less concerned with what the defaults are in each branch, and more the
> accuracy of what we say, e.g. in NEWS.txt
>
> This is my understanding so far, and where I hoped to be corrected.
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
> reduces it.  The investigation and remediation of possible mislocated data
> is still required.
>
> 2. Rejecting writes is a louder form of alerting for users unaware of the
> scenario, those not already monitoring logs or metrics.
>
> 3. Rejecting writes does not capture all places where the problem is
> occurring.  Only logging/metrics fully captures everywhere the problem is
> occurring.
>
> 4. This situation can be a consequence of other problems (C* or
> operational), not only range movements and the nature of gossip.
>
>
> (2) is the primary argument I see for setting rejection to default.  We
> need to inform the user that data mislocation can still be happening, and
> the only way to fully capture it is via monitoring of enabled
> logging/metrics.  We can also provide information about when range
> movements can cause this, and that nodes can be rejecting writes when they
> are in fact correct hence causing “over-eager unavailability”.  And
> furthermore, point people to TCM.
>
>
>
> On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan 
> wrote:
>
>> JD we know it had nothing to do with range movements and could/should
>>> have been prevented far simpler with operational correctness/checks.
>>>
>> “Be better” is not the answer.  Also I think you are confusing our
>> incidents, the out of range token issue we saw was not because of an
>> operational “oops” that could have been avoided.
>>
>> In the extreme, when no writes have gone to any of the replicas, what
>>> happened ? Either this was CL.*ONE, or it was an operational failure (not
>>> C* at fault).  If it's an operational fault, both the coordinator and the
>>> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
>>> problem still exists (and with rejection enabled the operator is now more
>>> likely to ignore it).
>>>
>>
>> If some node has a bad ring state, it can easily send no writes to the
>> correct place; no need for CL ONE. With the current system behavior, CL
>> ALL will be successful, with all the nodes sent a mutation happily
>> accepting and acking data they do not own.
>>
>> Yes, even with this patch if you are using CL ONE, if the coordinator has
>> a faulty ring state where no replica is “real” and it also decides that it
>> is one of the replicas, then you will have a successful write, even though
>> no correct node got the data.  If you are using CL ONE you already know you
>> are taking on a risk.  Not great, but there should be evidence in other
>> nodes of the bad thing occurring at the least.  Also for this same ring
>> state, for any CL > ONE with the patch the write would fail (assuming only
>> 

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
I'm less concerned with what the defaults are in each branch, and more the
accuracy of what we say, e.g. in NEWS.txt

This is my understanding so far, and where I hoped to be corrected.

1. Rejecting writes does not prevent data loss in this situation.  It only
reduces it.  The investigation and remediation of possible mislocated data
is still required.

2. Rejecting writes is a louder form of alerting for users unaware of the
scenario, those not already monitoring logs or metrics.

3. Rejecting writes does not capture all places where the problem is
occurring.  Only logging/metrics fully captures everywhere the problem is
occurring.

4. This situation can be a consequence of other problems (C* or
operational), not only range movements and the nature of gossip.


(2) is the primary argument I see for setting rejection to default.  We
need to inform the user that data mislocation can still be happening, and
the only way to fully capture it is via monitoring of enabled
logging/metrics.  We can also provide information about when range
movements can cause this, and that nodes can be rejecting writes when they
are in fact correct hence causing “over-eager unavailability”.  And
furthermore, point people to TCM.



On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan 
wrote:

> JD we know it had nothing to do with range movements and could/should have
>> been prevented far simpler with operational correctness/checks.
>>
> “Be better” is not the answer.  Also I think you are confusing our
> incidents, the out of range token issue we saw was not because of an
> operational “oops” that could have been avoided.
>
> In the extreme, when no writes have gone to any of the replicas, what
>> happened ? Either this was CL.*ONE, or it was an operational failure (not
>> C* at fault).  If it's an operational fault, both the coordinator and the
>> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
>> problem still exists (and with rejection enabled the operator is now more
>> likely to ignore it).
>>
>
> If some node has a bad ring state, it can easily send no writes to the
> correct place; no need for CL ONE. With the current system behavior, CL ALL
> will be successful, with all the nodes sent a mutation happily accepting
> and acking data they do not own.
>
> Yes, even with this patch if you are using CL ONE, if the coordinator has
> a faulty ring state where no replica is “real” and it also decides that it
> is one of the replicas, then you will have a successful write, even though
> no correct node got the data.  If you are using CL ONE you already know you
> are taking on a risk.  Not great, but there should be evidence in other
> nodes of the bad thing occurring at the least.  Also for this same ring
> state, for any CL > ONE with the patch the write would fail (assuming only
> a single node has the bad ring state).
>
> Even when the fix is only partial, so really it's more about more
>> forcefully alerting the operator to the problem via over-eager
>> unavailability …?
>>
>
> Not sure why you are calling this “over-eager unavailability”.  If the
> data is going to the wrong nodes, then the nodes may as well be down.
> Unless the end user is writing at CL ANY, they have asked to be acked only
> once CL nodes which own the data have acknowledged it.
>
> -Jeremiah
>
> On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever  wrote:
>
>> Great that the discussion explores the issue as well.
>>
>> So far we've heard three* companies being impacted, and four times in
>> total…?  Info is helpful here.
>>
>> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
>> assuming the company you refer to doesn't overlap. JD we know it had
>> nothing to do with range movements and could/should have been prevented far
>> simpler with operational correctness/checks.
>>
>> In the extreme, when no writes have gone to any of the replicas, what
>> happened ? Either this was CL.*ONE, or it was an operational failure (not
>> C* at fault).  If it's an operational fault, both the coordinator and the
>> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
>> problem still exists (and with rejection enabled the operator is now more
>> likely to ignore it).
>>
>> WRT to the remedy, is it not to either run repair (when 1+ replica has
>> it), or to load flushed and recompacted sstables (from the period in
>> question) to their correct nodes.  This is not difficult, but
>> understandably lost-sleep and time-intensive.
>>
>> Neither of the above two points I feel are that material to the outcome,
>> but I think it helps keep the discussion on track and informative.   We
>> also know there are many competent operators out there that do detect data
>> loss.
>>
>>
>>
>> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe 
>> wrote:
>>
>>> If we don’t reject by default, but log by default, my fear is that we’ll
>>> simply be alerting the operator to something that has already gone very
>>> wrong that they may not be 

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeremiah Jordan
>
> JD we know it had nothing to do with range movements and could/should have
> been prevented far simpler with operational correctness/checks.
>
“Be better” is not the answer.  Also I think you are confusing our
incidents, the out of range token issue we saw was not because of an
operational “oops” that could have been avoided.

In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>

If some node has a bad ring state, it can easily send no writes to the
correct place; no need for CL ONE. With the current system behavior, CL ALL
will be successful, with all the nodes sent a mutation happily accepting and
acking data they do not own.

Yes, even with this patch if you are using CL ONE, if the coordinator has a
faulty ring state where no replica is “real” and it also decides that it is
one of the replicas, then you will have a successful write, even though no
correct node got the data.  If you are using CL ONE you already know you
are taking on a risk.  Not great, but there should be evidence in other
nodes of the bad thing occurring at the least.  Also for this same ring
state, for any CL > ONE with the patch the write would fail (assuming only
a single node has the bad ring state).
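[Editorial note: a toy ack count makes the consistency-level argument above concrete. This is a sketch, not Cassandra code; `required_acks` and `write_succeeds` are invented names.]

```python
# Toy model of write-ack counting, to illustrate the CL argument above.
# NOT Cassandra code: required_acks and write_succeeds are invented names.

def required_acks(cl: str, rf: int) -> int:
    """Replica acks needed for a write at the given consistency level."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def write_succeeds(cl: str, rf: int, accepting_replicas: int) -> bool:
    return accepting_replicas >= required_acks(cl, rf)

# RF=3, and only the single node with the wrong ring state accepts the
# misrouted write while correctly-informed replicas reject it:
print(write_succeeds("ONE", 3, 1))     # True  -- CL ONE can still "succeed"
print(write_succeeds("QUORUM", 3, 1))  # False -- CL > ONE now fails loudly
print(write_succeeds("ALL", 3, 1))     # False
```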

Even when the fix is only partial, so really it's more about more
> forcefully alerting the operator to the problem via over-eager
> unavailability …?
>

Not sure why you are calling this “over-eager unavailability”.  If the data
is going to the wrong nodes, then the nodes may as well be down.  Unless the
end user is writing at CL ANY, they have asked to be acked only once CL
nodes which own the data have acknowledged it.

-Jeremiah

On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever  wrote:

> Great that the discussion explores the issue as well.
>
> So far we've heard three* companies being impacted, and four times in
> total…?  Info is helpful here.
>
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
> assuming the company you refer to doesn't overlap. JD we know it had
> nothing to do with range movements and could/should have been prevented far
> simpler with operational correctness/checks.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> WRT to the remedy, is it not to either run repair (when 1+ replica has
> it), or to load flushed and recompacted sstables (from the period in
> question) to their correct nodes.  This is not difficult, but
> understandably lost-sleep and time-intensive.
>
> Neither of the above two points I feel are that material to the outcome,
> but I think it helps keep the discussion on track and informative.   We
> also know there are many competent operators out there that do detect data
> loss.
>
>
>
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe 
> wrote:
>
>> If we don’t reject by default, but log by default, my fear is that we’ll
>> simply be alerting the operator to something that has already gone very
>> wrong that they may not be in any position to ever address.
>>
>> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>>
>> 
>> I’m +1 on enabling rejection by default on all branches. We have been
>> bitten by silent data loss (due to other bugs like the schema issues in
>> 4.1) from lack of rejection on several occasions, and short of writing
>> extremely specialized tooling it’s unrecoverable. While both lack of
>> availability and data loss are critical, I will always pick lack of
>> availability over data loss. It’s better to fail a write that will be
>> lost than silently lose it.
>>
>> Of course, a change like this requires very good communication in
>> NEWS.txt and elsewhere but I think it’s well worth it. While it may
>> surprise some users I think they would be more surprised that they were
>> silently losing data.
>>
>> Jordan
>>
>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>>
>>> Thanks for starting the thread Caleb, it is a big and impacting patch.
>>>
>>> Appreciate the criticality, in a new major release rejection by default
>>> is obvious.   Otherwise the logging and metrics is an important addition to
>>> help users validate the existence and degree of any problem.
>>>
>>> Also worth mentioning that rejecting writes can cause degraded
>>> availability in situations that pose no problem.  This is a coordination
>>> problem on a probabilistic design, it's choose your evi

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
To clarify my response:

We didn’t hit a bug “like it”. We hit a bug that resulted in an improper
view of the ring (on my phone so can’t dig up the JIRA but it was a ring
issue introduced in 4.1.0 and fixed in 4.1.4 iirc). There have been several
bugs of this form in the past. So it wasn’t that we were operationally
wrong. c* itself was wrong. And because this patch wasn’t present we had
silent data loss as a result. Even worse the scale of that data loss is
extremely hard to measure.

This incident is considered one of the top three most impactful /
significant incidents ever at my current employer — all three were data
loss related and two could’ve been prevented by this patch, again trading
availability instead of unrecoverable data loss that required setting up an
entirely new cluster and having a copy of the data in an external system
(not backups but literally an unrelated external system).

Put more succinctly: we got very lucky and had this patch been there and
rejection enabled we wouldn’t have needed luck.

Jordan

On Thu, Sep 12, 2024 at 12:36 Mick Semb Wever  wrote:

> Great that the discussion explores the issue as well.
>
> So far we've heard three* companies being impacted, and four times in
> total…?  Info is helpful here.
>
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
> assuming the company you refer to doesn't overlap. JD we know it had
> nothing to do with range movements and could/should have been prevented far
> simpler with operational correctness/checks.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> WRT to the remedy, is it not to either run repair (when 1+ replica has
> it), or to load flushed and recompacted sstables (from the period in
> question) to their correct nodes.  This is not difficult, but
> understandably lost-sleep and time-intensive.
>
> Neither of the above two points I feel are that material to the outcome,
> but I think it helps keep the discussion on track and informative.   We
> also know there are many competent operators out there that do detect data
> loss.
>
>
>
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe 
> wrote:
>
>> If we don’t reject by default, but log by default, my fear is that we’ll
>> simply be alerting the operator to something that has already gone very
>> wrong that they may not be in any position to ever address.
>>
>> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>>
>> 
>> I’m +1 on enabling rejection by default on all branches. We have been
>> bitten by silent data loss (due to other bugs like the schema issues in
>> 4.1) from lack of rejection on several occasions, and short of writing
>> extremely specialized tooling it’s unrecoverable. While both lack of
>> availability and data loss are critical, I will always pick lack of
>> availability over data loss. It’s better to fail a write that will be
>> lost than silently lose it.
>>
>> Of course, a change like this requires very good communication in
>> NEWS.txt and elsewhere but I think it’s well worth it. While it may
>> surprise some users I think they would be more surprised that they were
>> silently losing data.
>>
>> Jordan
>>
>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>>
>>> Thanks for starting the thread Caleb, it is a big and impacting patch.
>>>
>>> Appreciate the criticality, in a new major release rejection by default
>>> is obvious.   Otherwise the logging and metrics is an important addition to
>>> help users validate the existence and degree of any problem.
>>>
>>> Also worth mentioning that rejecting writes can cause degraded
>>> availability in situations that pose no problem.  This is a coordination
>>> problem on a probabilistic design, it's choose your evil: unnecessary
>>> degraded availability or mislocated data (eventual data loss).   Logging
>>> and metrics makes alerting on and handling the data mislocation possible,
>>> i.e. avoids data loss with manual intervention.  (Logging and metrics also
>>> face the same problem with false positives.)
>>>
>>> I'm +0 for rejection default in 5.0.1, and +1 for only logging default
>>> in 4.x
>>>
>>>
>>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>>>
 This patch is so hard for me.

 The safety it adds is critical and should have been added a decade ago.
 Also it’s a huge patch, and touches “everything”.

 It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.

 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
 loss (which it implicitly is), I guess?



 > On Sep 12, 2024, at 9:46 AM, Brandon Williams 
 wrote:
 >
 > On Thu, Sep 12

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread David Capwell
> if we are counting on users to read NEWS.txt, can we not count on them to 
> enable rejection if this is important to them?

I think we can make the inverse statement… if accepting data loss is a
tradeoff they want, then disabling is there for them.  So we could default
to safety and let users opt out if they prefer?
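[Editorial note: for reference, opting out would look something like the following in cassandra.yaml. The option names here are taken from a reading of the CASSANDRA-13704 patches and may differ per branch; verify them against the NEWS.txt entry for your release before relying on them.]

```yaml
# cassandra.yaml -- option names per the CASSANDRA-13704 patches; verify
# against the NEWS.txt entry for your release.
log_out_of_token_range_requests: true      # keep the logging/metrics signal
reject_out_of_token_range_requests: false  # opt out of rejection (accept old behavior)
```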

> On Sep 12, 2024, at 11:19 AM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
>  wrote:
>> 
>> I think I can count at least 4 people on this thread who literally have lost 
>> sleep over this.
> 
> Probably good examples of not being the majority though, heh.
> 
> If we are counting on users to read NEWS.txt, can we not count on them
> to enable rejection if this is important to them?
> 
> Kind Regards,
> Brandon



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
reply below

On Thu, 12 Sept 2024 at 21:56, Josh McKenzie  wrote:

> I'd like to propose we treat all data loss bugs as "fix by default on all
> supported branches even if that might introduce user-facing changes".
>
> Even if only N of M people on a thread have experienced it.
> Even if we only uncover it through testing (looking at you Harry).
>


And…
Even when the fix is only partial, so really it's more about more
forcefully alerting the operator to the problem via over-eager
unavailability …?

Sometimes a principled stance can take us away from the important details
in the discussions.  (Note, I do accept doing this in 5.0, on the right
understanding, e.g. the importance of client retry when doing range
movements.)


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Dinesh Joshi
My 2c are below –

We have a patch that is preventing a known data loss issue. People may or
may not know they're suffering from this issue so this should go in all
supported versions of Cassandra with it enabled by default. Will this cause
issues for operators? Sure. Is it worth keeping this feature off to avoid
issues for operators? No. We can mitigate any upgrade related issues by
putting in a warning in the release notes.

I'm +1 on this patch landing in all supported branches of Cassandra and
feature being on by default with adequate warnings for the operator.

Thanks,

Dinesh



On Thu, Sep 12, 2024 at 12:56 PM Josh McKenzie  wrote:

> I'd like to propose we treat all data loss bugs as "fix by default on all
> supported branches even if that might introduce user-facing changes".
>
> Even if only N of M people on a thread have experienced it.
> Even if we only uncover it through testing (looking at you Harry).
>
> My gut tells me this is something we should have a clear cultural value
> system around as a project, and that value system should be "Above all
> else, we don't lose data". Just because users aren't aware it might be
> happening doesn't mean it's not a *massive* problem.
>
> I would bet good money that there are *a lot* of user-felt pains using
> this project that we're all unfortunately insulated from.
>
> On Thu, Sep 12, 2024, at 3:35 PM, Mick Semb Wever wrote:
>
> Great that the discussion explores the issue as well.
>
> So far we've heard three* companies being impacted, and four times in
> total…?  Info is helpful here.
>
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
> assuming the company you refer to doesn't overlap. JD we know it had
> nothing to do with range movements and could/should have been prevented far
> simpler with operational correctness/checks.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> WRT to the remedy, is it not to either run repair (when 1+ replica has
> it), or to load flushed and recompacted sstables (from the period in
> question) to their correct nodes.  This is not difficult, but
> understandably lost-sleep and time-intensive.
>
> Neither of the above two points I feel are that material to the outcome,
> but I think it helps keep the discussion on track and informative.   We
> also know there are many competent operators out there that do detect data
> loss.
>
>
>
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe 
> wrote:
>
> If we don’t reject by default, but log by default, my fear is that we’ll
> simply be alerting the operator to something that has already gone very
> wrong that they may not be in any position to ever address.
>
> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>
> 
> I’m +1 on enabling rejection by default on all branches. We have been
> bitten by silent data loss (due to other bugs like the schema issues in
> 4.1) from lack of rejection on several occasions, and short of writing
> extremely specialized tooling it’s unrecoverable. While both lack of
> availability and data loss are critical, I will always pick lack of
> availability over data loss. It’s better to fail a write that will be lost
> than silently lose it.
>
> Of course, a change like this requires very good communication in NEWS.txt
> and elsewhere but I think it’s well worth it. While it may surprise some
> users I think they would be more surprised that they were silently losing
> data.
>
> Jordan
>
> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>
> Thanks for starting the thread Caleb, it is a big and impacting patch.
>
> Appreciate the criticality, in a new major release rejection by default is
> obvious.   Otherwise the logging and metrics is an important addition to
> help users validate the existence and degree of any problem.
>
> Also worth mentioning that rejecting writes can cause degraded
> availability in situations that pose no problem.  This is a coordination
> problem on a probabilistic design, it's choose your evil: unnecessary
> degraded availability or mislocated data (eventual data loss).   Logging
> and metrics makes alerting on and handling the data mislocation possible,
> i.e. avoids data loss with manual intervention.  (Logging and metrics also
> face the same problem with false positives.)
>
> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
> 4.x
>
>
> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>
> This patch is so hard for me.
>
> The safety it adds is critical and should have been added a decade ago.
> Also it’s a huge patch, and touches “everything”.
>
> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>
> 4.0 

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
I'd like to propose we treat all data loss bugs as "fix by default on all 
supported branches even if that might introduce user-facing changes".

Even if only N of M people on a thread have experienced it.
Even if we only uncover it through testing (looking at you Harry).

My gut tells me this is something we should have a clear cultural value system 
around as a project, and that value system should be "Above all else, we don't 
lose data". Just because users aren't aware it might be happening doesn't mean 
it's not a *massive* problem.

I would bet good money that there are *a lot* of user-felt pains using this 
project that we're all unfortunately insulated from. 

On Thu, Sep 12, 2024, at 3:35 PM, Mick Semb Wever wrote:
> Great that the discussion explores the issue as well.
> 
> So far we've heard three* companies being impacted, and four times in total…? 
>  Info is helpful here.  
> 
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm 
> assuming the company you refer to doesn't overlap. JD we know it had nothing 
> to do with range movements and could/should have been prevented far simpler 
> with operational correctness/checks.
> 
> In the extreme, when no writes have gone to any of the replicas, what 
> happened ? Either this was CL.*ONE, or it was an operational failure (not C* 
> at fault).  If it's an operational fault, both the coordinator and the node 
> can be wrong.  With CL.ONE, just the coordinator can be wrong and the problem 
> still exists (and with rejection enabled the operator is now more likely to 
> ignore it).
> 
> WRT to the remedy, is it not to either run repair (when 1+ replica has it), 
> or to load flushed and recompacted sstables (from the period in question) to 
> their correct nodes.  This is not difficult, but understandably lost-sleep 
> and time-intensive.
> 
> Neither of the above two points I feel are that material to the outcome, but 
> I think it helps keep the discussion on track and informative.   We also know 
> there are many competent operators out there that do detect data loss.
> 
> 
> 
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe  
> wrote:
>> If we don’t reject by default, but log by default, my fear is that we’ll 
>> simply be alerting the operator to something that has already gone very 
>> wrong that they may not be in any position to ever address.
>> 
>>> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>>> 
>>> I’m +1 on enabling rejection by default on all branches. We have been
>>> bitten by silent data loss (due to other bugs like the schema issues in
>>> 4.1) from lack of rejection on several occasions, and short of writing
>>> extremely specialized tooling it’s unrecoverable. While both lack of
>>> availability and data loss are critical, I will always pick lack of
>>> availability over data loss. It’s better to fail a write that will be
>>> lost than silently lose it.
>>>
>>> Of course, a change like this requires very good communication in
>>> NEWS.txt and elsewhere but I think it’s well worth it. While it may
>>> surprise some users I think they would be more surprised that they were
>>> silently losing data.
>>> 
>>> Jordan 
>>> 
>>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
 Thanks for starting the thread Caleb, it is a big and impacting patch.
 
 Appreciate the criticality, in a new major release rejection by default is 
 obvious.   Otherwise the logging and metrics is an important addition to 
 help users validate the existence and degree of any problem.  
 
 Also worth mentioning that rejecting writes can cause degraded 
 availability in situations that pose no problem.  This is a coordination 
 problem on a probabilistic design, it's choose your evil: unnecessary 
 degraded availability or mislocated data (eventual data loss).   Logging 
 and metrics makes alerting on and handling the data mislocation possible, 
 i.e. avoids data loss with manual intervention.  (Logging and metrics also 
 face the same problem with false positives.)
 
 I'm +0 for rejection default in 5.0.1, and +1 for only logging default in 
 4.x
 
 
 On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
> This patch is so hard for me. 
> 
> The safety it adds is critical and should have been added a decade ago.
> Also it’s a huge patch, and touches “everything”. 
> 
> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  
> 
> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data 
> loss (which it implicitly is), I guess?
> 
> 
> 
> > On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> > 
> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> >  wrote:
> >> 
> >> Are you opposed to the patch in its entirety, or just rejecting unsafe 
> >> operations by default?
> > 
> > I had the latter in mind.  Changing any default in a patch release i

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Ekaterina Dimitrova
I agree we should be precise in the docs after people share their opinion
and experience in this thread and the ticket work gets settled. Thank you
Caleb for opening it! It is important

On Thu, 12 Sep 2024 at 15:49, Mick Semb Wever  wrote:

> Yes, and my usage of CL.*ONE wasn't so correct (as writes are
> backgrounded), but…
> The point is we should be accurate and precise in talking about this (folk
> will come back and read this thread), both in how it can manifest, the
> limitations of both logging and rejection, what possible remedies are, and
> far simpler ways of ensuring cluster correctness (e.g. nodetool status and
> gossipinfo checks).
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Yes, and my usage of CL.*ONE wasn't so correct (as writes are
backgrounded), but…
The point is we should be accurate and precise in talking about this (folk
will come back and read this thread), both in how it can manifest, the
limitations of both logging and rejection, what possible remedies are, and
far simpler ways of ensuring cluster correctness (e.g. nodetool status and
gossipinfo checks).
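
[The "nodetool status and gossipinfo checks" mentioned above reduce to one invariant: every node should report the same token ownership. A toy sketch of that cross-check, for illustration only — the addresses and the simplified `{token: owner}` ring model are made up, and a real check would parse nodetool output rather than take dicts:]

```python
from collections import Counter

def find_divergent_nodes(ring_views):
    """Return nodes whose token -> owner map differs from the majority view."""
    # Canonicalize each node's view so equal views hash identically.
    frozen = {node: tuple(sorted(view.items())) for node, view in ring_views.items()}
    majority_view, _count = Counter(frozen.values()).most_common(1)[0]
    return sorted(node for node, view in frozen.items() if view != majority_view)

views = {
    "10.0.0.1": {0: "10.0.0.1", 100: "10.0.0.2", 200: "10.0.0.3"},
    "10.0.0.2": {0: "10.0.0.1", 100: "10.0.0.2", 200: "10.0.0.3"},
    # Stale ring view: this node believes it also owns token 100.
    "10.0.0.3": {0: "10.0.0.1", 100: "10.0.0.3", 200: "10.0.0.3"},
}
print(find_divergent_nodes(views))  # ['10.0.0.3']
```

[A node flagged this way is exactly the kind of coordinator that can mislocate writes, which is why such checks can catch the problem before any rejection logic has to.]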


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
First, any violation of consistency should be treated as data loss, because
we can’t tell what people are doing as a result of the missing data
downstream (it may trigger an action outside of the database that is
unrecoverable).

Second, if you have a coordinator with a broken snitch configured, it may
misunderstand DCs or racks and send all data to the wrong place regardless
of CL used. “Someone should have noticed” yea, but we silently let one bad
coordinator do horribly wrong things today.

On Sep 12, 2024, at 12:35 PM, Mick Semb Wever  wrote:

Great that the discussion explores the issue as well.

So far we've heard three* companies being impacted, and four times in
total…?  Info is helpful here.

*) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
assuming the company you refer to doesn't overlap. JD we know it had
nothing to do with range movements and could/should have been prevented far
simpler with operational correctness/checks.

In the extreme, when no writes have gone to any of the replicas, what
happened ? Either this was CL.*ONE, or it was an operational failure (not
C* at fault).  If it's an operational fault, both the coordinator and the
node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
problem still exists (and with rejection enabled the operator is now more
likely to ignore it).

WRT to the remedy, is it not to either run repair (when 1+ replica has it),
or to load flushed and recompacted sstables (from the period in question)
to their correct nodes.  This is not difficult, but understandably
lost-sleep and time-intensive.

Neither of the above two points I feel are that material to the outcome,
but I think it helps keep the discussion on track and informative.  We
also know there are many competent operators out there that do detect data
loss.

On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe  wrote:

If we don’t reject by default, but log by default, my fear is that we’ll
simply be alerting the operator to something that has already gone very
wrong that they may not be in any position to ever address.

On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:

I’m +1 on enabling rejection by default on all branches. We have been bit
by silent data loss (due to other bugs like the schema issues in 4.1) from
lack of rejection on several occasions and short of writing extremely
specialized tooling its unrecoverable. While both lack of availability and
data loss are critical, I will always pick lack of availability over data
loss. Its better to fail a write that will be lost than silently lose it.

Of course, a change like this requires very good communication in NEWS.txt
and elsewhere but I think its well worth it. While it may surprise some
users I think they would be more surprised that they were silently losing
data.

Jordan

On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:

Thanks for starting the thread Caleb, it is a big and impacting patch.

Appreciate the criticality, in a new major release rejection by default is
obvious.   Otherwise the logging and metrics is an important addition to
help users validate the existence and degree of any problem.

Also worth mentioning that rejecting writes can cause degraded availability
in situations that pose no problem.  This is a coordination problem on a
probabilistic design, it's choose your evil: unnecessary degraded
availability or mislocated data (eventual data loss).   Logging and metrics
makes alerting on and handling the data mislocation possible, i.e. avoids
data loss with manual intervention.  (Logging and metrics also face the
same problem with false positives.)

I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
4.x

On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:

This patch is so hard for me.

The safety it adds is critical and should have been added a decade ago.
Also it’s a huge patch, and touches “everything”. 

It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  

4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss (which it implicitly is), I guess?



> On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>  wrote:
>> 
>> Are you opposed to the patch in its entirety, or just rejecting unsafe operations by default?
> 
> I had the latter in mind.  Changing any default in a patch release is
> a potential surprise for operators and one of this nature especially
> so.
> 
> Kind Regards,
> Brandon






Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Great that the discussion explores the issue as well.

So far we've heard three* companies being impacted, and four times in
total…?  Info is helpful here.

*) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
assuming the company you refer to doesn't overlap. JD we know it had
nothing to do with range movements and could/should have been prevented far
simpler with operational correctness/checks.

In the extreme, when no writes have gone to any of the replicas, what
happened ? Either this was CL.*ONE, or it was an operational failure (not
C* at fault).  If it's an operational fault, both the coordinator and the
node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
problem still exists (and with rejection enabled the operator is now more
likely to ignore it).

WRT to the remedy, is it not to either run repair (when 1+ replica has it),
or to load flushed and recompacted sstables (from the period in question)
to their correct nodes.  This is not difficult, but understandably
lost-sleep and time-intensive.
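
[The remedy described above maps onto standard tooling. A hedged sketch of what that runbook might look like — keyspace, table, host, and paths are placeholders, and exact flags should be checked against your Cassandra version's docs:]

```shell
# 1) If at least one correct replica received the write, anti-entropy
#    repair copies it to the remaining owners:
nodetool repair -full my_keyspace my_table

# 2) If only a "wrong" node holds the data, flush it and stream its
#    sstables back to the nodes that actually own the range:
nodetool flush my_keyspace my_table
sstableloader -d correct-node.example.com \
    /var/lib/cassandra/data/my_keyspace/my_table-<table-id>
```

[Step 2 is where the "time-intensive" part comes in: identifying which sstables cover the period in question before loading them.]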

Neither of the above two points I feel are that material to the outcome,
but I think it helps keep the discussion on track and informative.   We
also know there are many competent operators out there that do detect data
loss.
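
[The mislocation trade-off above can be made concrete with a toy model — illustrative only, not Cassandra internals. RF=3 and quorum=2; the true owners of the token are A, B, C, but a coordinator with a stale ring targets A plus two nodes that no longer own the range:]

```python
OWNERS = {"A", "B", "C"}  # nodes that actually own the token range

def write(targets, required_acks, reject_out_of_range):
    """Return (client_sees_success, number_of_true_replicas_written)."""
    acks, durable = 0, 0
    for node in targets:
        if node in OWNERS:
            acks += 1
            durable += 1          # a real owner persisted the write
        elif not reject_out_of_range:
            acks += 1             # wrong node silently accepts: ack, not durable
        # else: the wrong node rejects and contributes no ack
    return acks >= required_acks, durable

# Without rejection: the client sees a successful quorum write, but only
# one true replica has the data — silent under-replication / eventual loss.
ok, durable = write(["A", "X", "Y"], required_acks=2, reject_out_of_range=False)
assert ok and durable == 1

# With rejection: the same write fails loudly instead of lying.
ok, durable = write(["A", "X", "Y"], required_acks=2, reject_out_of_range=True)
assert not ok and durable == 1
```

[The model also shows the availability cost: a correctly-targeted write (`["A", "B", "C"]`) is unaffected, so rejection only fails requests that were already unsafe.]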



On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe 
wrote:

> If we don’t reject by default, but log by default, my fear is that we’ll
> simply be alerting the operator to something that has already gone very
> wrong that they may not be in any position to ever address.
>
> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>
> 
> I’m +1 on enabling rejection by default on all branches. We have been bit
> by silent data loss (due to other bugs like the schema issues in 4.1) from
> lack of rejection on several occasions and short of writing extremely
> specialized tooling its unrecoverable. While both lack of availability and
> data loss are critical, I will always pick lack of availability over data
> loss. Its better to fail a write that will be lost than silently lose it.
>
> Of course, a change like this requires very good communication in NEWS.txt
> and elsewhere but I think its well worth it. While it may surprise some
> users I think they would be more surprised that they were silently losing
> data.
>
> Jordan
>
> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>
>> Thanks for starting the thread Caleb, it is a big and impacting patch.
>>
>> Appreciate the criticality, in a new major release rejection by default
>> is obvious.   Otherwise the logging and metrics is an important addition to
>> help users validate the existence and degree of any problem.
>>
>> Also worth mentioning that rejecting writes can cause degraded
>> availability in situations that pose no problem.  This is a coordination
>> problem on a probabilistic design, it's choose your evil: unnecessary
>> degraded availability or mislocated data (eventual data loss).   Logging
>> and metrics makes alerting on and handling the data mislocation possible,
>> i.e. avoids data loss with manual intervention.  (Logging and metrics also
>> face the same problem with false positives.)
>>
>> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
>> 4.x
>>
>>
>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>>
>>> This patch is so hard for me.
>>>
>>> The safety it adds is critical and should have been added a decade ago.
>>> Also it’s a huge patch, and touches “everything”.
>>>
>>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>>>
>>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
>>> loss (which it implicitly is), I guess?
>>>
>>>
>>>
>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams 
>>> wrote:
>>> >
>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>>> >  wrote:
>>> >>
>>> >> Are you opposed to the patch in its entirety, or just rejecting
>>> unsafe operations by default?
>>> >
>>> > I had the latter in mind.  Changing any default in a patch release is
>>> > a potential surprise for operators and one of this nature especially
>>> > so.
>>> >
>>> > Kind Regards,
>>> > Brandon
>>>
>>>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
On Sep 12, 2024, at 12:22 PM, J. D. Jordan  wrote:

I have lost sleep (and data) over this multiple times in the past few
months, that was only recently tracked down to this exact scenario.

+1 for including it in all active releases and enabling the failure of the
writes on “wrong” nodes by default.

I haven’t looked at the patch, but as long as only one node of three gets
the data by mistake and fails, I would expect a LOCAL_QUORUM write with two
ack one fail to still succeed?

Yes. And most people won’t notice that. It’s just instantly safe and
everyone will be happy. What people will notice is that things like
concurrent overlapping bootstrap will start failing because sstables will
be the wrong range. Or repair during range movements.

-Jeremiah

On Sep 12, 2024, at 1:47 PM, Jordan West  wrote:

I think folks not losing sleep over this are only in that position because
they don’t know it’s happening. Like Brandon said, ignorance is bliss (but
it’s a false bliss).

Very few users do the work necessary to detect data loss outside the
obvious paths. I agree with Caleb, if we log and give them no means to
remediate we are giving them nightmares with no recourse. While failed
writes will be a surprise it’s the correct solution because it’s the only
one that prevents data loss which we should always strive to get rid of.

Jordan

On Thu, Sep 12, 2024 at 11:31 Caleb Rackliffe  wrote:

We aren’t counting on users to read NEWS.txt. That’s the point. We’re
saying we’re going to make things safer, as they should always have been,
and if someone out there has tooling that somehow allows them to avoid the
risks, they can disable rejection.

> On Sep 12, 2024, at 1:21 PM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
>  wrote:
>> 
>> I think I can count at least 4 people on this thread who literally have lost sleep over this.
> 
> Probably good examples of not being the majority though, heh.
> 
> If we are counting on users to read NEWS.txt, can we not count on them
> to enable rejection if this is important to them?
> 
> Kind Regards,
> Brandon



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread J. D. Jordan
I have lost sleep (and data) over this multiple times in the past few
months, that was only recently tracked down to this exact scenario.

+1 for including it in all active releases and enabling the failure of the
writes on “wrong” nodes by default.

I haven’t looked at the patch, but as long as only one node of three gets
the data by mistake and fails, I would expect a LOCAL_QUORUM write with two
ack one fail to still succeed?

-Jeremiah

On Sep 12, 2024, at 1:47 PM, Jordan West  wrote:

I think folks not losing sleep over this are only in that position because
they don’t know it’s happening. Like Brandon said, ignorance is bliss (but
it’s a false bliss).

Very few users do the work necessary to detect data loss outside the
obvious paths. I agree with Caleb, if we log and give them no means to
remediate we are giving them nightmares with no recourse. While failed
writes will be a surprise it’s the correct solution because it’s the only
one that prevents data loss which we should always strive to get rid of.

Jordan

On Thu, Sep 12, 2024 at 11:31 Caleb Rackliffe  wrote:

We aren’t counting on users to read NEWS.txt. That’s the point. We’re
saying we’re going to make things safer, as they should always have been,
and if someone out there has tooling that somehow allows them to avoid the
risks, they can disable rejection.

> On Sep 12, 2024, at 1:21 PM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
>  wrote:
>> 
>> I think I can count at least 4 people on this thread who literally have lost sleep over this.
> 
> Probably good examples of not being the majority though, heh.
> 
> If we are counting on users to read NEWS.txt, can we not count on them
> to enable rejection if this is important to them?
> 
> Kind Regards,
> Brandon
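
[For the LOCAL_QUORUM arithmetic in the question above: with RF=3 the quorum is floor(3/2)+1 = 2, so two acks plus one rejection from the out-of-range node still succeeds. A minimal sketch of that arithmetic (plain quorum math, not driver behavior):]

```python
def quorum(rf: int) -> int:
    """Acks required for a quorum write at replication factor rf."""
    return rf // 2 + 1

def write_succeeds(acks: int, rf: int) -> bool:
    return acks >= quorum(rf)

assert quorum(3) == 2
# Two acks, one rejection from the "wrong" node: the write still succeeds.
assert write_succeeds(acks=2, rf=3)
# If two of three targeted nodes rejected, the write would fail instead.
assert not write_succeeds(acks=1, rf=3)
```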



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
I think folks not losing sleep over this are only in that position because
they don’t know it’s happening. Like Brandon said, ignorance is bliss (but
it’s a false bliss).

Very few users do the work necessary to detect data loss outside the
obvious paths. I agree with Caleb, if we log and give them no means to
remediate we are giving them nightmares with no recourse. While failed
writes will be a surprise it’s the correct solution because it’s the only
one that prevents data loss which we should always strive to get rid of.

Jordan

On Thu, Sep 12, 2024 at 11:31 Caleb Rackliffe 
wrote:

> We aren’t counting on users to read NEWS.txt. That’s the point. We’re
> saying we’re going to make things safer, as they should always have been,
> and if someone out there has tooling that somehow allows them to avoid the
> risks, they can disable rejection.
>
> > On Sep 12, 2024, at 1:21 PM, Brandon Williams  wrote:
> >
> > On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
> >  wrote:
> >>
> >> I think I can count at least 4 people on this thread who literally have
> lost sleep over this.
> >
> > Probably good examples of not being the majority though, heh.
> >
> > If we are counting on users to read NEWS.txt, can we not count on them
> > to enable rejection if this is important to them?
> >
> > Kind Regards,
> > Brandon
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
We aren’t counting on users to read NEWS.txt. That’s the point. We’re saying 
we’re going to make things safer, as they should always have been, and if 
someone out there has tooling that somehow allows them to avoid the risks, they 
can disable rejection.

> On Sep 12, 2024, at 1:21 PM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
>  wrote:
>> 
>> I think I can count at least 4 people on this thread who literally have lost 
>> sleep over this.
> 
> Probably good examples of not being the majority though, heh.
> 
> If we are counting on users to read NEWS.txt, can we not count on them
> to enable rejection if this is important to them?
> 
> Kind Regards,
> Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Doug Rohrer
+1 on rejection-by-default, for several reasons:

1) Jordan’s point on the fact that recovery from this kind of data misplacement 
is very difficult.
2) Without any sort of warning or error in existing Cassandra installations, 
how many operators/users would actually know that they have been hit by this 
particular issue in the past? My guess is that the folks who have actually 
identified instances of this kind of data loss are ones who have a significant 
amount of experience running Cassandra and a team that has the ability to track 
down these kinds of issues, where most users may have never even known this was 
happening.

If we only warn/log, most folks will likely either not even see the issue (we 
log a lot) or not know what to do when it happens.

Doug

> On Sep 12, 2024, at 2:19 PM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
>  wrote:
>> 
>> I think I can count at least 4 people on this thread who literally have lost 
>> sleep over this.
> 
> Probably good examples of not being the majority though, heh.
> 
> If we are counting on users to read NEWS.txt, can we not count on them
> to enable rejection if this is important to them?
> 
> Kind Regards,
> Brandon



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 1:13 PM Caleb Rackliffe
 wrote:
>
> I think I can count at least 4 people on this thread who literally have lost 
> sleep over this.

Probably good examples of not being the majority though, heh.

If we are counting on users to read NEWS.txt, can we not count on them
to enable rejection if this is important to them?

Kind Regards,
Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Abe Ratnofsky
Expressing another vote in favor of rejection-by-default. If a user doesn't 
want to lose sleep for data loss while on-call, they can read NEWS.txt and 
disable rejection.
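
[For operators who do read NEWS.txt and want to opt out, the knobs would live in cassandra.yaml. The option names below are an assumption for illustration only — verify the exact names and defaults against the NEWS.txt entry and the cassandra.yaml shipped with your release:]

```
# Hypothetical cassandra.yaml fragment (names illustrative, verify against
# your release): keep visibility into mislocated requests, drop enforcement.
log_out_of_token_range_requests: true
reject_out_of_token_range_requests: false
```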



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Cheng Wang via dev
> If we don’t reject by default, but log by default, my fear is that we’ll
> simply be alerting the operator to something that has already gone very
> wrong that they may not be in any position to ever address.
Yes, logging and alerting is not enough here. We have seen the same issue
before that we have to rely on automation immediately instead of paging
oncalls to react.

On Thu, Sep 12, 2024 at 11:07 AM Caleb Rackliffe 
wrote:

> If we don’t reject by default, but log by default, my fear is that we’ll
> simply be alerting the operator to something that has already gone very
> wrong that they may not be in any position to ever address.
>
> On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:
>
> 
> I’m +1 on enabling rejection by default on all branches. We have been bit
> by silent data loss (due to other bugs like the schema issues in 4.1) from
> lack of rejection on several occasions and short of writing extremely
> specialized tooling its unrecoverable. While both lack of availability and
> data loss are critical, I will always pick lack of availability over data
> loss. Its better to fail a write that will be lost than silently lose it.
>
> Of course, a change like this requires very good communication in NEWS.txt
> and elsewhere but I think its well worth it. While it may surprise some
> users I think they would be more surprised that they were silently losing
> data.
>
> Jordan
>
> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>
>> Thanks for starting the thread Caleb, it is a big and impacting patch.
>>
>> Appreciate the criticality, in a new major release rejection by default
>> is obvious.   Otherwise the logging and metrics is an important addition to
>> help users validate the existence and degree of any problem.
>>
>> Also worth mentioning that rejecting writes can cause degraded
>> availability in situations that pose no problem.  This is a coordination
>> problem on a probabilistic design, it's choose your evil: unnecessary
>> degraded availability or mislocated data (eventual data loss).   Logging
>> and metrics makes alerting on and handling the data mislocation possible,
>> i.e. avoids data loss with manual intervention.  (Logging and metrics also
>> face the same problem with false positives.)
>>
>> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
>> 4.x
>>
>>
>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>>
>>> This patch is so hard for me.
>>>
>>> The safety it adds is critical and should have been added a decade ago.
>>> Also it’s a huge patch, and touches “everything”.
>>>
>>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>>>
>>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
>>> loss (which it implicitly is), I guess?
>>>
>>>
>>>
>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams 
>>> wrote:
>>> >
>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>>> >  wrote:
>>> >>
>>> >> Are you opposed to the patch in its entirety, or just rejecting
>>> unsafe operations by default?
>>> >
>>> > I had the latter in mind.  Changing any default in a patch release is
>>> > a potential surprise for operators and one of this nature especially
>>> > so.
>>> >
>>> > Kind Regards,
>>> > Brandon
>>>
>>>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
I think I can count at least 4 people on this thread who literally have lost 
sleep over this.

> On Sep 12, 2024, at 1:07 PM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 11:52 AM Josh McKenzie  
> wrote:
>> 
>> More or less surprising than learning that they've been at risk of or 
>> actively losing data for years?
> 
> Ignorance is bliss though and I think the majority of users haven't
> lost sleep over this since it's been present since the beginning of
> time.  Changing behavior in old stable branches can interrupt that,
> especially if a regression is introduced, which is always a
> possibility.  Logging will provide awareness and I think that's going
> far enough on our side, and users who are concerned can go further on
> theirs.
> 
> Kind Regards,
> Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jon Haddad
I have worked with teams that have lost weeks of sleep, and customers, due
to these issues.  It cost a fortune 500 company that I work with millions
in revenue.  I think ignorance is bliss may also apply to us, if we are
unaware of the number of times C* users have been bit by this issue but not
raised a big stink on the ML.

I lean towards applying the patch to all branches.

Jon


On Thu, Sep 12, 2024 at 11:07 AM Brandon Williams  wrote:

> On Thu, Sep 12, 2024 at 11:52 AM Josh McKenzie 
> wrote:
> >
> > More or less surprising than learning that they've been at risk of or
> actively losing data for years?
>
> Ignorance is bliss though and I think the majority of users haven't
> lost sleep over this since it's been present since the beginning of
> time.  Changing behavior in old stable branches can interrupt that,
> especially if a regression is introduced, which is always a
> possibility.  Logging will provide awareness and I think that's going
> far enough on our side, and users who are concerned can go further on
> theirs.
>
> Kind Regards,
> Brandon
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:52 AM Josh McKenzie  wrote:
>
> More or less surprising than learning that they've been at risk of or 
> actively losing data for years?

Ignorance is bliss though and I think the majority of users haven't
lost sleep over this since it's been present since the beginning of
time.  Changing behavior in old stable branches can interrupt that,
especially if a regression is introduced, which is always a
possibility.  Logging will provide awareness and I think that's going
far enough on our side, and users who are concerned can go further on
theirs.

Kind Regards,
Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
If we don’t reject by default, but log by default, my fear is that we’ll
simply be alerting the operator to something that has already gone very
wrong that they may not be in any position to ever address.

On Sep 12, 2024, at 12:44 PM, Jordan West  wrote:

I’m +1 on enabling rejection by default on all branches. We have been bit
by silent data loss (due to other bugs like the schema issues in 4.1) from
lack of rejection on several occasions and short of writing extremely
specialized tooling its unrecoverable. While both lack of availability and
data loss are critical, I will always pick lack of availability over data
loss. Its better to fail a write that will be lost than silently lose it.

Of course, a change like this requires very good communication in NEWS.txt
and elsewhere but I think its well worth it. While it may surprise some
users I think they would be more surprised that they were silently losing
data.

Jordan

On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:

Thanks for starting the thread Caleb, it is a big and impacting patch.

Appreciate the criticality, in a new major release rejection by default is
obvious.   Otherwise the logging and metrics is an important addition to
help users validate the existence and degree of any problem.

Also worth mentioning that rejecting writes can cause degraded availability
in situations that pose no problem.  This is a coordination problem on a
probabilistic design, it's choose your evil: unnecessary degraded
availability or mislocated data (eventual data loss).   Logging and metrics
makes alerting on and handling the data mislocation possible, i.e. avoids
data loss with manual intervention.  (Logging and metrics also face the
same problem with false positives.)

I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
4.x

On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:

This patch is so hard for me.

The safety it adds is critical and should have been added a decade ago.
Also it’s a huge patch, and touches “everything”. 

It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  

4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss (which it implicitly is), I guess?



> On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>  wrote:
>> 
>> Are you opposed to the patch in its entirety, or just rejecting unsafe operations by default?
> 
> I had the latter in mind.  Changing any default in a patch release is
> a potential surprise for operators and one of this nature especially
> so.
> 
> Kind Regards,
> Brandon





Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Cheng Wang via dev
I am +1 with enabling rejection by default. We had encountered similar
situations before that we lost data in silence, which made us create a
patch to trade availability with data loss.
While I agree that it might be a surprise to operators, I think it's worth
having good communication in the NEWS.txt and logging the exceptions
explicitly. That said, we might create a one-time surprise instead of
losing data over and over silently.

On Thu, Sep 12, 2024 at 10:44 AM Jordan West  wrote:

> I’m +1 on enabling rejection by default on all branches. We have been bit
> by silent data loss (due to other bugs like the schema issues in 4.1) from
> lack of rejection on several occasions and short of writing extremely
> specialized tooling its unrecoverable. While both lack of availability and
> data loss are critical, I will always pick lack of availability over data
> loss. Its better to fail a write that will be lost than silently lose it.
>
> Of course, a change like this requires very good communication in NEWS.txt
> and elsewhere but I think its well worth it. While it may surprise some
> users I think they would be more surprised that they were silently losing
> data.
>
> Jordan
>
> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:
>
>> Thanks for starting the thread Caleb, it is a big and impacting patch.
>>
>> Appreciate the criticality, in a new major release rejection by default
>> is obvious.   Otherwise the logging and metrics is an important addition to
>> help users validate the existence and degree of any problem.
>>
>> Also worth mentioning that rejecting writes can cause degraded
>> availability in situations that pose no problem.  This is a coordination
>> problem on a probabilistic design, it's choose your evil: unnecessary
>> degraded availability or mislocated data (eventual data loss).   Logging
>> and metrics makes alerting on and handling the data mislocation possible,
>> i.e. avoids data loss with manual intervention.  (Logging and metrics also
>> face the same problem with false positives.)
>>
>> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
>> 4.x
>>
>>
>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>>
>>> This patch is so hard for me.
>>>
>>> The safety it adds is critical and should have been added a decade ago.
>>> Also it’s a huge patch, and touches “everything”.
>>>
>>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>>>
>>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
>>> loss (which it implicitly is), I guess?
>>>
>>>
>>>
>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams 
>>> wrote:
>>> >
>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>>> >  wrote:
>>> >>
>>> >> Are you opposed to the patch in its entirety, or just rejecting
>>> unsafe operations by default?
>>> >
>>> > I had the latter in mind.  Changing any default in a patch release is
>>> > a potential surprise for operators and one of this nature especially
>>> > so.
>>> >
>>> > Kind Regards,
>>> > Brandon
>>>
>>>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jordan West
I’m +1 on enabling rejection by default on all branches. We have been bit
by silent data loss (due to other bugs like the schema issues in 4.1) from
lack of rejection on several occasions and short of writing extremely
specialized tooling its unrecoverable. While both lack of availability and
data loss are critical, I will always pick lack of availability over data
loss. Its better to fail a write that will be lost than silently lose it.

Of course, a change like this requires very good communication in NEWS.txt
and elsewhere but I think its well worth it. While it may surprise some
users I think they would be more surprised that they were silently losing
data.

Jordan

On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever  wrote:

> Thanks for starting the thread Caleb, it is a big and impacting patch.
>
> Appreciate the criticality, in a new major release rejection by default is
> obvious.   Otherwise the logging and metrics is an important addition to
> help users validate the existence and degree of any problem.
>
> Also worth mentioning that rejecting writes can cause degraded
> availability in situations that pose no problem.  This is a coordination
> problem on a probabilistic design, it's choose your evil: unnecessary
> degraded availability or mislocated data (eventual data loss).   Logging
> and metrics makes alerting on and handling the data mislocation possible,
> i.e. avoids data loss with manual intervention.  (Logging and metrics also
> face the same problem with false positives.)
>
> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
> 4.x
>
>
> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:
>
>> This patch is so hard for me.
>>
>> The safety it adds is critical and should have been added a decade ago.
>> Also it’s a huge patch, and touches “everything”.
>>
>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>>
>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
>> loss (which it implicitly is), I guess?
>>
>>
>>
>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
>> >
>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>> >  wrote:
>> >>
>> >> Are you opposed to the patch in its entirety, or just rejecting unsafe
>> operations by default?
>> >
>> > I had the latter in mind.  Changing any default in a patch release is
>> > a potential surprise for operators and one of this nature especially
>> > so.
>> >
>> > Kind Regards,
>> > Brandon
>>
>>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Mick Semb Wever
Thanks for starting the thread Caleb, it is a big and impacting patch.

Appreciate the criticality, in a new major release rejection by default is
obvious.   Otherwise the logging and metrics is an important addition to
help users validate the existence and degree of any problem.

Also worth mentioning that rejecting writes can cause degraded availability
in situations that pose no problem.  This is a coordination problem on a
probabilistic design, so it's a choice between evils: unnecessary degraded
availability or mislocated data (eventual data loss).  Logging and metrics
make alerting on and handling the data mislocation possible, i.e. avoid
data loss with manual intervention.  (Logging and metrics also face the
same problem with false positives.)

I'm +0 for rejection default in 5.0.1, and +1 for only logging default in
4.x


On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa  wrote:

> This patch is so hard for me.
>
> The safety it adds is critical and should have been added a decade ago.
> Also it’s a huge patch, and touches “everything”.
>
> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.
>
> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data
> loss (which it implicitly is), I guess?
>
>
>
> > On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> >
> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> >  wrote:
> >>
> >> Are you opposed to the patch in its entirety, or just rejecting unsafe
> operations by default?
> >
> > I had the latter in mind.  Changing any default in a patch release is
> > a potential surprise for operators and one of this nature especially
> > so.
> >
> > Kind Regards,
> > Brandon
>
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss 
> (which it implicitly is), I guess?
If we have known data loss scenarios we should fix them in all supported 
branches even if that fix can potentially modify user-facing behavior. We 
should definitely try and prioritize fixing issues w/out changing behavior of 
course, but if it's zero sum between risk of data loss and risk of user-facing 
disruptive behavior, I'll choose the 2nd every time.

Size of the patch shouldn't really be a make or break on that position, though 
something to consider in terms of risk it introduces, other things we may need 
to touch / test / qualify, extent of fuzzing and potential additional 
property-based testing for subsystems impacted, etc.

On Thu, Sep 12, 2024, at 12:55 PM, Jeff Jirsa wrote:
> This patch is so hard for me. 
> 
> The safety it adds is critical and should have been added a decade ago.
> Also it’s a huge patch, and touches “everything”. 
> 
> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  
> 
> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss 
> (which it implicitly is), I guess?
> 
> 
> 
> > On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> > 
> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> >  wrote:
> >> 
> >> Are you opposed to the patch in its entirety, or just rejecting unsafe 
> >> operations by default?
> > 
> > I had the latter in mind.  Changing any default in a patch release is
> > a potential surprise for operators and one of this nature especially
> > so.
> > 
> > Kind Regards,
> > Brandon
> 
> 


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Jeff Jirsa
This patch is so hard for me. 

The safety it adds is critical and should have been added a decade ago.
Also it’s a huge patch, and touches “everything”. 

It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  

4.0 / 4.1 - if we treat this like a fix for latent opportunity for data loss 
(which it implicitly is), I guess?



> On Sep 12, 2024, at 9:46 AM, Brandon Williams  wrote:
> 
> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>  wrote:
>> 
>> Are you opposed to the patch in its entirety, or just rejecting unsafe 
>> operations by default?
> 
> I had the latter in mind.  Changing any default in a patch release is
> a potential surprise for operators and one of this nature especially
> so.
> 
> Kind Regards,
> Brandon



Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Josh McKenzie
More or less surprising than learning that they've been at risk of or actively 
losing data for years?

On Thu, Sep 12, 2024, at 12:46 PM, Brandon Williams wrote:
> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>  wrote:
> >
> > Are you opposed to the patch in its entirety, or just rejecting unsafe 
> > operations by default?
> 
> I had the latter in mind.  Changing any default in a patch release is
> a potential surprise for operators and one of this nature especially
> so.
> 
> Kind Regards,
> Brandon
> 


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
Potentially losing thousands of records while cluster metadata is changing
is also a surprise, and one that comes with no explanation. Which is worse?

I know that CASSANDRA-12126 was about correctness vs performance, not about
correctness vs availability, but even that contrast isn't very honest. Being
available isn't productive if we're not correct.

On Thu, Sep 12, 2024 at 11:46 AM Brandon Williams  wrote:

> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>  wrote:
> >
> > Are you opposed to the patch in its entirety, or just rejecting unsafe
> operations by default?
>
> I had the latter in mind.  Changing any default in a patch release is
> a potential surprise for operators and one of this nature especially
> so.
>
> Kind Regards,
> Brandon
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
 wrote:
>
> Are you opposed to the patch in its entirety, or just rejecting unsafe 
> operations by default?

I had the latter in mind.  Changing any default in a patch release is
a potential surprise for operators and one of this nature especially
so.

Kind Regards,
Brandon


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Caleb Rackliffe
> I don't think this should be done in a patch release.

Are you opposed to the patch in its entirety, or just rejecting unsafe
operations by default?

On Thu, Sep 12, 2024 at 11:37 AM Chris Lohfink  wrote:

> While the code touches quite a few places, the change itself is
> pretty innocuous but is massively impactful in bad scenarios. I am in
> favor of this patch myself, as it protects the database from data loss
> that occurs in many different ways. An example I have seen recently (in
> 4.1): when using GPFS, there are some sharp edges around nodes having the
> wrong view of the cluster for short periods of time while bootstrapping,
> which this patch would have prevented.
>
> Chris
>
> On Thu, Sep 12, 2024 at 11:16 AM Brandon Williams 
> wrote:
>
>> On Thu, Sep 12, 2024 at 11:07 AM Caleb Rackliffe
>>  wrote:
>> >
>> > The one consequence of that we might discuss here is that if gossip is
>> behind in notifying a node with a pending range, local rejection as it
>> receives writes for that range may cause a small issue of availability.
>>
>> I don't think this should be done in a patch release.
>>
>> Kind Regards,
>> Brandon
>>
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Chris Lohfink
While the code touches quite a few places, the change itself is
pretty innocuous but is massively impactful in bad scenarios. I am in favor
of this patch myself, as it protects the database from data loss that
occurs in many different ways. An example I have seen recently (in 4.1):
when using GPFS, there are some sharp edges around nodes having the wrong
view of the cluster for short periods of time while bootstrapping, which
this patch would have prevented.

Chris

On Thu, Sep 12, 2024 at 11:16 AM Brandon Williams  wrote:

> On Thu, Sep 12, 2024 at 11:07 AM Caleb Rackliffe
>  wrote:
> >
> > The one consequence of that we might discuss here is that if gossip is
> behind in notifying a node with a pending range, local rejection as it
> receives writes for that range may cause a small issue of availability.
>
> I don't think this should be done in a patch release.
>
> Kind Regards,
> Brandon
>


Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

2024-09-12 Thread Brandon Williams
On Thu, Sep 12, 2024 at 11:07 AM Caleb Rackliffe
 wrote:
>
> The one consequence of that we might discuss here is that if gossip is behind 
> in notifying a node with a pending range, local rejection as it receives 
> writes for that range may cause a small issue of availability.

I don't think this should be done in a patch release.

Kind Regards,
Brandon