Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix

2023-10-09 Thread David Capwell
Brandon and I have been talking in CASSANDRA-18913 and here is the current plan; sharing for visibility. There are two bugs: 1) restarting and seeing a shutdown event before gossip has settled creates a partial EndpointState, which leads to failed startup; 2) shutdown corrupts state due

Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix

2023-10-06 Thread David Capwell
> Won't the replacement have a newer generation?
The replacement is a different instance. It performs a shadow round with its seeds, and if they are impacted by this issue then they are missing tokens, so we fail the host replacement… you can work around this by changing the seeds to nodes that

Re: [DISCUSS] Gossip shutdown may corrupt peers making it so the cluster never converges, and a small protocol change to fix

2023-10-06 Thread Brandon Williams
On Fri, Oct 6, 2023 at 5:50 PM David Capwell wrote:
> Let’s say you now need to host replace node1
Won't the replacement have a newer generation?
> avoid peers mutating endpoint states they don’t own
This sounds reasonable to me.
> This would be a protocol change, so would need to make sure