Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

IMO, if they are truly risk-adverse, they should follow the tested and 
proven best practices, instead of doing things in a less tested way 
which is also know to pose a danger to the data correctness.


If they must do this over a long period of time, then they may need to 
temporarily increase the gc_grace_seconds on all tables, and ensure that 
no DDL or repair is run before the upgrade completes. It is unknown 
whether this route is safe, because it's a less tested route to upgrade 
a cluster.


Please be aware that if they do deletes frequently, increasing the 
gc_grace_seconds may cause some reads to fail due to the elevated number 
of tombstones.


Cheers,
Bowen

On 24/04/2024 17:25, Paul Chandler wrote:

Hi Bowen,

Thanks for your quick reply.

Sorry I used the wrong term there, there it is a maintenance window rather than 
an outage. This is a key system and the vital nature of it means that the 
customer is rightly very risk adverse, so we will only even get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always be 
over more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul


On 24 Apr 2024, at 17:02, Bowen Song via user  wrote:

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which 
is the preferred route. It isn't advisable to take down an entire DC to do 
upgrade.

You should aim to complete upgrading the entire cluster and finish a full 
repair within the shortest gc_grace_seconds (default to 10 days) of all tables. 
Failing to do that may cause data resurrections.

During the rolling upgrade, you should not run repair or any DDL query (such as 
ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by 
rack. Stopping all nodes in a single rack and upgrade them concurrently is much 
faster. The number of nodes doesn't matter that much to the time required to 
complete a rolling upgrade, it's the number of DCs and racks matter.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed that during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea, is 
there any other issues to look out for?

Thanks

Paul Chandler


Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Paul Chandler
Hi Bowen,

Thanks for your quick reply. 

Sorry I used the wrong term there, there it is a maintenance window rather than 
an outage. This is a key system and the vital nature of it means that the 
customer is rightly very risk adverse, so we will only even get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always be 
over more than a week. 

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul

> On 24 Apr 2024, at 17:02, Bowen Song via user  
> wrote:
> 
> Hi Paul,
> 
> You don't need to plan for or introduce an outage for a rolling upgrade, 
> which is the preferred route. It isn't advisable to take down an entire DC to 
> do upgrade.
> 
> You should aim to complete upgrading the entire cluster and finish a full 
> repair within the shortest gc_grace_seconds (default to 10 days) of all 
> tables. Failing to do that may cause data resurrections.
> 
> During the rolling upgrade, you should not run repair or any DDL query (such 
> as ALTER TABLE, TRUNCATE, etc.).
> 
> You don't need to do the rolling upgrade node by node. You can do it rack by 
> rack. Stopping all nodes in a single rack and upgrade them concurrently is 
> much faster. The number of nodes doesn't matter that much to the time 
> required to complete a rolling upgrade, it's the number of DCs and racks 
> matter.
> 
> Cheers,
> Bowen
> 
> On 24/04/2024 16:16, Paul Chandler wrote:
>> Hi all,
>> 
>> We have some large clusters ( 1000+  nodes ), these are across multiple 
>> datacenters.
>> 
>> When we perform upgrades we would normally upgrade a DC at a time during a 
>> planned outage for one DC. This means that a cluster might be in a mixed 
>> mode with multiple versions for a week or 2.
>> 
>> We have noticed that during our testing that upgrading to 4.1 causes a 
>> schema mismatch due to the new tables added into the system keyspace.
>> 
>> Is this going to be an issue if this schema mismatch lasts for maybe several 
>> weeks? I assume that running any DDL during that time would be a bad idea, 
>> is there any other issues to look out for?
>> 
>> Thanks
>> 
>> Paul Chandler



Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, 
which is the preferred route. It isn't advisable to take down an entire 
DC to do upgrade.


You should aim to complete upgrading the entire cluster and finish a 
full repair within the shortest gc_grace_seconds (default to 10 days) of 
all tables. Failing to do that may cause data resurrections.


During the rolling upgrade, you should not run repair or any DDL query 
(such as ALTER TABLE, TRUNCATE, etc.).


You don't need to do the rolling upgrade node by node. You can do it 
rack by rack. Stopping all nodes in a single rack and upgrade them 
concurrently is much faster. The number of nodes doesn't matter that 
much to the time required to complete a rolling upgrade, it's the number 
of DCs and racks matter.


Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed that during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea, is 
there any other issues to look out for?

Thanks

Paul Chandler