Questions regarding Cassandra 4 and Cassandra 4.1
Hi all,

Earlier this year, we upgraded our fleet from C* 3.0 to C* 4.0. Given the exciting new features in C* 4.1, we are contemplating an upgrade from C* 4.0 to C* 4.1. Can anyone share their experience regarding the stability of C* 4.1? Are any of you running C* 4.1 at scale?

Additionally, I have a query about repair procedures. Due to the known instability of incremental repair in C* 3.0, we've consistently opted for full repairs on all our clusters. With the advancements in C* 4.0 regarding incremental repair, has its stability improved? Which repair method are you currently using: full or incremental?

Thanks,
Runtian
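For anyone weighing the two approaches on 4.x: incremental repair only revisits data not yet marked repaired, while a full repair re-validates everything. A toy model of that difference (illustrative Python, not Cassandra code):

```python
# Toy model of full vs. incremental repair scope (not Cassandra code).
# Incremental repair only validates sstables not yet marked repaired;
# a full repair re-validates all sstables regardless of that flag.

def repair(sstables, incremental):
    """Return the sstables this repair run would validate, marking them repaired."""
    scope = [s for s in sstables if not s["repaired"]] if incremental else list(sstables)
    for s in scope:
        s["repaired"] = True  # incremental repair persists this in sstable metadata
    return scope

tables = [{"name": f"sstable-{i}", "repaired": False} for i in range(4)]
first = repair(tables, incremental=True)    # validates all 4 (nothing repaired yet)
tables.append({"name": "sstable-4", "repaired": False})
second = repair(tables, incremental=True)   # validates only the 1 new sstable
full = repair(tables, incremental=False)    # a full repair revisits all 5
print(len(first), len(second), len(full))   # 4 1 5
```

On 4.0+, `nodetool repair` defaults to incremental, and `nodetool repair -full` forces a full repair; the model above only illustrates why incremental runs get cheaper over time.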
4.0 upgrade
Hi,

We are upgrading our Cassandra clusters from 3.0.27 to 4.0.6, and we observed some errors related to repair:

j.l.IllegalArgumentException: Unknown verb id 32

We have two datacenters for each Cassandra cluster. When we do an upgrade, we want to upgrade one datacenter first and monitor the upgraded datacenter for some time (1 week) to make sure there is no issue; then we will upgrade the second datacenter for that cluster. We have some automated repair jobs running. Is it expected that repairs get stuck if we have one datacenter on 4.0 and one datacenter on 3.0?

Do you have any suggestions on how we should do the upgrade? Is waiting 1 week between the two datacenters too long?

Thanks,
Runtian
Re: Replacing node without shutting down the old node
cool, thank you. This looks like a very good setup for us and cleanup should be very fast for this case.

On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa wrote:

> In-line
>
> On May 15, 2023, at 5:26 PM, Runtian Liu wrote:
>
>> Hi Jeff,
>> I tried the setup with vnode 16 and NetworkTopologyStrategy replication
>> strategy with replication factor 3 with 3 racks in one cluster. When using
>> the new node token as the old node token - 1
>
> I had said +1, but you're right that it's actually -1, sorry about that.
> You want the new node to be lower than the existing host. The lower token
> will take most of the data.
>
>> I see the new node is streaming from the old node only. And the decom
>> phase of the old node is extremely fast. Does this mean the new node will
>> only take data ownership from the old node?
>
> With exactly three racks, yes. With more racks or fewer racks, no.
>
>> I also did some cleanups after replacing the node with old token - 1 and
>> the cleanup sstable count was not increasing. Looks like adding a node
>> with old_token - 1 and decommissioning the old node will not generate
>> stale data on the rest of the cluster. Do you know if there are any edge
>> cases in this replacement process that can generate stale data on other
>> nodes of the cluster with the setup I mentioned?
>
> Should do exactly what you want. I'd still run cleanup, but it should be a
> no-op.
>
>> Thanks,
>> Runtian
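The ownership math in the thread above can be sketched with a toy single-token ring (illustrative Python; it ignores vnodes, racks, and replication, which is why real behavior depends on the rack count as Jeff notes, but it shows the old_token - 1 intuition):

```python
# Toy single-token ring showing why adding a node at old_token - 1 takes
# over almost all of the old node's primary range: a key is owned by the
# first node whose token is >= the key's token (wrapping around the ring).
# Ignores vnodes, racks, and replication entirely.
import bisect

def primary_owner(key_token, ring):
    """ring: sorted list of (token, node) pairs. Returns the owning node."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, key_token)
    return ring[i % len(ring)][1]  # wrap around past the last token

ring = [(100, "A"), (200, "B"), (300, "C")]
# B primarily owns the token range (100, 200].
new_ring = sorted(ring + [(199, "NEW")])  # new node at B's token - 1
moved = [k for k in range(101, 201) if primary_owner(k, new_ring) == "NEW"]
stays = [k for k in range(101, 201) if primary_owner(k, new_ring) == "B"]
print(len(moved), len(stays))  # 99 1 -- NEW takes (100, 199]; B keeps only 200
```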
Re: Replacing node without shutting down the old node
Hi Jeff,

I tried the setup with vnode 16 and NetworkTopologyStrategy replication strategy with replication factor 3 and 3 racks in one cluster. When using the new node token as the old node token - 1, I see the new node is streaming from the old node only, and the decom phase of the old node is extremely fast. Does this mean the new node will only take data ownership from the old node?

I also did some cleanups after replacing the node with old token - 1, and the cleanup sstable count was not increasing. It looks like adding a node with old_token - 1 and decommissioning the old node will not generate stale data on the rest of the cluster. Do you know if there are any edge cases in this replacement process that can generate stale data on other nodes of the cluster with the setup I mentioned?

Thanks,
Runtian
Re: Replacing node without shutting down the old node
I thought the joining node would not participate in quorum? How are we counting things like how many replicas ACK a write when we are adding a new node for expansion? The token ownership won't change until the new node is fully joined, right?

On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa wrote:

> You can't have two nodes with the same token (in the current metadata
> implementation) - it causes problems counting things like how many replicas
> ACK a write, and what happens if the one you're replacing ACKs a write but
> the joining host doesn't? It's harder than it seems to maintain consistency
> guarantees in that model, because you have 2 nodes where either may end up
> becoming the sole true owner of the token, and you have to handle both
> cases where one of them fails.
>
> An easier option is to add it with the new token set to old token +1 (as an
> expansion), then decom the leaving node (shrink). That'll minimize
> streaming when you decommission that node.
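Jeff's ack-counting point can be sketched as a toy model (not Cassandra internals; the names and tokens are illustrative): a quorum has to count acks from distinct token ranges, and two nodes claiming the same token collapse to one range, so counting them as two replicas would wrongly report quorum.

```python
# Toy model of quorum ack counting (not Cassandra internals). With RF=3,
# a QUORUM/LOCAL_QUORUM write needs acks from 2 distinct replicas. If two
# nodes claimed the same token, their acks cover the same range and must
# not be double-counted -- which is the ambiguity Jeff describes.

def quorum(rf):
    """Number of replica acks needed for a quorum at the given RF."""
    return rf // 2 + 1

def distinct_range_acks(acks):
    """acks: set of (node, token) pairs. Safe counting is per token range,
    not per node -- two nodes on the same token are one range."""
    return len({token for _node, token in acks})

assert quorum(3) == 2
# Normal ring: acks from replicas of two different ranges -> a real quorum.
assert distinct_range_acks({("n1", 100), ("n2", 200)}) >= quorum(3)
# Duplicate token: the old node and the joining node both ack, but that is
# only one range's worth of data -- naive per-node counting would say 2.
print(distinct_range_acks({("old", 100), ("new", 100)}))  # 1, not a quorum
```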
Replacing node without shutting down the old node
Hi all,

Sometimes we want to replace a node for various reasons. We can replace a node by shutting down the old node and letting the new node stream data from other replicas, but this approach may have availability or data consistency issues if one more node in the same cluster goes down. Why doesn't Cassandra support replacing a node without shutting down the old one? Can we treat the new node as a normal node addition while giving it exactly the same token ranges as the node to be replaced? After the new node's joining process is complete, we just need to cut off the old node. With this, we don't lose any availability, and the token ranges don't move, so no cleanup is needed. Is there any downside to doing this?

Thanks,
Runtian
Re: Is cleanup required if cluster topology changes
We are doing "adding a node, then decommissioning a node" to achieve better availability. Replacing a node requires shutting one node down first; if another node goes down during the replacement period, we will see an availability drop, because most of our use cases are LOCAL_QUORUM with replication factor 3.

On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <user@cassandra.apache.org> wrote:

> Have you thought of using "-Dcassandra.replace_address_first_boot=..." (or
> "-Dcassandra.replace_address=..." if you are using an older version)? This
> will not result in a topology change, which means "nodetool cleanup" is not
> needed after the operation is completed.
>
> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>
>> Thanks, Jeff!
>> But in our environment we replace nodes quite often for various
>> optimization purposes, say almost 1 node per day (node addition followed
>> by node decommission, which of course changes the topology), and we have
>> a cluster of 100 nodes with 300 GB per node. If we have to run cleanup on
>> 100 nodes after every replacement, then it could take forever.
>> What is the recommendation until we get this fixed in Cassandra itself as
>> part of compaction (without externally triggering cleanup)?
>>
>> Jaydeep
>>
>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa wrote:
>>
>>> Cleanup is fast and cheap and basically a no-op if you haven't changed
>>> the ring.
>>>
>>> After Cassandra has transactional cluster metadata to make ring changes
>>> strongly consistent, Cassandra should do this in every compaction. But
>>> until then it's left for operators to run when they're sure the state of
>>> the ring is correct.
>>>
>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia wrote:
>>>
>>>> Isn't this considered a kind of bug in Cassandra? Because, as we know,
>>>> cleanup is a lengthy and unreliable operation, relying on cleanup means
>>>> higher chances of data resurrection.
>>>> Do you think we should discard the unowned token ranges as part of the
>>>> regular compaction itself? What are the pitfalls of doing this as part
>>>> of compaction itself?
>>>>
>>>> Jaydeep
>>>>
>>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell wrote:
>>>>
>>>>> Compaction will just merge duplicate data and remove deleted data on
>>>>> this node. If you add or remove one node from the cluster, I think
>>>>> cleanup is needed. If cleanup failed, I think we should look into the
>>>>> reason.
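The distinction discussed above can be sketched as a toy model (illustrative Python, not Cassandra code): compaction merges rows per partition but keeps every partition present, while cleanup is the only step that filters by the node's owned ranges.

```python
# Toy model of why normal compaction does not drop data the node no
# longer owns (not Cassandra code): compaction merges sstables per key,
# while "nodetool cleanup" rewrites sstables keeping only owned keys.

def compact(sstables):
    """Merge sstables oldest-to-newest, keeping the newest value per key.
    (Tombstone purging and other details are omitted for brevity.)"""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return merged

def cleanup(sstable, owned_keys):
    """Drop partitions that fall outside the node's owned ranges."""
    return {k: v for k, v in sstable.items() if k in owned_keys}

sstables = [{"k1": "a", "k9": "x"}, {"k1": "b"}]
merged = compact(sstables)
assert merged == {"k1": "b", "k9": "x"}   # k9 survives compaction untouched
# After a topology change, this node no longer owns k9; only cleanup drops it,
# which is why skipping cleanup can leave resurrectable data behind.
owned = {"k1"}
assert cleanup(merged, owned) == {"k1": "b"}
```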
Is cleanup required if cluster topology changes
Hi all,

Is cleanup the sole method to remove data that does not belong to a specific node? In a cluster where nodes are added or decommissioned from time to time, failure to run cleanup may lead to data resurrection issues, as deleted data may remain on a node that lost ownership of certain partitions. Or is it true that normal compactions can also handle data removal for nodes that no longer have ownership of certain data?

Thanks,
Runtian
Cassandra 3.0 upgrade
Hi,

I am running Cassandra version 3.0.14 at scale on thousands of nodes. I am planning to do a minor version upgrade from 3.0.14 to 3.0.26 in a safe manner. My eventual goal is to upgrade from 3.0.26 to the major release 4.0. As you know, there are multiple minor releases between 3.0.14 and 3.0.26, so I am planning to upgrade in 2-3 batches, say: 1) 3.0.14 → 3.0.16, 2) 3.0.16 → 3.0.20, 3) 3.0.20 → 3.0.26. Do you have suggestions or anything that I need to be aware of? Is there any minor release between 3.0.14 and 3.0.26 that is not safe, etc.?

Best regards.
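As a trivial sanity check (illustrative Python; the batch boundaries are the ones from the plan above, and the final 4.0 version is a placeholder), one can verify a staged plan only ever moves forward and saves the major hop for last:

```python
# Trivial sketch to sanity-check a staged upgrade plan: every hop should
# move strictly forward, and all hops except the last should stay within
# the 3.0.x line. Batch boundaries are from the plan above; "4.0.0" is a
# placeholder for whichever 4.0.x release is current at upgrade time.

def parse(version):
    """'3.0.14' -> (3, 0, 14), so tuples compare in version order."""
    return tuple(int(x) for x in version.split("."))

plan = ["3.0.14", "3.0.16", "3.0.20", "3.0.26", "4.0.0"]
hops = list(zip(plan, plan[1:]))
assert all(parse(a) < parse(b) for a, b in hops)        # strictly forward
assert all(a.startswith("3.0.") for a, _ in hops[:-1])  # minor hops first
print(hops[-1])  # ('3.0.26', '4.0.0') -- the only major hop, done last
```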