Because executing “removenode” streamed extra data from live nodes to the “gaining” replica

Oversimplified (if you had one token per node) 

If you start with A B C

Then add D

D should bootstrap a range from each of A, B and C, but at the end, some of the data that was replicated to A, B, C becomes replicated to B, C, D

When you removenode, you tell B and C to send data back to A. 

A, B and C will eventually compact that data away. Eventually. 

If you get around to adding D again, running “cleanup” when you’re done (successfully) will remove a lot of it. 
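
For example (just a sketch, and only worth doing once the new node has been re-added and has fully finished joining), on each of the original nodes, one at a time:

# nodetool cleanup

Cleanup only drops data for token ranges the node no longer owns, so it is safe on a live cluster, but it rewrites SSTables and costs some disk I/O.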



On Apr 3, 2023, at 8:14 PM, David Tinker <david.tin...@gmail.com> wrote:


Looks like the remove has sorted things out. Thanks.

One thing I am wondering about is why the nodes are now carrying a lot more data. The loads were about 2.7T before, now 3.4T. 

# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  xxx.xxx.xxx.105  3.4 TiB   256     100.0%            afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
UN  xxx.xxx.xxx.253  3.34 TiB  256     100.0%            e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
UN  xxx.xxx.xxx.107  3.44 TiB  256     100.0%            ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1


On Mon, Apr 3, 2023 at 5:42 PM Bowen Song via user <user@cassandra.apache.org> wrote:

That's correct. nodetool removenode is strongly preferred when your node is already down. If the node is still functional, use nodetool decommission on the node instead.

On 03/04/2023 16:32, Jeff Jirsa wrote:
FWIW, `nodetool decommission` is strongly preferred. `nodetool removenode` is designed to be run when a host is offline. Only decommission is guaranteed to maintain consistency / correctness, and removenode probably streams a lot more data around than decommission.
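
Concretely: on the node that is leaving, while it is still up and healthy, you run

# nodetool decommission

and only when the node is already dead do you run the following from any other live node (the host ID is the one shown for it in nodetool status):

# nodetool removenode <host-id>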


On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user <user@cassandra.apache.org> wrote:

Using nodetool removenode is strongly preferred in most circumstances; only resort to assassinate if you do not care about data consistency, or you know there won't be any consistency issue (e.g. no new writes and nodetool cleanup has not been run).

Since the size of data on the new node is small, nodetool removenode should finish fairly quickly and bring your cluster back.
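
In this case that would be something like the following, using the host ID shown for the new node in your earlier nodetool status output:

# nodetool removenode c4e8b4a0-f014-45e6-afb4-648aad4f8500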

Next time when you are doing something like this, please test it out in a non-production environment and make sure everything works as expected before moving on to production.


On 03/04/2023 06:28, David Tinker wrote:
Should I use assassinate or removenode, given that there is some data on the node? Or will that data be found on the other nodes? Sorry for all the questions, but I really don't want to mess up.

On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
That's what nodetool assassinate will do.

On Sun, Apr 2, 2023 at 10:19 PM David Tinker <david.tin...@gmail.com> wrote:
Is it possible for me to remove the node from the cluster i.e. to undo this mess and get the cluster operating again?

On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
You can leave it in the seed list of the other nodes, just make sure it's not included in this node's seed list. However, if you do decide to fix the issue with the racks, first assassinate this node (nodetool assassinate <ip>) and update the rack name before you restart. 

On Sun, Apr 2, 2023 at 10:06 PM David Tinker <david.tin...@gmail.com> wrote:
It is also in the seeds list for the other nodes. Should I remove it from those, restart them one at a time, then restart it?

/etc/cassandra # grep -i bootstrap *
doesn't show anything so I don't think I have auto_bootstrap false.

Thanks very much for the help.


On Mon, Apr 3, 2023 at 7:01 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
Just remove it from the seed list in the cassandra.yaml file and restart the node.  Make sure that auto_bootstrap is set to true first though. 
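
Roughly what that looks like in cassandra.yaml on the new node (the IPs are just the three existing nodes from your status output; do not list the new node itself):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "xxx.xxx.xxx.105,xxx.xxx.xxx.253,xxx.xxx.xxx.107"

auto_bootstrap defaults to true when it is not present in the file at all, so it only needs attention if someone explicitly set auto_bootstrap: false.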

On Sun, Apr 2, 2023 at 9:59 PM David Tinker <david.tin...@gmail.com> wrote:
So likely because I made it a seed node when I added it to the cluster it didn't do the bootstrap process. How can I recover this?

On Mon, Apr 3, 2023 at 6:41 AM David Tinker <david.tin...@gmail.com> wrote:
Yes replication factor is 3.

I ran nodetool repair -pr on all the nodes (one at a time) and am still having issues getting data back from queries.

I did make the new node a seed node.

Re "rack4": I assumed that was just an indication as to the physical location of the server for redundancy. This one is separate from the others so I used rack4.

On Mon, Apr 3, 2023 at 6:30 AM Carlos Diaz <crdiaz...@gmail.com> wrote:
I'm assuming that your replication factor is 3.  If that's the case, did you intentionally put this node in rack 4?  Typically, you want to add nodes in multiples of your replication factor in order to keep the "racks" balanced.  In other words, this node should have been added to rack 1, 2 or 3. 

Having said that, you should be able to easily fix your problem by running a nodetool repair -pr on the new node. 

On Sun, Apr 2, 2023 at 8:16 PM David Tinker <david.tin...@gmail.com> wrote:
Hi All

I recently added a node to my 3 node Cassandra 4.0.5 cluster and now many reads are not returning rows! What do I need to do to fix this? There weren't any errors in the logs or other problems that I could see. I expected the cluster to balance itself but this hasn't happened (yet?). The nodes are similar so I have num_tokens=256 for each. I am using the Murmur3Partitioner.

# nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xxx.xxx.xxx.105  2.65 TiB   256     72.9%             afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
UN  xxx.xxx.xxx.253  2.6 TiB    256     73.9%             e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
UN  xxx.xxx.xxx.24   93.82 KiB  256     80.0%             c4e8b4a0-f014-45e6-afb4-648aad4f8500  rack4
UN  xxx.xxx.xxx.107  2.65 TiB   256     73.2%             ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1

# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0          71754         0
Small messages                  n/a         0        8398184        14
Gossip messages                 n/a         0        1303634         0

# nodetool ring
Datacenter: dc1
==========
Address               Rack        Status State   Load            Owns                Token
                                                                                     9189523899826545641
xxx.xxx.xxx.24        rack4       Up     Normal  93.82 KiB       79.95%              -9194674091837769168
xxx.xxx.xxx.107       rack1       Up     Normal  2.65 TiB        73.25%              -9168781258594813088
xxx.xxx.xxx.253       rack2       Up     Normal  2.6 TiB         73.92%              -9163037340977721917
xxx.xxx.xxx.105       rack3       Up     Normal  2.65 TiB        72.88%              -9148860739730046229
xxx.xxx.xxx.107       rack1       Up     Normal  2.65 TiB        73.25%              -9125240034139323535
xxx.xxx.xxx.253       rack2       Up     Normal  2.6 TiB         73.92%              -9112518853051755414
xxx.xxx.xxx.105       rack3       Up     Normal  2.65 TiB        72.88%              -9100516173422432134
...


This is causing a serious production issue. Please help if you can.

Thanks
David


