Thank you, Oleksandr.
I've just tested this on 3.11.1 and it worked for me (see the logs below).
I've just realized that there is one important prerequisite for this method to
work: the new node MUST be located in the same rack (in C* terms) as the old
one. Otherwise the correct replica placement order will be violated (replicas
of the same token range are supposed to be placed in different racks).
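For example, with GossipingPropertyFileSnitch the new node's rack comes from
cassandra-rackdc.properties, and it has to match the rack reported for the old
node by nodetool status. A rough check (the path and values below are
illustrative):

  # rack of the old node, last column of nodetool status:
  $ nodetool status | grep 10.10.10.223
  # rack configured on the new node -- must be the same:
  $ grep -E '^(dc|rack)=' /etc/cassandra/cassandra-rackdc.properties
  dc=dc1
  rack=rack1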
Anyway, even after a successful node replacement run in my sandbox, I still
have doubts.
I'm just wondering why this procedure, which seems much easier than
[add/remove node] or [replace a node] (the documented ways to replace a live
node), has never been included in the documentation.
Does anybody on the ML know the reason for this?
Also, for some reason, in his article Carlos deletes the files of the system
keyspace (which contains the system.local table):
In the new node, delete all system tables except for the schema ones. This will
ensure that the new Cassandra node will not have any corrupt or previous
configuration assigned.
1. sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I {} sudo rm -rf {}
http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
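(As an aside, that command won't run as written: "sudo cd" fails because cd is
a shell builtin, not an executable, and the directory change would not persist
anyway. A working equivalent, assuming the default data path, would be
something like:

  sudo sh -c 'cd /var/lib/cassandra/data/system && ls | grep -v schema | xargs -I {} rm -rf {}'
)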
[Carlos, if you are here, could you please comment?]
So it's still a mystery to me...
-----
Logs for 3.11.1
-----
====== Before:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<
UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
====== After:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<
====== Logs from another node (10.10.10.221):
INFO [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
INFO [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node /10.10.10.224 is now part of the cluster
INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
WARN [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and /10.10.10.224; /10.10.10.224 is the new owner
INFO [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient /10.10.10.223 has been silent for 30000ms, removing from gossip
====== Logs from new node:
INFO [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish joining ring
INFO [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node /10.10.10.223 is now part of the cluster
WARN [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because it's mine
INFO [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508. Ignoring /10.10.10.223
INFO [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305. Ignoring /10.10.10.223
INFO [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 51879124242594885. Ignoring /10.10.10.223
WARN [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
INFO [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
WARN [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
INFO [GossipStage:1] 2018-02-03 11:33:05,718 Gossiper.java:1046 - InetAddress /10.10.10.223 is now DOWN
INFO [main] 2018-02-03 11:33:06,872 StorageService.java:2268 - Node /10.10.10.224 state jump to NORMAL
INFO [main] 2018-02-03 11:33:06,998 Gossiper.java:1655 - Waiting for gossip to settle...
INFO [main] 2018-02-03 11:33:15,004 Gossiper.java:1686 - No gossip backlog; proceeding
INFO [GossipTasks:1] 2018-02-03 11:33:20,114 Gossiper.java:1046 - InetAddress /10.10.10.222 is now DOWN <<<<< have no idea why this appeared in logs
INFO [main] 2018-02-03 11:33:20,566 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [HANDSHAKE-/10.10.10.222] 2018-02-03 11:33:20,714 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.222
Kind Regards,
Kyrill
________________________________
From: Oleksandr Shulgin <[email protected]>
Sent: Saturday, February 3, 2018 10:44:26 AM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming
On 3 Feb 2018 08:49, "Jürgen Albersdorfer" <[email protected]> wrote:
Cool, good to know. Do you know if this is still true for 3.11.1?
Well, I've never tried with that specific version, but this is pretty
fundamental, so I would expect it to work the same way. Test in isolation if
you want to be sure, though.
I don't think this is documented anywhere, though. I had the same doubts
myself before I saw it work for the first time.
--
Alex
On 03.02.2018 at 08:19, Oleksandr Shulgin <[email protected]> wrote:
On 3 Feb 2018 02:42, "Kyrylo Lebediev" <[email protected]> wrote:
Thanks, Oleksandr,
In my case I'll need to replace all nodes in the cluster (one by one), so
streaming would introduce noticeable overhead.
My question is not about the data movement/copying itself, but more about all
this token magic.
Okay, let's say we stopped the old node and moved its data to the new node.
Once it's started with auto_bootstrap=false, it will be added to the cluster
like a regular node, just skipping the streaming stage, right?
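(To be explicit, I mean starting the new node with this line added to
cassandra.yaml -- the option is absent from the default config and defaults to
true:

  auto_bootstrap: false
)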
For a cluster with vnodes enabled, the token ranges of a newly added node are
calculated automatically by C* on startup.
So how will C* know that this new node must be responsible for exactly the
same token ranges the old node was?
How would the rest of the nodes in the cluster ('peers') figure out that the
old node should be replaced in the ring by the new one?
Do you know of any limitations in this process for C* 2.1.x with vnodes
enabled?
A node stores its tokens and host id in the system.local table. The next time
it starts up, it will use the same tokens as before, and the host id allows
the rest of the cluster to see that it is the same node and to ignore the IP
address change. This happens regardless of the auto_bootstrap setting.
Try "select * from system.local" to see what is recorded for the old node. When
the new node starts up it should log "Using saved tokens" with the list of
numbers. Other nodes should log something like "ignoring IP address change" for
the affected node addresses.
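For instance, a minimal check on the old node (assuming cqlsh can connect
locally; "tokens" is the set of the node's vnode tokens):

  $ cqlsh -e "SELECT host_id, tokens FROM system.local"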
Be careful, though, to make sure that you put the data directory exactly where
the new node expects to find it: otherwise it may just join as a brand-new
node, allocating new tokens. As a precaution, it helps to ensure that the
system user running the Cassandra process has no permission to create the data
directory: this should stop startup in case of misconfiguration.
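A rough sketch of that precaution, assuming the default /var/lib/cassandra
layout and a "cassandra" system user:

  # Keep the parent directory owned by root and non-writable for the
  # cassandra user: the existing data subdirectories stay usable, but a
  # missing or misplaced data directory cannot be silently re-created.
  sudo chown root:root /var/lib/cassandra
  sudo chmod 755 /var/lib/cassandra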
Cheers,
--
Alex