Re: Cassandra 2.1: replace running node without streaming
On Sat, Feb 3, 2018 at 11:23 AM, Kyrylo Lebediev wrote:

> Just tested on 3.11.1 and it worked for me (you may see the logs below).
>
> Just comprehended that there is one important prerequisite for this method to work: the new node MUST be located in the same rack (in terms of C*) as the old one. Otherwise the correct replica placement order will be violated (I mean that replicas of the same token range should be placed in different racks).

Correct.

> Anyway, even after a successful run of node replacement in a sandbox I'm still in doubt.
>
> Just wondering why this procedure, which seems to be much easier than [add/remove node] or [replace a node], the documented ways for live node replacement, has never been included in the documentation.
>
> Does anybody on the ML know the reason for this?

There are a number of reasons why one would need to replace a node. Losing a disk would be the most frequent one, I guess. In that case using replace_address is the way to go, since it allows you to avoid any ownership changes.

At the same time, on EC2 you might be replacing nodes in order to apply security updates to your base machine image, etc. In this case it is possible to apply the described procedure to migrate the data to the new node. However, given that your nodes are small enough, simply using replace_address seems like a more straightforward way to me.

> Also, for some reason in his article Carlos drops the files of the system keyspace (which contains the system.local table):
>
> "In the new node, delete all system tables except for the schema ones. This will ensure that the new Cassandra node will not have any corrupt or previous configuration assigned."
>
> 1. sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I {} sudo rm -rf {}

Ah, this sounds like a wrong thing to do. That would remove the system.local table, which I expect makes the node forget its tokens.
I wouldn't do that: the node's state on disk should be just like after a normal restart. -- Alex
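For reference, here is what the grep filter from the quoted article actually selects, demonstrated against a mock copy of the system data directory (the directory names below are illustrative only). As an aside, `sudo cd` in the quoted one-liner does not actually work as written; a root shell would be needed. More importantly, as discussed above, the local-* directory this filter selects is the one backing system.local:

```shell
# Build a mock system-keyspace layout (names are illustrative only).
mkdir -p /tmp/mock_system
cd /tmp/mock_system
mkdir -p local-abc123 peers-def456 schema_columns-4567 schema_keyspaces-0123

# The filter from the quoted article: everything NOT matching "schema".
# Note that it selects local-*, the directory backing system.local, so
# running the article's rm over this list makes the node forget its tokens.
ls | grep -v schema
# prints:
# local-abc123
# peers-def456
```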
Re: Cassandra 2.1: replace running node without streaming
Good point about the rack, Kyrill! This makes total sense to me. Deleting the system keyspace does not, though, if it contains all the essential information about the node. Maybe it makes sense in conjunction with the replace_address(_first_boot) option. Some comments from devs about this would be great.

Regards,
Jürgen

> Am 03.02.2018 um 16:42 schrieb Kyrylo Lebediev:
>
> I've found a modified version of Carlos' article (more recent than the one I was referring to), and it contains the same method as you described, Oleksandr:
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement
>
> Thank you for your readiness to help!
>
> Kind Regards,
> Kyrill
>
> From: Kyrylo Lebediev
> Sent: Saturday, February 3, 2018 12:23:15 PM
> To: User
> Subject: Re: Cassandra 2.1: replace running node without streaming
>
> Thank you Oleksandr,
>
> Just tested on 3.11.1 and it worked for me (you may see the logs below).
>
> Just comprehended that there is one important prerequisite for this method to work: the new node MUST be located in the same rack (in terms of C*) as the old one. Otherwise the correct replica placement order will be violated (I mean that replicas of the same token range should be placed in different racks).
>
> Anyway, even after a successful run of node replacement in a sandbox I'm still in doubt. Just wondering why this procedure, which seems to be much easier than [add/remove node] or [replace a node], the documented ways for live node replacement, has never been included in the documentation. Does anybody on the ML know the reason for this?
>
> Also, for some reason in his article Carlos drops the files of the system keyspace (which contains the system.local table):
>
> "In the new node, delete all system tables except for the schema ones. This will ensure that the new Cassandra node will not have any corrupt or previous configuration assigned."
>
> 1. sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I {} sudo rm -rf {}
>
> http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
> [Carlos, if you are here, might you please comment?]
>
> So still a mystery to me.
>
> ---
> Logs for 3.11.1
> ---
>
> == Before:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<
> UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
>
> == After:
> --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
> UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
> UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<
>
> == Logs from another node (10.10.10.221):
> INFO [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
> INFO [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node /10.10.10.224 is now part of the cluster
> INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
> INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
> WARN [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and /10.10.10.224; /10.10.10.224 is the new owner
> INFO [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient /10.10.10.223 has been silent for 3ms, removing from gossip
>
> == Logs from new node:
> INFO [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish joining ring
> INFO [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node /10.10.10.223 is now part of the cluster
> WARN [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because it's mine
> INFO [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508. Ignoring /10.10.10.223
Re: Cassandra 2.1: replace running node without streaming
I've found a modified version of Carlos' article (more recent than the one I was referring to), and it contains the same method as you described, Oleksandr:
https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement

Thank you for your readiness to help!

Kind Regards,
Kyrill

From: Kyrylo Lebediev
Sent: Saturday, February 3, 2018 12:23:15 PM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming

Thank you Oleksandr,

Just tested on 3.11.1 and it worked for me (you may see the logs below).

Just comprehended that there is one important prerequisite for this method to work: the new node MUST be located in the same rack (in terms of C*) as the old one. Otherwise the correct replica placement order will be violated (I mean that replicas of the same token range should be placed in different racks).

Anyway, even after a successful run of node replacement in a sandbox I'm still in doubt. Just wondering why this procedure, which seems to be much easier than [add/remove node] or [replace a node], the documented ways for live node replacement, has never been included in the documentation. Does anybody on the ML know the reason for this?

Also, for some reason in his article Carlos drops the files of the system keyspace (which contains the system.local table):

"In the new node, delete all system tables except for the schema ones. This will ensure that the new Cassandra node will not have any corrupt or previous configuration assigned."

1. sudo cd /var/lib/cassandra/data/system && sudo ls | grep -v schema | xargs -I {} sudo rm -rf {}

http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/
[Carlos, if you are here, might you please comment?]

So still a mystery to me.

---
Logs for 3.11.1
---

== Before:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  256.61 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.223  225.65 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<
UN  10.10.10.221  187.39 KiB  3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1

== After:
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.222  245.84 KiB  3       100.0%            bd504008-5ff0-4b6c-a3a6-a07049e61c31  rack1
UN  10.10.10.221  192.8 KiB   3       100.0%            d312c083-8808-4c98-a3ab-72a7cd18b31f  rack1
UN  10.10.10.224  266.61 KiB  3       100.0%            c562263f-4126-4935-b9f7-f4e7d0dc70b4  rack1  <<<<<<

== Logs from another node (10.10.10.221):
INFO [HANDSHAKE-/10.10.10.224] 2018-02-03 11:33:01,397 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.224
INFO [GossipStage:1] 2018-02-03 11:33:01,431 Gossiper.java:1067 - Node /10.10.10.224 is now part of the cluster
INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
INFO [RequestResponseStage-1] 2018-02-03 11:33:02,190 Gossiper.java:1031 - InetAddress /10.10.10.224 is now UP
WARN [GossipStage:1] 2018-02-03 11:33:08,375 StorageService.java:2313 - Host ID collision for c562263f-4126-4935-b9f7-f4e7d0dc70b4 between /10.10.10.223 and /10.10.10.224; /10.10.10.224 is the new owner
INFO [GossipTasks:1] 2018-02-03 11:33:08,806 Gossiper.java:810 - FatClient /10.10.10.223 has been silent for 3ms, removing from gossip

== Logs from new node:
INFO [main] 2018-02-03 11:33:01,926 StorageService.java:1442 - JOINING: Finish joining ring
INFO [GossipStage:1] 2018-02-03 11:33:02,659 Gossiper.java:1067 - Node /10.10.10.223 is now part of the cluster
WARN [GossipStage:1] 2018-02-03 11:33:02,676 StorageService.java:2307 - Not updating host ID c562263f-4126-4935-b9f7-f4e7d0dc70b4 for /10.10.10.223 because it's mine
INFO [GossipStage:1] 2018-02-03 11:33:02,683 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token -7774421781914237508. Ignoring /10.10.10.223
INFO [GossipStage:1] 2018-02-03 11:33:02,686 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 2257660731441815305. Ignoring /10.10.10.223
INFO [GossipStage:1] 2018-02-03 11:33:02,692 StorageService.java:2365 - Nodes /10.10.10.223 and /10.10.10.224 have the same token 51879124242594885. Ignoring /10.10.10.223
WARN [GossipTasks:1] 2018-02-03 11:33:03,985 Gossiper.java:789 - Gossip stage has 5 pending tasks; skipping status check (no nodes will be marked down)
INFO [main] 2018-02-03 11:33:04,394 SecondaryIndexManager.java:509 - Executing pre-join tasks for: CFS(Keyspace='test', ColumnFamily='usr')
WARN [GossipTasks:1] 2018-02-03 11:33:05,088 Gossiper.java:789 - Gossip stage has 7 pending tasks; skipping status check (no nodes will be marked down)
Re: Cassandra 2.1: replace running node without streaming
Gossiper.java:1046 - InetAddress /10.10.10.222 is now DOWN   <<<<< have no idea why this appeared in the logs
INFO [main] 2018-02-03 11:33:20,566 NativeTransportService.java:70 - Netty using native Epoll event loop
INFO [HANDSHAKE-/10.10.10.222] 2018-02-03 11:33:20,714 OutboundTcpConnection.java:560 - Handshaking version with /10.10.10.222

Kind Regards,
Kyrill

From: Oleksandr Shulgin
Sent: Saturday, February 3, 2018 10:44:26 AM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming

On 3 Feb 2018 08:49, "Jürgen Albersdorfer" <jalbersdor...@gmail.com> wrote:

> Cool, good to know. Do you know whether this is still true for 3.11.1?

Well, I've never tried with that specific version, but this is pretty fundamental, so I would expect it to work the same way. Test in isolation if you want to be sure, though.

I don't think this is documented anywhere, however, since I had the same doubts before seeing it work for the first time.

--
Alex

Am 03.02.2018 um 08:19 schrieb Oleksandr Shulgin <oleksandr.shul...@zalando.de>:

On 3 Feb 2018 02:42, "Kyrylo Lebediev" <kyrylo_lebed...@epam.com> wrote:

> Thanks, Oleksandr,
>
> In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead. My question is not about the data movement/copy itself, but more about all this token magic.
>
> Okay, let's say we stopped the old node and moved the data to the new node. Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right? For a cluster with vnodes enabled, during the addition of a new node its token ranges are calculated automatically by C* on startup.
>
> So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was? How will the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one? Do you know about any limitations of this process in the case of C* 2.1.x with vnodes enabled?

A node stores its tokens and host id in the system.local table. Next time it starts up, it will use the same tokens as previously, and the host id allows the rest of the cluster to see that it is the same node and ignore the IP address change. This happens regardless of the auto_bootstrap setting.

Try "select * from system.local" to see what is recorded for the old node. When the new node starts up it should log "Using saved tokens" with the list of numbers. Other nodes should log something like "ignoring IP address change" for the affected node addresses.

Be careful, though, to make sure that you put the data directory exactly where the new node expects to find it: otherwise it might just join as a brand-new node, allocating new tokens. As a precaution it helps to ensure that the system user running the Cassandra process has no permission to create the data directory: this should stop the startup in case of misconfiguration.

Cheers,
--
Alex
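The permission precaution described above can be sketched as follows. This is a safe-to-run illustration against a mock directory; on a real node the parent would be something like /var/lib/cassandra, owned by root, with the service running as an unprivileged user:

```shell
# Mock layout: PARENT stands in for e.g. /var/lib/cassandra, and
# "data" for the directory the new node must find pre-populated.
PARENT=$(mktemp -d)

# Revoke write permission on the parent so the Cassandra process
# cannot silently create an empty data directory and join the ring
# as a brand-new node with freshly allocated tokens.
chmod a-w "$PARENT"

# For a non-root account, creating the data directory now fails,
# which stops a misconfigured node at startup instead of mid-join.
mkdir "$PARENT/data" 2>/dev/null || echo "mkdir refused: startup would abort here"
```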
Re: Cassandra 2.1: replace running node without streaming
On 3 Feb 2018 08:49, "Jürgen Albersdorfer" wrote:

> Cool, good to know. Do you know whether this is still true for 3.11.1?

Well, I've never tried with that specific version, but this is pretty fundamental, so I would expect it to work the same way. Test in isolation if you want to be sure, though.

I don't think this is documented anywhere, however, since I had the same doubts before seeing it work for the first time.

--
Alex

Am 03.02.2018 um 08:19 schrieb Oleksandr Shulgin <oleksandr.shul...@zalando.de>:

On 3 Feb 2018 02:42, "Kyrylo Lebediev" wrote:

> Thanks, Oleksandr,
>
> In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead. My question is not about the data movement/copy itself, but more about all this token magic.
>
> Okay, let's say we stopped the old node and moved the data to the new node. Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right? For a cluster with vnodes enabled, during the addition of a new node its token ranges are calculated automatically by C* on startup.
>
> So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was? How will the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one? Do you know about any limitations of this process in the case of C* 2.1.x with vnodes enabled?

A node stores its tokens and host id in the system.local table. Next time it starts up, it will use the same tokens as previously, and the host id allows the rest of the cluster to see that it is the same node and ignore the IP address change. This happens regardless of the auto_bootstrap setting.

Try "select * from system.local" to see what is recorded for the old node. When the new node starts up it should log "Using saved tokens" with the list of numbers. Other nodes should log something like "ignoring IP address change" for the affected node addresses.

Be careful, though, to make sure that you put the data directory exactly where the new node expects to find it: otherwise it might just join as a brand-new node, allocating new tokens. As a precaution it helps to ensure that the system user running the Cassandra process has no permission to create the data directory: this should stop the startup in case of misconfiguration.

Cheers,
--
Alex
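The log check suggested above can be sketched like this, against a mock excerpt of the new node's startup log (the real system.log location depends on the installation, commonly under /var/log/cassandra/):

```shell
# Mock excerpt of the new node's startup log (real path varies by install).
cat > /tmp/mock_system.log <<'EOF'
INFO  [main] 2018-02-03 11:33:01,900 StorageService.java - Using saved tokens [-7774421781914237508, 2257660731441815305, 51879124242594885]
EOF

# Seeing "Using saved tokens" confirms the node reused the old node's
# tokens instead of allocating new ones as a brand-new member.
grep -c "Using saved tokens" /tmp/mock_system.log
# prints: 1
```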
Re: Cassandra 2.1: replace running node without streaming
Cool, good to know. Do you know whether this is still true for 3.11.1?

> Am 03.02.2018 um 08:19 schrieb Oleksandr Shulgin:
>
> On 3 Feb 2018 02:42, "Kyrylo Lebediev" wrote:
>
> > Thanks, Oleksandr,
> >
> > In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead. My question is not about the data movement/copy itself, but more about all this token magic.
> >
> > Okay, let's say we stopped the old node and moved the data to the new node. Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right? For a cluster with vnodes enabled, during the addition of a new node its token ranges are calculated automatically by C* on startup.
> >
> > So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was? How will the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one? Do you know about any limitations of this process in the case of C* 2.1.x with vnodes enabled?
>
> A node stores its tokens and host id in the system.local table. Next time it starts up, it will use the same tokens as previously, and the host id allows the rest of the cluster to see that it is the same node and ignore the IP address change. This happens regardless of the auto_bootstrap setting.
>
> Try "select * from system.local" to see what is recorded for the old node. When the new node starts up it should log "Using saved tokens" with the list of numbers. Other nodes should log something like "ignoring IP address change" for the affected node addresses.
>
> Be careful, though, to make sure that you put the data directory exactly where the new node expects to find it: otherwise it might just join as a brand-new node, allocating new tokens. As a precaution it helps to ensure that the system user running the Cassandra process has no permission to create the data directory: this should stop the startup in case of misconfiguration.
>
> Cheers,
> --
> Alex
Re: Cassandra 2.1: replace running node without streaming
On 3 Feb 2018 02:42, "Kyrylo Lebediev" wrote:

> Thanks, Oleksandr,
>
> In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead. My question is not about the data movement/copy itself, but more about all this token magic.
>
> Okay, let's say we stopped the old node and moved the data to the new node. Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right? For a cluster with vnodes enabled, during the addition of a new node its token ranges are calculated automatically by C* on startup.
>
> So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was? How will the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one? Do you know about any limitations of this process in the case of C* 2.1.x with vnodes enabled?

A node stores its tokens and host id in the system.local table. Next time it starts up, it will use the same tokens as previously, and the host id allows the rest of the cluster to see that it is the same node and ignore the IP address change. This happens regardless of the auto_bootstrap setting.

Try "select * from system.local" to see what is recorded for the old node. When the new node starts up it should log "Using saved tokens" with the list of numbers. Other nodes should log something like "ignoring IP address change" for the affected node addresses.

Be careful, though, to make sure that you put the data directory exactly where the new node expects to find it: otherwise it might just join as a brand-new node, allocating new tokens. As a precaution it helps to ensure that the system user running the Cassandra process has no permission to create the data directory: this should stop the startup in case of misconfiguration.

Cheers,
--
Alex
Re: Cassandra 2.1: replace running node without streaming
Thanks, Oleksandr,

In my case I'll need to replace all nodes in the cluster (one by one), so streaming will introduce perceptible overhead. My question is not about the data movement/copy itself, but more about all this token magic.

Okay, let's say we stopped the old node and moved the data to the new node. Once it's started with auto_bootstrap=false it will be added to the cluster like a usual node, just skipping the streaming stage, right? For a cluster with vnodes enabled, during the addition of a new node its token ranges are calculated automatically by C* on startup.

So, how will C* know that this new node must be responsible for exactly the same token ranges as the old node was? How will the rest of the nodes in the cluster ('peers') figure out that the old node should be replaced in the ring by the new one? Do you know about any limitations of this process in the case of C* 2.1.x with vnodes enabled?

Regards,
Kyrill

From: Oleksandr Shulgin
Sent: Friday, February 2, 2018 4:26:30 PM
To: User
Subject: Re: Cassandra 2.1: replace running node without streaming

On Fri, Feb 2, 2018 at 3:15 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com> wrote:

> Hello All!
>
> I've got a pretty standard task: to replace a running C* node [version 2.1.15, vnodes=256, Ec2Snitch] (the IP address will change after replacement; I have no control over that).
>
> There are 2 ways stated in the C* documentation how this can be done:
>
> 1) Add a new node, then 'nodetool decommission' [= 2 data streamings + 2 token range recalculations]
> 2) Stop the node, then replace it by setting -Dcassandra.replace_address [= 1 data streaming]
>
> Unfortunately, both of these methods imply data streaming.
>
> Is there a supported way to replace a live healthy node without data streaming / bootstrapping? Something like: "Stop the old node, copy the data to the new node, start the new node with auto_bootstrap=false, etc."

On EC2, if you're using EBS it's pretty easy: drain and stop the old node, attach the volume to the new one and start it. If not using EBS, then you have to copy the data to the new node before it is started.

> I was able to find a couple of manuals on the Internet, like this one:
> http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/,
> but not having an understanding of C* internals, I don't know whether such hacks are safe.

More or less like that: rsync while the old node is still running, then stop the node and rsync again. But given all the hassle, streaming with replace_address doesn't sound too costly to me.

Cheers,
--
Alex
Re: Cassandra 2.1: replace running node without streaming
On Fri, Feb 2, 2018 at 3:15 PM, Kyrylo Lebediev wrote:

> Hello All!
>
> I've got a pretty standard task: to replace a running C* node [version 2.1.15, vnodes=256, Ec2Snitch] (the IP address will change after replacement; I have no control over that).
>
> There are 2 ways stated in the C* documentation how this can be done:
>
> 1) Add a new node, then 'nodetool decommission' [= 2 data streamings + 2 token range recalculations]
> 2) Stop the node, then replace it by setting -Dcassandra.replace_address [= 1 data streaming]
>
> Unfortunately, both of these methods imply data streaming.
>
> Is there a supported way to replace a live healthy node without data streaming / bootstrapping? Something like: "Stop the old node, copy the data to the new node, start the new node with auto_bootstrap=false, etc."

On EC2, if you're using EBS it's pretty easy: drain and stop the old node, attach the volume to the new one and start it. If not using EBS, then you have to copy the data to the new node before it is started.

> I was able to find a couple of manuals on the Internet, like this one:
> http://engineering.mydrivesolutions.com/posts/cassandra_nodes_replacement/,
> but not having an understanding of C* internals, I don't know whether such hacks are safe.

More or less like that: rsync while the old node is still running, then stop the node and rsync again. But given all the hassle, streaming with replace_address doesn't sound too costly to me.

Cheers,
--
Alex