On Jan 11, 2019, at 3:53 AM, Jan Pokorný <[email protected]> wrote:
> On 11/01/19 00:16 +0000, Israel Brewster wrote:
>
>> On Jan 10, 2019, at 10:57 AM, Israel Brewster <[email protected]> wrote:
>>
>>> So in my ongoing work to upgrade my cluster to CentOS 7, I got one box
>>> up and running on CentOS 7, with the cluster fully configured and
>>> functional, and moved all my services over to it. Now I'm trying to add
>>> a second node, following the directions here:
>>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR
>>>
>>> However, it doesn't appear to be working. The existing node is named
>>> "follow3", and the new node I am trying to add is named "follow1":
>>>
>>> - The auth command run from follow3 returns "follow1: Authorized", so
>>>   that looks good.
>>> - The "pcs cluster node add follow1" command, again run on follow3,
>>>   gives the following output:
>>>
>>>   Disabling SBD service...
>>>   follow1: sbd disabled
>>>   Sending remote node configuration files to 'follow1'
>>>   follow1: successful distribution of the file 'pacemaker_remote authkey'
>>>   follow3: Corosync updated
>>>   Setting up corosync...
>>>   follow1: Succeeded
>>>   Synchronizing pcsd certificates on nodes follow1...
>>>   follow1: Success
>>>   Restarting pcsd on the nodes in order to reload the certificates...
>>>   follow1: Success
>>>
>>> ...So it would appear that that worked as well. I then issued the
>>> "pcs cluster start --all" command, which gave the following output:
>>>
>>>   [root@follow3 ~]# pcs cluster start --all
>>>   follow3: Starting Cluster (corosync)...
>>>   follow1: Starting Cluster (corosync)...
>>>   follow3: Starting Cluster (pacemaker)...
>>>   follow1: Starting Cluster (pacemaker)...
>>>
>>> So again, everything looks good (to me). However, when I run
>>> "pcs status" on the existing node, I get the following:
>>>
>>>   [root@follow3 ~]# pcs status
>>>   Cluster name: follow
>>>   Stack: corosync
>>>   Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
>>>   Last updated: Thu Jan 10 10:47:33 2019
>>>   Last change: Wed Jan 9 21:39:37 2019 by root via cibadmin on follow3
>>>
>>>   1 node configured
>>>   29 resources configured
>>>
>>>   Online: [ follow3 ]
>>>
>>>   Full list of resources:
>>>
>>> which would seem to indicate that it doesn't know about the node I just
>>> added (follow1). Meanwhile, "pcs status" on follow1 shows this:
>>>
>>>   [root@follow1 ~]# pcs status
>>>   Cluster name: follow
>>>   Stack: corosync
>>>   Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT quorum
>>>   Last updated: Thu Jan 10 10:54:25 2019
>>>   Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1
>>>
>>>   2 nodes configured
>>>   0 resources configured
>>>
>>>   Online: [ follow1 ]
>>>   OFFLINE: [ follow3 ]
>>>
>>>   No resources
>>>
>>>   Daemon Status:
>>>     corosync: active/disabled
>>>     pacemaker: active/disabled
>>>     pcsd: active/enabled
>>>
>>> So it got at least *some* of the config, but apparently not the full
>>> thing (no resources), and it shows follow3 as offline, even though it is
>>> online and reachable. Oddly, "pcs cluster status" shows the pcsd status
>>> of both follow1 and follow3 as online. What am I missing here?
>>
>> As a follow-up to the above, restarting corosync on the functioning node
>> (follow3) at least allows the second node (follow1) to show up when I do
>> a "pcs status"; however, the second node still shows as OFFLINE (and
>> follow3 shows as offline on follow1), and follow1 is still missing
>> pretty much all of the config. If I try to remove and re-add follow1,
>> the removal works as expected (the node count on follow3 drops to 1),
>> but the add behaves exactly the same as before, with "pcs status" not
>> acknowledging the added node.
>
> What do the logs on follow1 have to say about this? E.g.
>
>   journalctl -b --no-hostname -u corosync -u pacemaker
>
> focusing on the respective suspect time. If there's nothing sufficiently
> explaining what actually happened, you can still review the underlying
> pcs communication itself if you pass --debug to it.
>
> I suspect that simply one corosync instance doesn't see the other for
> whatever reason (firewall, bad addresses or not on the same network at
> all, addresses out of sync between particular nodes in corosync.conf, or
> possibly even in /etc/hosts or DNS source, ...).
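[For anyone retracing this thread, the checks Jan suggests might look roughly
like the following on a CentOS 7 node with firewalld. The hostnames match this
thread, but the exact corosync-cmapctl key names vary between corosync
versions, so treat this as a sketch rather than a recipe:

  # Logs from the corosync/pacemaker units since boot, on the new node
  journalctl -b --no-hostname -u corosync -u pacemaker

  # Re-run the failing pcs command with the client/daemon traffic shown
  pcs cluster node add follow1 --debug

  # Is the firewall letting cluster traffic through? The "high-availability"
  # firewalld service covers the corosync, pacemaker and pcsd ports.
  firewall-cmd --list-all
  firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload

  # Does corosync on each node actually see the other one?
  corosync-cfgtool -s
  corosync-cmapctl | grep members

  # Do both nodes agree on the nodelist and on name resolution?
  diff <(ssh follow1 cat /etc/corosync/corosync.conf) /etc/corosync/corosync.conf
  getent hosts follow1 follow3

If the corosync.conf nodelists or the resolved addresses differ between the
two nodes, that alone can produce exactly the "each node only sees itself"
symptom described above.]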
So apparently this was something messed up on follow3, although I don't know
what. I ended up doing the following, which worked (see the sketch of the
dump/load commands below):

1) Set up a new VM ("follow4")
2) Cluster it with follow1
3) Dump JUST the resources and constraints from follow3
4) Load the resulting .xml files into the new cluster (follow1 and follow4)

Once I did the above, I was able to add an additional node (follow2) to the
new follow1/follow4 cluster with no problems. So while I don't know what was
going on with follow3, at least I now have a properly functioning cluster
again!

> --
> Nazdar,
> Jan (Poki)
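[A sketch of what steps 3 and 4 above can look like in practice. The file
names are illustrative, and the scoped cibadmin invocations are as found on
the pacemaker 1.1.x builds shipped with CentOS 7, so check cibadmin --help on
your build before relying on them:

  # On the old node (follow3): dump only the resources and constraints
  # sections of the CIB, not the node entries or status
  cibadmin -Q -o resources > resources.xml
  cibadmin -Q -o constraints > constraints.xml

  # On the new cluster (follow1/follow4): load the resources first, then
  # the constraints that reference them
  cibadmin -R -o resources -x resources.xml
  cibadmin -R -o constraints -x constraints.xml

Resources are loaded before constraints because the constraints refer to
resource IDs that should already be present in the CIB.]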
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
