On Thu, Apr 21, 2022 at 8:18 PM john tillman <[email protected]> wrote:
>
> > On 21.04.2022 18:26, john tillman wrote:
> >>> On 20. 04. 22 at 20:21, john tillman wrote:
> >>>>> On 20.04.2022 19:53, john tillman wrote:
> >>>>>> I have a two node cluster that won't start any resources if only one
> >>>>>> node is booted; the pacemaker service does not start.
> >>>>>>
> >>>>>> Once the second node boots up, the first node will start pacemaker and
> >>>>>> the resources are started. All is well. But I would like the resources
> >>>>>> to start when the first node boots by itself.
> >>>>>>
> >>>>>> I thought the problem was with the wait_for_all option but I have it
> >>>>>> set to "0".
> >>>>>>
> >>>>>> On the node that is booted by itself, when I run "corosync-quorumtool"
> >>>>>> I see:
> >>>>>>
> >>>>>> [root@test00 ~]# corosync-quorumtool
> >>>>>> Quorum information
> >>>>>> ------------------
> >>>>>> Date:             Wed Apr 20 16:05:07 2022
> >>>>>> Quorum provider:  corosync_votequorum
> >>>>>> Nodes:            1
> >>>>>> Node ID:          1
> >>>>>> Ring ID:          1.2f
> >>>>>> Quorate:          Yes
> >>>>>>
> >>>>>> Votequorum information
> >>>>>> ----------------------
> >>>>>> Expected votes:   2
> >>>>>> Highest expected: 2
> >>>>>> Total votes:      1
> >>>>>> Quorum:           1
> >>>>>> Flags:            2Node Quorate
> >>>>>>
> >>>>>> Membership information
> >>>>>> ----------------------
> >>>>>>     Nodeid      Votes Name
> >>>>>>          1          1 test00 (local)
> >>>>>>
> >>>>>>
> >>>>>> My config file looks like this:
> >>>>>> totem {
> >>>>>>     version: 2
> >>>>>>     cluster_name: testha
> >>>>>>     transport: knet
> >>>>>>     crypto_cipher: aes256
> >>>>>>     crypto_hash: sha256
> >>>>>> }
> >>>>>>
> >>>>>> nodelist {
> >>>>>>     node {
> >>>>>>         ring0_addr: test00
> >>>>>>         name: test00
> >>>>>>         nodeid: 1
> >>>>>>     }
> >>>>>>
> >>>>>>     node {
> >>>>>>         ring0_addr: test01
> >>>>>>         name: test01
> >>>>>>         nodeid: 2
> >>>>>>     }
> >>>>>> }
> >>>>>>
> >>>>>> quorum {
> >>>>>>     provider: corosync_votequorum
> >>>>>>     two_node: 1
> >>>>>>     wait_for_all: 0
> >>>>>> }
> >>>>>>
> >>>>>> logging {
> >>>>>>     to_logfile: yes
> >>>>>>     logfile: /var/log/cluster/corosync.log
> >>>>>>     to_syslog: yes
> >>>>>>     timestamp: on
> >>>>>>     debug: on
> >>>>>>     syslog_priority: debug
> >>>>>>     logfile_priority: debug
> >>>>>> }
> >>>>>>
> >>>>>> Fencing is disabled.
> >>>>>>
> >>>>>
> >>>>> That won't work.
> >>>>>
> >>>>>> I've also looked in "corosync.log" but I don't know what to look for
> >>>>>> to diagnose this issue. I mean there are many lines similar to:
> >>>>>> [QUORUM] This node is within the primary component and will provide service.
> >>>>>> and
> >>>>>> [VOTEQ ] Sending quorum callback, quorate = 1
> >>>>>> and
> >>>>>> [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: Yes
> >>>>>> Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> >>>>>>
> >>>>>> Is there something specific I should look for in the log?
> >>>>>>
> >>>>>> So can a two node cluster work after booting only one node? Maybe it
> >>>>>> never will and I am wasting a lot of time, yours and mine.
> >>>>>>
> >>>>>> If it can, what else can I investigate further?
> >>>>>>
> >>>>>
> >>>>> Before a node can start handling resources it needs to know the status
> >>>>> of the other node. Without successful fencing there is no way to
> >>>>> accomplish that.
> >>>>>
> >>>>> Yes, you can tell pacemaker to ignore the unknown status. Depending on
> >>>>> your resources this could simply prevent normal work or lead to data
> >>>>> corruption.
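The override referred to here lives on the Pacemaker side rather than in corosync.conf. A minimal sketch of what it could look like with pcs, assuming the "ignore unknown status" knob meant is the startup-fencing cluster property (that name is an assumption, it is not stated anywhere in this thread), and accepting the split-brain risk described above:

    # Assumption: "ignore unknown status" maps to the startup-fencing property.
    # Unsafe: nodes the cluster has never seen are treated as cleanly down.
    pcs property set startup-fencing=false
    # Matches "Fencing is disabled" above; together these defeat Pacemaker's safety model.
    pcs property set stonith-enabled=false

This is exactly the kind of setting the warning above is about: it lets a lone node start resources, at the price of possible data corruption if the other node is actually alive but unreachable.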
> >>>> Makes sense. Thank you.
> >>>>
> >>>> Perhaps some future enhancement could allow for this situation? I mean,
> >>>> it might be desirable in some cases to allow a single node to boot,
> >>>> determine quorum by two_node=1 and wait_for_all=0, and start resources
> >>>> without ever seeing the other node. Sure, there are dangers of split
> >>>> brain but I can see special cases where I want the node to work alone
> >>>> for a period of time despite the danger.
> >>>>
> >>>
> >>> Hi John,
> >>>
> >>> How about 'pcs quorum unblock'?
> >>>
> >>> Regards,
> >>> Tomas
> >>>
> >>
> >> Tomas,
> >>
> >> Thank you for the suggestion. However, it didn't work. It returned:
> >>     Error: unable to check quorum status
> >>     crm_mon: Error: cluster is not available on this node
> >> I checked pacemaker, just in case, and it still isn't running.
> >>
> >
> > Either pacemaker or some service it depends upon attempted to start and
> > failed, or systemd is still waiting for some service that is required
> > before pacemaker. Check logs or provide "journalctl -b" output in this
> > state.
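A sketch of how that state could be inspected on the booted node, assuming the stock corosync and pacemaker systemd units (the unit names are the usual ones, not quoted from the thread):

    # Did systemd ever try to start the units, and what state are they in now?
    systemctl status corosync.service pacemaker.service
    # Is systemd still holding a queued job, i.e. waiting on a dependency?
    systemctl list-jobs
    # Only the corosync/pacemaker messages from the current boot
    journalctl -b -u corosync -u pacemaker --no-pager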
> I looked at pacemaker's log and it does not have any updates since the
> system was shut down. When we booted the node, if pacemaker had started and
> failed, or started and was stopped by systemd, there would be something in
> this log, no?
>
> journalctl -b is lengthy and I'd rather not attach it here, but I grep'd
> through it and I can't find any pacemaker references. No errors reported
> from systemd.
>
> Once the other node is started, something starts the pacemaker service.
> The pacemaker log starts filling up. journalctl -b sees plenty of pacemaker
> entries. crm_mon and pcs status are working right and show the cluster in
> a good state with all resources started properly.
>
> So I don't see anything stopping pacemaker from starting at boot. It
> looks like some piece of cluster software is starting it once the second
> node is online. Maybe corosync? Although the corosync log doesn't mention
> the start of anything. All it logs is seeing the second node join.
>
> So what starts pacemaker in this case?

Definitely a good question! With the systemd unit files as usually
distributed, this behavior is hard to explain. So the first thing I would
check is the pacemaker unit file (check all locations that might override
the one shipped with pacemaker).
Maybe somebody tried to implement something similar to wait_for_all by
checking whether corosync reports the cluster partition as quorate before
starting pacemaker.
But actually I would then expect 'journalctl -u pacemaker' to show some
sign of the unit being started (which should of course show up in
'journalctl -b' as well).

Klaus
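Checking for the kind of local override described above could look roughly like this; the drop-in directory is only the conventional location, nothing in the thread confirms it exists:

    # Show the unit file actually in effect, including any drop-in fragments and their paths
    systemctl cat pacemaker.service
    # Where the unit and its drop-ins were loaded from
    systemctl show pacemaker.service -p FragmentPath -p DropInPaths
    # Conventional override directory (may not exist)
    ls -l /etc/systemd/system/pacemaker.service.d/ 2>/dev/null
    # Any trace of the unit being started during this boot
    journalctl -b -u pacemaker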
>
> Thank you for the response.
>
> -John
>
>
> >> I'm very curious how I could convince the cluster to start its resources
> >> on one node in the event that the other node is not able to boot. But I'm
> >> afraid the answer is either to use fencing, or add a third node to the
> >> cluster, or both.
> >>
> >> -John
> >>
> >>
> >>>> Thank you again.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/