Chris,
Thanks a lot for the info. I'll explore both options.
_Vitaly

> On December 18, 2018 at 11:13 AM Chris Walker <[email protected]> wrote:
>
> Looks like rhino66-left was scheduled for fencing because it was not present
> within 20 seconds (the dc-deadtime parameter) after rhino66-right started
> Pacemaker (startup fencing). I can think of a couple of ways to allow all
> nodes to survive if they come up far apart in time (i.e., farther apart than
> dc-deadtime):
>
> 1. Increase dc-deadtime. Unfortunately, the cluster always waits for
> dc-deadtime to expire before starting resources, so this can delay your
> cluster's startup.
>
> 2. As Ken mentioned, synchronize the starting of Corosync and Pacemaker. I
> did this with a simple ExecStartPre systemd script:
>
> [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root@bug0 ~]#
>
> where ha_wait.sh has something like:
>
> #!/bin/bash
>
> timeout=600
> peer=<hostname of HA peer>
>
> echo "Waiting for ${peer}"
>
> # Succeed once the peer's corosync unit reports any state other than
> # inactive (including failed), i.e. the peer is reachable via systemd
> # and has at least attempted to start corosync.
> peerup() {
>     systemctl -H ${peer} show -p ActiveState corosync.service 2> /dev/null | \
>         egrep -q "=active|=reloading|=failed|=activating|=deactivating" && return 0
>     return 1
> }
>
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>     echo -n .
>     sleep 5
> done
>
> peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout}s, starting HA alone"
>
> This will cause corosync startup to block for up to 10 minutes waiting for
> the partner node to come up, after which both nodes will start
> corosync/pacemaker close together in time. If one node never comes up, the
> surviving node will wait the full 10 minutes before starting, after which
> the absent node will be fenced (startup fencing and subsequent resource
> startup will only occur if no-quorum-policy is set to ignore).
>
> HTH,
> Chris
>
> On 12/17/18 6:25 PM, Vitaly Zolotusky wrote:
> > Ken, Thank you very much for the quick response!
> > I do have "two_node: 1" in corosync.conf. I have attached it to this email
> > (not from the same system as the original messages, but they are all the
> > same).
> > Syncing the startup of corosync and pacemaker on different nodes would be
> > a problem for us.
> > I suspect that the problem is that corosync assumes quorum is reached as
> > soon as corosync is started on both nodes, but pacemaker does not abort
> > fencing until pacemaker starts on the other node.
> >
> > I will try to work around this issue by moving the corosync and pacemaker
> > startups on a single node as close to each other as possible.
> > Thanks again!
> > _Vitaly
> >
> > > On December 17, 2018 at 6:01 PM Ken Gaillot <[email protected]> wrote:
> > >
> > > On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > > > Hello,
> > > > I have a 2 node cluster and stonith is configured for SBD and
> > > > fence_ipmilan.
> > > > fence_ipmilan for node 1 is configured with 0 delay and for node 2
> > > > with a 30 sec delay so that the nodes do not start killing each other
> > > > during startup.
> > >
> > > If you're using corosync 2 or later, you can set "two_node: 1" in
> > > corosync.conf. That implies the wait_for_all option, so that at
> > > start-up, both nodes must be present before quorum can be reached the
> > > first time. (After that point, one node can go away and quorum will be
> > > retained.)
> > >
> > > Another way to avoid this is to start corosync on all nodes, then start
> > > pacemaker on all nodes.
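
(For reference, a rough sketch of what the two settings discussed above look
like; the values are illustrative assumptions, not taken from this cluster.
The quorum stanza goes in /etc/corosync/corosync.conf, and dc-deadtime is a
Pacemaker cluster property, shown here via pcs if that is what you manage the
cluster with:

    # corosync.conf quorum section -- two_node implies wait_for_all,
    # so the first quorum after startup needs both nodes present
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

    # Chris's option 1: give a late-booting peer more time before startup
    # fencing kicks in; 120s is just an example value, and per the note
    # above the cluster waits this long before starting resources
    pcs property set dc-deadtime=120s

Either way, the trade-off is extra startup delay in exchange for not fencing
a peer that is merely slow to boot.)
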
> > > > In some cases (usually right after installation and when node 1 comes
> > > > up first and node 2 second) the node that comes up first (node 1)
> > > > states that node 2 is unclean, but can't fence it until quorum is
> > > > reached.
> > > > Then as soon as quorum is reached after startup of corosync on node 2,
> > > > it sends a fence request for node 2.
> > > > Fence_ipmilan goes into its 30 sec delay.
> > > > Pacemaker gets started on node 2.
> > > > While fence_ipmilan is still waiting out the delay, node 1's crmd
> > > > aborts the transition that requested the fence.
> > > > Even though the transition was aborted, when the delay time expires
> > > > node 2 gets fenced.
> > >
> > > Currently, pacemaker has no way of cancelling fencing once it's been
> > > initiated. Technically, it would be possible to cancel an operation
> > > that's in the delay stage (assuming that no other fence device has
> > > already been attempted, if there is more than one), but that hasn't
> > > been implemented.
> > >
> > > > Excerpts from the messages are below. I also attached messages from
> > > > both nodes and pe-input files from node 1.
> > > > Any suggestions would be appreciated.
> > > > Thank you very much for your help!
> > > > Vitaly Zolotusky
> > > >
> > > > Here are excerpts from the messages:
> > > >
> > > > Node 1 - controller - rhino66-right 172.18.51.81 - came up first *****************
> > > >
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Fencing and resource management disabled due to lack of quorum
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Node rhino66-left.lab.archivas.com is unclean!
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > > > .....
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] This node is within the primary component and will provide service.
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] Members[2]: 1 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Quorum acquired
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Quorum acquired
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Node (null) state is now member
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Node (null) state is now member
> > > > Nov 29 16:48:54 rhino66-right crmd[22184]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: Watchdog will be used via SBD if fencing is required
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Scheduling Node rhino66-left.lab.archivas.com for STONITH
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Fence (reboot) rhino66-left.lab.archivas.com 'node is unclean'
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fence_sbd ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ipmi-82 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start S_IP ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start postgres:0 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ethmonitor:0 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fs_monitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start tomcat-instance:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ClusterMonitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Calculated transition 5 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > >
> > > > Nov 29 16:48:54 rhino66-right crmd[22184]: notice: Requesting fencing (reboot) of node rhino66-left.lab.archivas.com
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Client crmd.22184.4aa20c13 wants to fence (reboot) 'rhino66-left.lab.archivas.com' with device '(any)'
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Requesting peer fencing (reboot) of rhino66-left.lab.archivas.com
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: fence_sbd can fence (reboot) rhino66-left.lab.archivas.com: static-list
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: ipmi-82 can fence (reboot) rhino66-left.lab.archivas.com: static-list
> > > > Nov 29 16:48:54 rhino66-right /fence_ipmilan: Delay 30 second(s) before logging in to the fence device
> > > > .....
> > > > Nov 29 16:49:14 rhino66-right cib[22177]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:16 rhino66-right crmd[22184]: notice: Transition aborted: Node join
> > > >
> > > > Nov 29 16:49:20 rhino66-right cib[22177]: notice: Local CIB 1.33.2.abc5436abfebbac946f69a2776e7a73a differs from rhino66-left.lab.archivas.com: 1.31.0.62aaa721579d5d8189b01b400534dc05 0x5568eb66c590
> > > > Nov 29 16:49:25 rhino66-right /fence_ipmilan: Executing: /usr/bin/ipmitool -I lanplus -H 172.16.1.2 -p 623 -U admin -P [set] -L ADMINISTRATOR chassis power off
> > > >
> > > > Node 2 - rhino66-left 172.18.51.82 - comes up second ********
> > > >
> > > > Nov 29 16:48:38 rhino66-left systemd[1]: Starting Corosync Cluster Engine...
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync Cluster Engine ('2.4.4'): started and ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync built-in features: dbus rdma systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Please migrate config file to nodelist.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transport (UDP/IP Unicast).
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] The network interface [172.16.1.82] is now up.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration map access [0]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cmap
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration service [1]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cfg
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cpg
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync profile loading service [4]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Using quorum provider corosync_votequorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: votequorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: quorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.81}
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.82}
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.82:56) was formed. Members joined: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[1]: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 1
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] This node is within the primary component and will provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[2]: 1 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6207]: Starting Corosync Cluster Engine (corosync): [ OK ]
> > > > Nov 29 16:48:38 rhino66-left systemd[1]: Started Corosync Cluster Engine.
> > > > ......
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Starting Pacemaker 1.1.18-2.fc28.1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Quorum acquired
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node (null) state is now member
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node rhino66-left.lab.archivas.com state is now member
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left cib[21175]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
> > > > Nov 29 16:49:14 rhino66-left covermon.sh[21154]: 2018-11-29 16:49:14,464 INFO at covermon.py line 49 [MainThread:21154] New enclosure discovered: sgDev=/dev/sg7 serial=SGFTJ18263C8B07
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Node (null) state is now member
> > > > ......
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Quorum acquired
> > > > Nov 29 16:49:15 rhino66-left cib[21175]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Watchdog will be used via SBD if fencing is required
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node rhino66-left.lab.archivas.com state is now member
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Recorded attribute writer: rhino66-right.lab.archivas.com
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: The local CRM is operational
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: State transition S_STARTING -> S_PENDING
> > > > Nov 29 16:49:16 rhino66-left stonith-ng[21176]: notice: Added 'fence_sbd' to the device list (1 active devices)
> > > > Nov 29 16:49:16 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:17 rhino66-left stonith-ng[21176]: notice: Added 'ipmi-81' to the device list (2 active devices)
> > > > Nov 29 16:49:20 rhino66-left attrd[21178]: notice: Updating all attributes after cib_refresh_notify event
> > > >
> > > > ......
> > > >
> > > > Nov 29 16:49:27 rhino66-left systemd-logind[1539]: System is powering down.

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
