Chris,
Thanks a lot for the info. I'll explore both options.
_Vitaly

> On December 18, 2018 at 11:13 AM Chris Walker <[email protected]> wrote:
>
> Looks like rhino66-left was scheduled for fencing because it was not present
> within 20 seconds (the dc-deadtime parameter) after rhino66-right started
> Pacemaker (startup fencing). I can think of a couple of ways to allow all
> nodes to survive if they come up far apart in time (i.e., farther apart than
> dc-deadtime):
>
> 1. Increase dc-deadtime. Unfortunately, the cluster always waits for
> dc-deadtime to expire before starting resources, so this can delay your
> cluster's startup.
>
> 2. As Ken mentioned, synchronize the starting of Corosync and Pacemaker. I
> did this with a simple ExecStartPre systemd script:
>
> [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root@bug0 ~]#
>
> where ha_wait.sh has something like:
>
> #!/bin/bash
>
> timeout=600
> peer=<hostname of HA peer>
>
> echo "Waiting for ${peer}"
>
> # Succeed once the peer's corosync unit reports any state other than
> # inactive (including failed), i.e. the peer is reachable via systemd
> # and has at least attempted to start corosync.
> peerup() {
>     systemctl -H ${peer} show -p ActiveState corosync.service 2> /dev/null | \
>         egrep -q "=active|=reloading|=failed|=activating|=deactivating" && return 0
>     return 1
> }
>
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>     echo -n .
>     sleep 5
> done
>
> peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout}s, starting HA alone"
>
> This will cause corosync startup to block for up to 10 minutes waiting for
> the partner node to come up, after which both nodes will start
> corosync/pacemaker close together in time. If one node never comes up, the
> surviving node will wait the full 10 minutes before starting, after which
> the absent node will be fenced (startup fencing and subsequent resource
> startup will only occur if no-quorum-policy is set to ignore).
>
> HTH,
> Chris
>
> On 12/17/18 6:25 PM, Vitaly Zolotusky wrote:
> > Ken, Thank you very much for the quick response!
> > I do have "two_node: 1" in corosync.conf. I have attached it to this email
> > (not from the same system as the original messages, but they are all the
> > same).
> > Syncing the startup of corosync and pacemaker on different nodes would be
> > a problem for us.
> > I suspect that the problem is that corosync assumes quorum is reached as
> > soon as corosync is started on both nodes, but pacemaker does not abort
> > fencing until pacemaker starts on the other node.
> >
> > I will try to work around this issue by moving the corosync and pacemaker
> > startups on a single node as close to each other as possible.
> > Thanks again!
> > _Vitaly
> >
> > > On December 17, 2018 at 6:01 PM Ken Gaillot <[email protected]> wrote:
> > >
> > > On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > > > Hello,
> > > > I have a 2 node cluster and stonith is configured for SBD and
> > > > fence_ipmilan.
> > > > fence_ipmilan for node 1 is configured with 0 delay and for node 2
> > > > with a 30 sec delay so that the nodes do not start killing each other
> > > > during startup.
> > >
> > > If you're using corosync 2 or later, you can set "two_node: 1" in
> > > corosync.conf. That implies the wait_for_all option, so that at
> > > start-up, both nodes must be present before quorum can be reached the
> > > first time. (After that point, one node can go away and quorum will be
> > > retained.)
> > >
> > > Another way to avoid this is to start corosync on all nodes, then start
> > > pacemaker on all nodes.
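
(For reference, a rough sketch of what the two settings discussed above look
like; the values are illustrative assumptions, not taken from this cluster.
The quorum stanza goes in /etc/corosync/corosync.conf, and dc-deadtime is a
Pacemaker cluster property, shown here via pcs if that is what you manage the
cluster with:

    # corosync.conf quorum section -- two_node implies wait_for_all,
    # so the first quorum after startup needs both nodes present
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

    # Chris's option 1: give a late-booting peer more time before startup
    # fencing kicks in; 120s is just an example value, and per the note
    # above the cluster waits this long before starting resources
    pcs property set dc-deadtime=120s

Either way, the trade-off is extra startup delay in exchange for not fencing
a peer that is merely slow to boot.)
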
> > > > In some cases (usually right after installation and when node 1 comes
> > > > up first and node 2 second) the node that comes up first (node 1)
> > > > states that node 2 is unclean, but can't fence it until quorum is
> > > > reached.
> > > > Then as soon as quorum is reached after startup of corosync on node 2,
> > > > it sends a fence request for node 2.
> > > > Fence_ipmilan goes into its 30 sec delay.
> > > > Pacemaker gets started on node 2.
> > > > While fence_ipmilan is still waiting out the delay, node 1's crmd
> > > > aborts the transition that requested the fence.
> > > > Even though the transition was aborted, when the delay time expires
> > > > node 2 gets fenced.
> > >
> > > Currently, pacemaker has no way of cancelling fencing once it's been
> > > initiated. Technically, it would be possible to cancel an operation
> > > that's in the delay stage (assuming that no other fence device has
> > > already been attempted, if there is more than one), but that hasn't
> > > been implemented.
> > >
> > > > Excerpts from the messages are below. I also attached messages from
> > > > both nodes and pe-input files from node 1.
> > > > Any suggestions would be appreciated.
> > > > Thank you very much for your help!
> > > > Vitaly Zolotusky
> > > >
> > > > Here are excerpts from the messages:
> > > >
> > > > Node 1 - controller - rhino66-right 172.18.51.81 - came up first *****************
> > > >
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Fencing and resource management disabled due to lack of quorum
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Node rhino66-left.lab.archivas.com is unclean!
> > > > Nov 29 16:47:54 rhino66-right pengine[22183]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > > > .....
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] This node is within the primary component and will provide service.
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] Members[2]: 1 2
> > > > Nov 29 16:48:38 rhino66-right corosync[6677]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Quorum acquired
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Quorum acquired
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Node (null) state is now member
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Node (null) state is now member
> > > > Nov 29 16:48:54 rhino66-right crmd[22184]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: Watchdog will be used via SBD if fencing is required
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Scheduling Node rhino66-left.lab.archivas.com for STONITH
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Fence (reboot) rhino66-left.lab.archivas.com 'node is unclean'
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fence_sbd ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ipmi-82 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start S_IP ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start postgres:0 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ethmonitor:0 ( rhino66-right.lab.archivas.com )
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fs_monitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start tomcat-instance:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ClusterMonitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
> > > > Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Calculated transition 5 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > >
> > > > Nov 29 16:48:54 rhino66-right crmd[22184]: notice: Requesting fencing (reboot) of node rhino66-left.lab.archivas.com
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Client crmd.22184.4aa20c13 wants to fence (reboot) 'rhino66-left.lab.archivas.com' with device '(any)'
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Requesting peer fencing (reboot) of rhino66-left.lab.archivas.com
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: fence_sbd can fence (reboot) rhino66-left.lab.archivas.com: static-list
> > > > Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: ipmi-82 can fence (reboot) rhino66-left.lab.archivas.com: static-list
> > > > Nov 29 16:48:54 rhino66-right /fence_ipmilan: Delay 30 second(s) before logging in to the fence device
> > > > .....
> > > > Nov 29 16:49:14 rhino66-right cib[22177]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:16 rhino66-right crmd[22184]: notice: Transition aborted: Node join
> > > >
> > > > Nov 29 16:49:20 rhino66-right cib[22177]: notice: Local CIB 1.33.2.abc5436abfebbac946f69a2776e7a73a differs from rhino66-left.lab.archivas.com: 1.31.0.62aaa721579d5d8189b01b400534dc05 0x5568eb66c590
> > > > Nov 29 16:49:25 rhino66-right /fence_ipmilan: Executing: /usr/bin/ipmitool -I lanplus -H 172.16.1.2 -p 623 -U admin -P [set] -L ADMINISTRATOR chassis power off
> > > >
> > > > Node 2 - rhino66-left 172.18.51.82 - comes up second ********
> > > >
> > > > Nov 29 16:48:38 rhino66-left systemd[1]: Starting Corosync Cluster Engine...
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync Cluster Engine ('2.4.4'): started and ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync built-in features: dbus rdma systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
> > > > Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Please migrate config file to nodelist.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transport (UDP/IP Unicast).
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] The network interface [172.16.1.82] is now up.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration map access [0]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cmap
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration service [1]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cfg
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cpg
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync profile loading service [4]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Using quorum provider corosync_votequorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: votequorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: quorum
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.81}
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.82}
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.82:56) was formed. Members joined: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[1]: 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 1
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] This node is within the primary component and will provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[2]: 1 2
> > > > Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
> > > > Nov 29 16:48:38 rhino66-left corosync[6207]: Starting Corosync Cluster Engine (corosync): [ OK ]
> > > > Nov 29 16:48:38 rhino66-left systemd[1]: Started Corosync Cluster Engine.
> > > > ......
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Starting Pacemaker 1.1.18-2.fc28.1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Quorum acquired
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node (null) state is now member
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node rhino66-left.lab.archivas.com state is now member
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:14 rhino66-left cib[21175]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
> > > > Nov 29 16:49:14 rhino66-left covermon.sh[21154]: 2018-11-29 16:49:14,464 INFO at covermon.py line 49 [MainThread:21154] New enclosure discovered: sgDev=/dev/sg7 serial=SGFTJ18263C8B07
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Node (null) state is now member
> > > > ......
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Connecting to cluster infrastructure: corosync
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 2
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Quorum acquired
> > > > Nov 29 16:49:15 rhino66-left cib[21175]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Watchdog will be used via SBD if fencing is required
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node (null) state is now member
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node rhino66-left.lab.archivas.com state is now member
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Recorded attribute writer: rhino66-right.lab.archivas.com
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: The local CRM is operational
> > > > Nov 29 16:49:15 rhino66-left crmd[21180]: notice: State transition S_STARTING -> S_PENDING
> > > > Nov 29 16:49:16 rhino66-left stonith-ng[21176]: notice: Added 'fence_sbd' to the device list (1 active devices)
> > > > Nov 29 16:49:16 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
> > > > Nov 29 16:49:17 rhino66-left stonith-ng[21176]: notice: Added 'ipmi-81' to the device list (2 active devices)
> > > > Nov 29 16:49:20 rhino66-left attrd[21178]: notice: Updating all attributes after cib_refresh_notify event
> > > >
> > > > ......
> > > >
> > > > Nov 29 16:49:27 rhino66-left systemd-logind[1539]: System is powering down.

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
