On 24.07.2018 20:59, O'Donovan, Garret wrote:
> Hello, and thank you for adding me to the list.
> 
> We are using Pacemaker in a two-node hot-warm redundancy configuration.  Both 
> nodes run ocf:pacemaker:ping (cloned) to monitor a ping group of devices.  
> The nodes share a virtual IP using ocf:heartbeat:IPaddr2.  Our applications 
> run in either Primary mode (doing all the work and sending status updates to 
> the peer) or Standby mode (receiving status updates, ready to take over if 
> the current primary fails).  We have constraints set up so that IPaddr2 fails 
> over on any failure Pacemaker detects (including ping group failure), and the 
> applications follow it.
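
(For reference, the constraint part of that setup corresponds roughly to pcs
commands like the following. This is an untested sketch; the resource names
are taken from the configuration shown further down.)

  pcs constraint location ClusterIP rule score=-INFINITY pingd lt 500 or not_defined pingd
  pcs constraint order start ClusterIP then start application
  pcs constraint colocation add ClusterIP with application INFINITY
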
> 
> This works great for most use cases, but we have an issue in one test case 
> where we disconnect the standby node (by pulling its Ethernet cable) for 
> about 30 seconds to a minute and then reconnect it.  The problem is that, 
> while the two nodes reconnect, Pacemaker seems to put the primary into 
> standby for a very short time and then makes it primary again.
> 
> Is there any way to prevent Pacemaker from doing this? Detailed config info 
> and a log snippet are below.
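
One thing worth checking while running that test: the location rule below ties
ClusterIP to the pingd attribute, so it is worth watching the node attributes
on both nodes while the cable is out and while it comes back, e.g. with a
one-shot status that includes node attributes:

  crm_mon -A -1

If pingd on the node holding ClusterIP drops below 500 or becomes undefined
during the reconnect, that rule alone is enough to move the IP (and the
application with it).
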
> 
> - Regards
> - Garret O'Donovan
> 
> 
> PLATFORM
> This is all running on CentOS 7 
> (centos-release-7-4.1708.el7.centos.x86_64) on VMs (VMware ESXi 5.5).  The 
> two nodes are hosted on physically different servers.
> 
> VERSION INFO
> corosync-2.4.3-2.el7_5.1.x86_64.rpm
> pacemaker-1.1.18-11.el7_5.2.x86_64.rpm
> pcs-0.9.162-5.el7.centos.1.x86_64.rpm
> resource-agents-3.9.5-124.el7.x86_64.rpm
> 
> PACEMAKER CONFIGURATION
> [root@DVTVM0302 ~]# pcs config show
> Cluster Name: vendor1
> Corosync Nodes:
>  dvtvm0302 dvtvm0303
> Pacemaker Nodes:
>  dvtvm0302 dvtvm0303
> 
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=24 ip=10.144.101.210 nic=eth0
>   Operations: monitor interval=1s (ClusterIP-monitor-interval-1s)
>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>  Resource: application (class=ocf provider=vendor type=application)
>   Operations: migrate_from interval=0s timeout=20 
> (application-migrate_from-interval-0s)
>               migrate_to interval=0s timeout=20 
> (application-migrate_to-interval-0s)
>               monitor interval=1s (application-monitor-interval-1s)
>               reload interval=0s timeout=20 (application-reload-interval-0s)
>               start interval=0s timeout=20 (application-start-interval-0s)
>               stop interval=0s timeout=20 (application-stop-interval-0s)
>  Clone: Connected-clone
>   Meta Attrs: interleave=true
>   Resource: Connected (class=ocf provider=pacemaker type=ping)
>    Attributes: attempts=2 dampen=1s debug=true host_list="10.10.24.5 
> 10.10.24.18" multiplier=1000
>    Operations: monitor interval=3s timeout=10 (Connected-monitor-interval-3s)
>                start interval=0 timeout=3 (Connected-start-interval-0)
>                stop interval=0s timeout=20 (Connected-stop-interval-0s)
> 
> Stonith Devices:

You are risking a real split brain here: with no fencing devices configured
and stonith-enabled=false, nothing stops both nodes from running ClusterIP and
the application when they lose sight of each other.
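
Since both nodes are ESXi guests, fencing could look roughly like the sketch
below. This is untested; fence_vmware_soap comes from the
fence-agents-vmware-soap package, and the vCenter address, credentials and the
guest names in pcmk_host_map are placeholders you would need to adjust:

  pcs stonith create vmfence fence_vmware_soap \
      ipaddr=vcenter.example.com login=fenceuser passwd=secret \
      ssl=1 ssl_insecure=1 \
      pcmk_host_map="dvtvm0302:DVTVM0302;dvtvm0303:DVTVM0303" \
      op monitor interval=60s
  pcs property set stonith-enabled=true

With two_node: 1 in corosync, no-quorum-policy=ignore should not be needed
either; the default can stay.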

> Fencing Levels:
> 
> Location Constraints:
>   Resource: ClusterIP
>     Constraint: location-ClusterIP
>       Rule: boolean-op=or score=-INFINITY  (id:location-ClusterIP-rule)
>         Expression: pingd lt 500  (id:location-ClusterIP-rule-expr)
>         Expression: not_defined pingd  (id:location-ClusterIP-rule-expr-1)
> Ordering Constraints:
>   start ClusterIP then start application (kind:Mandatory)
> Colocation Constraints:
>   ClusterIP with application (score:INFINITY)
> Ticket Constraints:
> 
> Alerts:
>  No alerts defined
> 
> Resources Defaults:
>  migration-threshold: 1
>  failure-timeout: 5s
>  cluster-recheck-interval: 5s
>  resource-stickiness: INFINITY
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: vendor1
>  dc-version: 1.1.18-11.el7_5.2-2b07d5c5a9
>  have-watchdog: false
>  no-quorum-policy: ignore
>  stonith-enabled: false
> 
> Quorum:
>   Options:
> 
> COROSYNC CONFIG FILE:
> 
> [root@DVTVM0302 corosync]# cat corosync.conf
> totem {
>     version: 2
>     cluster_name: vendor1
>     secauth: off
>     transport: udpu
> }
> 
> nodelist {
>     node {
>         ring0_addr: dvtvm0302
>         nodeid: 1
>     }
> 
>     node {
>         ring0_addr: dvtvm0303
>         nodeid: 2
>     }
> }
> 
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> 
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
> 
> 
> LOGFILE:
> /var/log/cluster/corosync.log
> 

Logs from the other node are probably needed too - it becomes DC after the
reconnect and so makes all the decisions. Its logs may contain the reason why
it decides to stop the resources.
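
To see exactly why the new DC decided to stop the resources, the pe-input file
it logs around that time can be replayed with crm_simulate, and crm_report can
gather the logs, CIB and pe-inputs from both nodes in one archive. A rough
sketch (the pe-input number and the time window are placeholders):

  # on dvtvm0303, replay the transition the new DC calculated
  crm_simulate -Ss -x /var/lib/pacemaker/pengine/pe-input-123.bz2

  # collect everything from both nodes around the event
  crm_report -f "2018-07-20 07:40" -t "2018-07-20 08:00" /tmp/vendor1-report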
...
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:  warning: 
> crmd_ha_msg_filter:    Another DC detected: dvtvm0303 (op=noop)
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:   notice: 
> do_state_transition:    State transition S_IDLE -> S_ELECTION | 
> input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:     info: 
> update_dc:    Unset DC. Was dvtvm0302
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:     info: 
> election_count_vote:    Election 6 (owner: 2) lost: vote from dvtvm0303 
> (Uptime)
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:   notice: 
> do_state_transition:    State transition S_ELECTION -> S_RELEASE_DC | 
> input=I_RELEASE_DC cause=C_FSA_INTERNAL origin=do_election_count_vote
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com       crmd:     info: 
> do_dc_release:    DC role released
...