On 24.07.2018 20:59, O'Donovan, Garret wrote:
> Hello, and thank you for adding me to the list.
>
> We are using Pacemaker in a two-node hot-warm redundancy configuration. Both
> nodes run ocf:pacemaker:ping (cloned) to monitor a ping group of devices.
> The nodes share a virtual IP using ocf:heartbeat:IPaddr2. Our applications
> run in either Primary mode (does all the work and sends status updates to its
> peer) or Standby mode (receives status updates and is ready to take over if
> the current primary fails). We have constraints set up so that IPaddr2 fails
> over on any failure Pacemaker detects (including ping group failure), and the
> applications follow it.
>
> This works great for most use cases, but we have issues in one test case
> where we disconnect the standby node (by yanking the Ethernet cable) for
> about 30 seconds to a minute and then reconnect it. The problem is that
> Pacemaker seems to put the primary into standby for a very short time while
> the two nodes reconnect, and then makes it primary again.
>
> Is there any way to prevent Pacemaker from doing this? Detailed config info
> and a log file snippet are below.
>
> - Regards
> - Garret O'Donovan
>
>
> PLATFORM
> This is all running on CentOS 7
> (centos-release-7-4.1708.el7.centos.x86_64) on VMs (VMware ESXi 5.5). The
> two nodes are hosted on physically different servers.
>
> VERSION INFO
> corosync-2.4.3-2.el7_5.1.x86_64.rpm
> pacemaker-1.1.18-11.el7_5.2.x86_64.rpm
> pcs-0.9.162-5.el7.centos.1.x86_64.rpm
> resource-agents-3.9.5-124.el7.x86_64.rpm
>
> PACEMAKER CONFIGURATION
> [root@DVTVM0302 ~]# pcs config show
> Cluster Name: vendor1
> Corosync Nodes:
> dvtvm0302 dvtvm0303
> Pacemaker Nodes:
> dvtvm0302 dvtvm0303
>
> Resources:
> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: cidr_netmask=24 ip=10.144.101.210 nic=eth0
> Operations: monitor interval=1s (ClusterIP-monitor-interval-1s)
> start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> Resource: application (class=ocf provider=vendor type=application)
> Operations: migrate_from interval=0s timeout=20
> (application-migrate_from-interval-0s)
> migrate_to interval=0s timeout=20
> (application-migrate_to-interval-0s)
> monitor interval=1s (application-monitor-interval-1s)
> reload interval=0s timeout=20 (application-reload-interval-0s)
> start interval=0s timeout=20 (application-start-interval-0s)
> stop interval=0s timeout=20 (application-stop-interval-0s)
> Clone: Connected-clone
> Meta Attrs: interleave=true
> Resource: Connected (class=ocf provider=pacemaker type=ping)
> Attributes: attempts=2 dampen=1s debug=true host_list="10.10.24.5
> 10.10.24.18" multiplier=1000
> Operations: monitor interval=3s timeout=10 (Connected-monitor-interval-3s)
> start interval=0 timeout=3 (Connected-start-interval-0)
> stop interval=0s timeout=20 (Connected-stop-interval-0s)
>
> Stonith Devices:
You are risking a real split brain here: with no stonith devices configured
(and stonith-enabled=false plus no-quorum-policy=ignore below), nothing stops
both nodes from running the resources at the same time once they lose sight
of each other.
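If fencing is possible in your environment, something along these lines would
at least give the cluster a way to resolve that. This is only a sketch for
ESXi/vCenter fencing - the address, the credentials, and the VM names in
pcmk_host_map are placeholders you would have to adapt:

  # assumes fence-agents-vmware-soap is installed and vCenter is reachable
  pcs stonith create vmfence fence_vmware_soap \
      ipaddr=vcenter.example.com login=clusteruser passwd=secret \
      ssl=1 ssl_insecure=1 \
      pcmk_host_map="dvtvm0302:DVTVM0302;dvtvm0303:DVTVM0303"
  # enable fencing only after the stonith device has been tested
  pcs property set stonith-enabled=true
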
> Fencing Levels:
>
> Location Constraints:
> Resource: ClusterIP
> Constraint: location-ClusterIP
> Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
> Expression: pingd lt 500 (id:location-ClusterIP-rule-expr)
> Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
> Ordering Constraints:
> start ClusterIP then start application (kind:Mandatory)
> Colocation Constraints:
> ClusterIP with application (score:INFINITY)
> Ticket Constraints:
>
> Alerts:
> No alerts defined
>
> Resources Defaults:
> migration-threshold: 1
> failure-timeout: 5s
> cluster-recheck-interval: 5s
> resource-stickiness: INFINITY
> Operations Defaults:
> No defaults set
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: vendor1
> dc-version: 1.1.18-11.el7_5.2-2b07d5c5a9
> have-watchdog: false
> no-quorum-policy: ignore
> stonith-enabled: false
>
> Quorum:
> Options:
>
> COROSYNC CONFIG FILE:
>
> [root@DVTVM0302 corosync]# cat corosync.conf
> totem {
> version: 2
> cluster_name: vendor1
> secauth: off
> transport: udpu
> }
>
> nodelist {
> node {
> ring0_addr: dvtvm0302
> nodeid: 1
> }
>
> node {
> ring0_addr: dvtvm0303
> nodeid: 2
> }
> }
>
> quorum {
> provider: corosync_votequorum
> two_node: 1
> }
>
> logging {
> to_logfile: yes
> logfile: /var/log/cluster/corosync.log
> to_syslog: yes
> }
>
>
> LOGFILE:
> /var/log/cluster/corosync.log
>
Logs from the other node (dvtvm0303) are probably needed too - it becomes DC
after the reconnect and so makes all the decisions. Its logs may contain the
reasons why it decides to stop resources.
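If it helps, something like the following collects logs and the CIB from both
nodes into one archive; the time window is only an example taken from the
timestamps in your snippet, and crm_report needs ssh access to the peer:

  # run as root on one node; crm_report also pulls data from the other node
  crm_report -f "2018-07-20 07:40:00" -t "2018-07-20 08:00:00" /tmp/reconnect-test
  # quick check of which node is DC after the reconnect
  crm_mon -1 | grep "Current DC"
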
...
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: warning:
> crmd_ha_msg_filter: Another DC detected: dvtvm0303 (op=noop)
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: notice:
> do_state_transition: State transition S_IDLE -> S_ELECTION |
> input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: info:
> update_dc: Unset DC. Was dvtvm0302
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: info:
> election_count_vote: Election 6 (owner: 2) lost: vote from dvtvm0303
> (Uptime)
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: notice:
> do_state_transition: State transition S_ELECTION -> S_RELEASE_DC |
> input=I_RELEASE_DC cause=C_FSA_INTERNAL origin=do_election_count_vote
> Jul 20 07:46:49 [1569] DVTVM0302.mse.am.mot.com crmd: info:
> do_dc_release: DC role released
...