Hi,
I have an Active/Passive configuration with a DRBD master/slave resource:
MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Fri Sep 16 14:41:18 2016  Last change: Fri Sep 16 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS02 ]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
 shared_fs (ocf::heartbeat:Filesystem): Started MDA1PFP-PCS02

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full
 Master: drbd1_sync
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd1 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=shared_fs
   Operations: start interval=0s timeout=240 (drbd1-start-interval-0s)
               promote interval=0s timeout=90 (drbd1-promote-interval-0s)
               demote interval=0s timeout=90 (drbd1-demote-interval-0s)
               stop interval=0s timeout=100 (drbd1-stop-interval-0s)
               monitor interval=60s (drbd1-monitor-interval-60s)
 Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=192.168.120.20 cidr_netmask=32 nic=bond0
  Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s)
              stop interval=0s timeout=20s (mda-ip-stop-interval-0s)
              monitor interval=1s (mda-ip-monitor-interval-1s)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 attempts=3
   Operations: start interval=0s timeout=60 (ping-start-interval-0s)
               stop interval=0s timeout=20 (ping-stop-interval-0s)
               monitor interval=1 (ping-monitor-interval-1)
 Resource: ACTIVE (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (ACTIVE-start-interval-0s)
              stop interval=0s timeout=20 (ACTIVE-stop-interval-0s)
              monitor interval=10 timeout=20 (ACTIVE-monitor-interval-10)
 Resource: shared_fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/shared_fs fstype=xfs
  Operations: start interval=0s timeout=60 (shared_fs-start-interval-0s)
              stop interval=0s timeout=60 (shared_fs-stop-interval-0s)
              monitor interval=20 timeout=40 (shared_fs-monitor-interval-20)
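For completeness, the resources were created with commands along these lines (reconstructed from memory, so the original invocations may have differed in detail):

  # DRBD resource plus the master/slave wrapper
  pcs resource create drbd1 ocf:linbit:drbd drbd_resource=shared_fs op monitor interval=60s
  pcs resource master drbd1_sync drbd1 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  # virtual IP, ping clone, dummy marker and the filesystem on top of drbd1
  pcs resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
  pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 attempts=3 op monitor interval=1 --clone
  pcs resource create ACTIVE ocf:heartbeat:Dummy
  pcs resource create shared_fs ocf:heartbeat:Filesystem device=/dev/drbd1 directory=/shared_fs fstype=xfs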
MDA1PFP-S01 14:41:35 1805 0 ~ # pcs constraint --full
Location Constraints:
  Resource: mda-ip
    Enabled on: MDA1PFP-PCS01 (score:50) (id:location-mda-ip-MDA1PFP-PCS01-50)
    Constraint: location-mda-ip
      Rule: score=-INFINITY boolean-op=or (id:location-mda-ip-rule)
        Expression: pingd lt 1 (id:location-mda-ip-rule-expr)
        Expression: not_defined pingd (id:location-mda-ip-rule-expr-1)
Ordering Constraints:
  start ping-clone then start mda-ip (kind:Optional) (id:order-ping-clone-mda-ip-Optional)
  promote drbd1_sync then start shared_fs (kind:Mandatory) (id:order-drbd1_sync-shared_fs-mandatory)
Colocation Constraints:
  ACTIVE with mda-ip (score:INFINITY) (id:colocation-ACTIVE-mda-ip-INFINITY)
  drbd1_sync with mda-ip (score:INFINITY) (rsc-role:Master) (with-rsc-role:Started) (id:colocation-drbd1_sync-mda-ip-INFINITY)
  shared_fs with drbd1_sync (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-shared_fs-drbd1_sync-INFINITY)
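The constraints were set up roughly like this (again reconstructed, the exact forms may have differed):

  pcs constraint location mda-ip prefers MDA1PFP-PCS01=50
  pcs constraint location mda-ip rule score=-INFINITY pingd lt 1 or not_defined pingd
  pcs constraint order start ping-clone then start mda-ip kind=Optional
  pcs constraint order promote drbd1_sync then start shared_fs
  pcs constraint colocation add ACTIVE with mda-ip INFINITY
  pcs constraint colocation add master drbd1_sync with mda-ip INFINITY
  pcs constraint colocation add shared_fs with master drbd1_sync INFINITY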
The cluster starts fine, except that the resources do not start on the preferred host; I asked about that in a separate question to keep things separated.
The status after starting is:
Last updated: Fri Sep 16 14:39:57 2016  Last change: Fri Sep 16 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS02 ]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
 shared_fs (ocf::heartbeat:Filesystem): Started MDA1PFP-PCS02
From this state, I ran two tests to simulate a cluster failover (the commands are sketched below):
1. Shut down the cluster node running the master with pcs cluster stop
2. Take down the network device carrying the virtual IP with ifdown and wait until ping detects it
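For reproduction, the test commands were roughly the following (node and interface names as in the configuration above; I may have used slight variations):

  # test 1: stop the cluster stack on the node currently running the master
  pcs cluster stop MDA1PFP-PCS02
  # test 2: on the master node, take down the interface that carries the virtual IP
  ifdown bond0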
In both cases the failover is executed, but DRBD is not promoted to master on the new active node:
Last updated: Fri Sep 16 14:43:33 2016  Last change: Fri Sep 16 14:43:31 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 ]
OFFLINE: [ MDA1PFP-PCS02 ]

 Master/Slave Set: drbd1_sync [drbd1]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS01
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS01
I was able to trace this to the fencing settings in the DRBD configuration:
MDA1PFP-S01 14:41:44 1806 0 ~ # cat /etc/drbd.d/shared_fs.res
resource shared_fs {
    disk /dev/mapper/rhel_mdaf--pf--pep--1-drbd;
    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    device /dev/drbd1;
    meta-disk internal;
    on MDA1PFP-S01 {
        address 192.168.123.10:7789;
    }
    on MDA1PFP-S02 {
        address 192.168.123.11:7789;
    }
}
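My understanding of the mechanism (from reading crm-fence-peer.sh, so please treat this as an assumption): with fencing resource-only, DRBD calls the fence-peer handler when it loses the replication link, and the handler writes a temporary location constraint into the CIB that forbids the Master role on every node except the one it considers to have up-to-date data; crm-unfence-peer.sh is supposed to remove that constraint again after a successful resync. Right after the failover it should show up in the constraint list, roughly like this (ids and values sketched from memory, not copied from my cluster):

  MDA1PFP-S01 ~ # pcs constraint --full | grep -A 2 drbd-fence
    Constraint: drbd-fence-by-handler-shared_fs-drbd1_sync
      Rule: score=-INFINITY role=Master
        Expression: #uname ne <node name as seen by the handler>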
If I disable the fencing scripts, everything works as expected. With them enabled, no node is promoted to master after the failover. I captured the complete log from /var/log/messages from cluster start to failover, in case that helps:
MDA1PFP-S01 14:48:37 1807 0 ~ # cat /var/log/messages
Sep 16 14:40:16 MDA1PFP-S01 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="13857" x-info="http://www.rsyslog.com"] start
Sep 16 14:40:16 MDA1PFP-S01 rsyslogd-2221: module 'imuxsock' already in this
config, cannot be added
[try http://www.rsyslog.com/e/2221 ]
Sep 16 14:40:16 MDA1PFP-S01 systemd: Stopping System Logging Service...
Sep 16 14:40:16 MDA1PFP-S01 systemd: Starting System Logging Service...
Sep 16 14:40:16 MDA1PFP-S01 systemd: Started System Logging Service.
Sep 16 14:40:27 MDA1PFP-S01 systemd: Started Corosync Cluster Engine.
Sep 16 14:40:27 MDA1PFP-S01 systemd: Started Pacemaker High Availability
Cluster Manager.
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation ACTIVE_start_0: ok
(node=MDA1PFP-PCS01, call=33, rc=0, cib-update=22, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=32, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO: Adding inet address
192.168.120.20/32 with broadcast address 192.168.120.255 to device bond0
Sep 16 14:43:30 MDA1PFP-S01 avahi-daemon[912]: Registering new address record
for 192.168.120.20 on bond0.IPv4.
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO: Bringing device bond0
up
Sep 16 14:43:30 MDA1PFP-S01 kernel: block drbd1: peer( Primary -> Secondary )
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-192.168.120.20 bond0 192.168.120.20 auto
not_used not_used
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation mda-ip_start_0: ok
(node=MDA1PFP-PCS01, call=35, rc=0, cib-update=24, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=36, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=38, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: peer( Secondary -> Unknown
) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: ack_receiver terminated
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: Terminating drbd_a_shared_f
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: Connection closed
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: conn( TearDown ->
Unconnected )
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: receiver terminated
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: Restarting receiver thread
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: receiver (re)started
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: conn( Unconnected ->
WFConnection )
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=39, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=40, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: helper command:
/sbin/drbdadm fence-peer shared_fs
Sep 16 14:43:31 MDA1PFP-S01 crm-fence-peer.sh[15569]: invoked for shared_fs
Sep 16 14:43:31 MDA1PFP-S01 crm-fence-peer.sh[15569]: INFO peer is not
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-shared_fs-drbd1_sync'
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: helper command:
/sbin/drbdadm fence-peer shared_fs exit code 5 (0x500)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: fence-peer helper returned
5 (peer is unreachable, assumed to be dead)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: pdsk( DUnknown -> Outdated )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: role( Secondary -> Primary )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: new current UUID
B1FC3E9C008711DD:C02542C7B26F9B28:BCC6102B1FD69768:BCC5102B1FD69768
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_promote_0: ok
(node=MDA1PFP-PCS01, call=41, rc=0, cib-update=26, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=42, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Our peer on the DC
(MDA1PFP-PCS02) is dead
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition S_NOT_DC ->
S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK
origin=peer_update_callback ]
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition S_ELECTION
-> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED
origin=election_timeout_popped ]
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: crm_update_peer_proc: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Removing all MDA1PFP-PCS02
attributes for attrd_peer_change_cb
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Lost attribute writer
MDA1PFP-PCS02
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Removing MDA1PFP-PCS02/2
from the membership list
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Purged 1 peers with id=2
and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: crm_update_peer_proc:
Node MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: Removing
MDA1PFP-PCS02/2 from the membership list
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: Purged 1 peers with
id=2 and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: crm_update_peer_proc: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: Removing MDA1PFP-PCS02/2 from
the membership list
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: Purged 1 peers with id=2
and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: warning: FSA: Input I_ELECTION_DC from
do_election_check() received in state S_INTEGRATION
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Notifications disabled
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: On loss of CCM Quorum:
Ignore
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Demote drbd1:0 (Master
-> Slave MDA1PFP-PCS01)
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Calculated Transition 0:
/var/lib/pacemaker/pengine/pe-input-414.bz2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 55: notify
drbd1_pre_notify_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=43, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 8: demote
drbd1_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: role( Primary -> Secondary )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: bitmap WRITE of 0 pages took 0
jiffies
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Sep 16 14:43:31 MDA1PFP-S01 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_demote_0: ok
(node=MDA1PFP-PCS01, call=44, rc=0, cib-update=49, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 56: notify
drbd1_post_notify_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=45, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 10: monitor
drbd1_monitor_60000 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [TOTEM ] A new membership
(192.168.121.10:988) was formed. Members left: 2
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [QUORUM] Members[1]: 1
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 16 14:43:31 MDA1PFP-S01 pacemakerd[13113]: notice: crm_reap_unseen_nodes:
Node MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: crm_reap_unseen_nodes: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: warning: No match for shutdown action
on 2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Stonith/shutdown of
MDA1PFP-PCS02 not matched
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition aborted: Node
failure (source=peer_update_callback:252, 0)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition 0 (Complete=10,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-414.bz2): Complete
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: On loss of CCM Quorum:
Ignore
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Calculated Transition 1:
/var/lib/pacemaker/pengine/pe-input-415.bz2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition 1 (Complete=0,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-415.bz2): Complete
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]
Sep 16 14:48:48 MDA1PFP-S01 chronyd[909]: Source 62.116.162.126 replaced with
46.182.19.75
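For what it is worth, the lines that look most relevant to me are the ones where crm-fence-peer.sh places the constraint 'drbd-fence-by-handler-shared_fs-drbd1_sync' (fence-peer helper exit code 5, peer assumed dead) and the pengine decision shortly afterwards to demote drbd1 on MDA1PFP-PCS01 again. I assume that as long as this constraint is in the CIB no node can be promoted; if so, it should be possible to confirm that by listing and removing it by hand (I have not tried this yet, so this is only a guess):

  # list any left-over fence constraint
  pcs constraint --full | grep drbd-fence
  # remove it by id; as far as I understand, this is what crm-unfence-peer.sh would normally do after a resync
  pcs constraint remove drbd-fence-by-handler-shared_fs-drbd1_sync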
Any help appreciated,
Jens
--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
[email protected]