Hi,
I have an Active/Passive configuration with a DRBD master/slave resource:
MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Fri Sep 16 14:41:18 2016  Last change: Fri Sep 16 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS02 ]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
 shared_fs (ocf::heartbeat:Filesystem): Started MDA1PFP-PCS02

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full
 Master: drbd1_sync
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd1 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=shared_fs
   Operations: start interval=0s timeout=240 (drbd1-start-interval-0s)
               promote interval=0s timeout=90 (drbd1-promote-interval-0s)
               demote interval=0s timeout=90 (drbd1-demote-interval-0s)
               stop interval=0s timeout=100 (drbd1-stop-interval-0s)
               monitor interval=60s (drbd1-monitor-interval-60s)
 Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=192.168.120.20 cidr_netmask=32 nic=bond0
  Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s)
              stop interval=0s timeout=20s (mda-ip-stop-interval-0s)
              monitor interval=1s (mda-ip-monitor-interval-1s)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 attempts=3
   Operations: start interval=0s timeout=60 (ping-start-interval-0s)
               stop interval=0s timeout=20 (ping-stop-interval-0s)
               monitor interval=1 (ping-monitor-interval-1)
 Resource: ACTIVE (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (ACTIVE-start-interval-0s)
              stop interval=0s timeout=20 (ACTIVE-stop-interval-0s)
              monitor interval=10 timeout=20 (ACTIVE-monitor-interval-10)
 Resource: shared_fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/shared_fs fstype=xfs
  Operations: start interval=0s timeout=60 (shared_fs-start-interval-0s)
              stop interval=0s timeout=60 (shared_fs-stop-interval-0s)
              monitor interval=20 timeout=40 (shared_fs-monitor-interval-20)
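For completeness, the resources were created with commands along these lines (reconstructed from memory, so the original invocations may have differed in detail):

  # DRBD resource plus the master/slave wrapper
  pcs resource create drbd1 ocf:linbit:drbd drbd_resource=shared_fs op monitor interval=60s
  pcs resource master drbd1_sync drbd1 master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  # virtual IP, ping clone, dummy marker and the filesystem on top of drbd1
  pcs resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
  pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 attempts=3 op monitor interval=1 --clone
  pcs resource create ACTIVE ocf:heartbeat:Dummy
  pcs resource create shared_fs ocf:heartbeat:Filesystem device=/dev/drbd1 directory=/shared_fs fstype=xfs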
MDA1PFP-S01 14:41:35 1805 0 ~ # pcs constraint --full
Location Constraints:
  Resource: mda-ip
    Enabled on: MDA1PFP-PCS01 (score:50) (id:location-mda-ip-MDA1PFP-PCS01-50)
    Constraint: location-mda-ip
      Rule: score=-INFINITY boolean-op=or (id:location-mda-ip-rule)
        Expression: pingd lt 1 (id:location-mda-ip-rule-expr)
        Expression: not_defined pingd (id:location-mda-ip-rule-expr-1)
Ordering Constraints:
  start ping-clone then start mda-ip (kind:Optional) (id:order-ping-clone-mda-ip-Optional)
  promote drbd1_sync then start shared_fs (kind:Mandatory) (id:order-drbd1_sync-shared_fs-mandatory)
Colocation Constraints:
  ACTIVE with mda-ip (score:INFINITY) (id:colocation-ACTIVE-mda-ip-INFINITY)
  drbd1_sync with mda-ip (score:INFINITY) (rsc-role:Master) (with-rsc-role:Started) (id:colocation-drbd1_sync-mda-ip-INFINITY)
  shared_fs with drbd1_sync (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-shared_fs-drbd1_sync-INFINITY)
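The constraints were set up roughly like this (again reconstructed, the exact forms may have differed):

  pcs constraint location mda-ip prefers MDA1PFP-PCS01=50
  pcs constraint location mda-ip rule score=-INFINITY pingd lt 1 or not_defined pingd
  pcs constraint order start ping-clone then start mda-ip kind=Optional
  pcs constraint order promote drbd1_sync then start shared_fs
  pcs constraint colocation add ACTIVE with mda-ip INFINITY
  pcs constraint colocation add master drbd1_sync with mda-ip INFINITY
  pcs constraint colocation add shared_fs with master drbd1_sync INFINITY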
The cluster starts fine, except that the resources do not start on the preferred host; I asked about that in a separate question to keep things separated.
The status after starting is:
Last updated: Fri Sep 16 14:39:57 2016  Last change: Fri Sep 16 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS02 ]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
 shared_fs (ocf::heartbeat:Filesystem): Started MDA1PFP-PCS02
From this state, I ran two tests to simulate a cluster failover (the commands are sketched below):
1. Shut down the cluster node running the master with pcs cluster stop
2. Take down the network device carrying the virtual IP with ifdown and wait until ping detects it
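For reproduction, the test commands were roughly the following (node and interface names as in the configuration above; I may have used slight variations):

  # test 1: stop the cluster stack on the node currently running the master
  pcs cluster stop MDA1PFP-PCS02
  # test 2: on the master node, take down the interface that carries the virtual IP
  ifdown bond0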
In both cases the failover is executed, but DRBD is not promoted to master on the new active node:
Last updated: Fri Sep 16 14:43:33 2016  Last change: Fri Sep 16 14:43:31 2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 ]
OFFLINE: [ MDA1PFP-PCS02 ]

 Master/Slave Set: drbd1_sync [drbd1]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS01
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 ]
 ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS01
I was able to trace this to the fencing settings in the DRBD configuration:
MDA1PFP-S01 14:41:44 1806 0 ~ # cat /etc/drbd.d/shared_fs.res
resource shared_fs {
    disk /dev/mapper/rhel_mdaf--pf--pep--1-drbd;
    disk {
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    device /dev/drbd1;
    meta-disk internal;
    on MDA1PFP-S01 {
        address 192.168.123.10:7789;
    }
    on MDA1PFP-S02 {
        address 192.168.123.11:7789;
    }
}
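My understanding of the mechanism (from reading crm-fence-peer.sh, so please treat this as an assumption): with fencing resource-only, DRBD calls the fence-peer handler when it loses the replication link, and the handler writes a temporary location constraint into the CIB that forbids the Master role on every node except the one it considers to have up-to-date data; crm-unfence-peer.sh is supposed to remove that constraint again after a successful resync. Right after the failover it should show up in the constraint list, roughly like this (ids and values sketched from memory, not copied from my cluster):

  MDA1PFP-S01 ~ # pcs constraint --full | grep -A 2 drbd-fence
    Constraint: drbd-fence-by-handler-shared_fs-drbd1_sync
      Rule: score=-INFINITY role=Master
        Expression: #uname ne <node name as seen by the handler>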
If I disable the fencing scripts, everything works as expected. With them enabled, no node is promoted to master after the failover. I captured the complete log from /var/log/messages from cluster start to failover, in case that helps:
MDA1PFP-S01 14:48:37 1807 0 ~ # cat /var/log/messages
Sep 16 14:40:16 MDA1PFP-S01 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="13857" x-info="http://www.rsyslog.com"] start
Sep 16 14:40:16 MDA1PFP-S01 rsyslogd-2221: module 'imuxsock' already in this
config, cannot be added
[try http://www.rsyslog.com/e/2221 ]
Sep 16 14:40:16 MDA1PFP-S01 systemd: Stopping System Logging Service...
Sep 16 14:40:16 MDA1PFP-S01 systemd: Starting System Logging Service...
Sep 16 14:40:16 MDA1PFP-S01 systemd: Started System Logging Service.
Sep 16 14:40:27 MDA1PFP-S01 systemd: Started Corosync Cluster Engine.
Sep 16 14:40:27 MDA1PFP-S01 systemd: Started Pacemaker High Availability
Cluster Manager.
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation ACTIVE_start_0: ok
(node=MDA1PFP-PCS01, call=33, rc=0, cib-update=22, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=32, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO: Adding inet address
192.168.120.20/32 with broadcast address 192.168.120.255 to device bond0
Sep 16 14:43:30 MDA1PFP-S01 avahi-daemon[912]: Registering new address record
for 192.168.120.20 on bond0.IPv4.
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO: Bringing device bond0
up
Sep 16 14:43:30 MDA1PFP-S01 kernel: block drbd1: peer( Primary -> Secondary )
Sep 16 14:43:30 MDA1PFP-S01 IPaddr2(mda-ip)[15321]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-192.168.120.20 bond0 192.168.120.20 auto
not_used not_used
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation mda-ip_start_0: ok
(node=MDA1PFP-PCS01, call=35, rc=0, cib-update=24, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=36, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=38, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: peer( Secondary -> Unknown
) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: ack_receiver terminated
Sep 16 14:43:30 MDA1PFP-S01 kernel: drbd shared_fs: Terminating drbd_a_shared_f
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: Connection closed
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: conn( TearDown ->
Unconnected )
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: receiver terminated
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: Restarting receiver thread
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: receiver (re)started
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: conn( Unconnected ->
WFConnection )
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=39, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=40, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: helper command:
/sbin/drbdadm fence-peer shared_fs
Sep 16 14:43:31 MDA1PFP-S01 crm-fence-peer.sh[15569]: invoked for shared_fs
Sep 16 14:43:31 MDA1PFP-S01 crm-fence-peer.sh[15569]: INFO peer is not
reachable, my disk is UpToDate: placed constraint
'drbd-fence-by-handler-shared_fs-drbd1_sync'
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: helper command:
/sbin/drbdadm fence-peer shared_fs exit code 5 (0x500)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: fence-peer helper returned
5 (peer is unreachable, assumed to be dead)
Sep 16 14:43:31 MDA1PFP-S01 kernel: drbd shared_fs: pdsk( DUnknown -> Outdated )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: role( Secondary -> Primary )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: new current UUID
B1FC3E9C008711DD:C02542C7B26F9B28:BCC6102B1FD69768:BCC5102B1FD69768
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_promote_0: ok
(node=MDA1PFP-PCS01, call=41, rc=0, cib-update=26, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=42, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Our peer on the DC
(MDA1PFP-PCS02) is dead
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition S_NOT_DC ->
S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK
origin=peer_update_callback ]
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition S_ELECTION
-> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED
origin=election_timeout_popped ]
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: crm_update_peer_proc: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Removing all MDA1PFP-PCS02
attributes for attrd_peer_change_cb
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Lost attribute writer
MDA1PFP-PCS02
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Removing MDA1PFP-PCS02/2
from the membership list
Sep 16 14:43:31 MDA1PFP-S01 attrd[13128]: notice: Purged 1 peers with id=2
and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: crm_update_peer_proc:
Node MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: Removing
MDA1PFP-PCS02/2 from the membership list
Sep 16 14:43:31 MDA1PFP-S01 stonith-ng[13125]: notice: Purged 1 peers with
id=2 and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: crm_update_peer_proc: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: Removing MDA1PFP-PCS02/2 from
the membership list
Sep 16 14:43:31 MDA1PFP-S01 cib[13124]: notice: Purged 1 peers with id=2
and/or uname=MDA1PFP-PCS02 from the membership cache
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: warning: FSA: Input I_ELECTION_DC from
do_election_check() received in state S_INTEGRATION
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Notifications disabled
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: On loss of CCM Quorum:
Ignore
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Demote drbd1:0 (Master
-> Slave MDA1PFP-PCS01)
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Calculated Transition 0:
/var/lib/pacemaker/pengine/pe-input-414.bz2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 55: notify
drbd1_pre_notify_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=43, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 8: demote
drbd1_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: role( Primary -> Secondary )
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: bitmap WRITE of 0 pages took 0
jiffies
Sep 16 14:43:31 MDA1PFP-S01 kernel: block drbd1: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Sep 16 14:43:31 MDA1PFP-S01 systemd-udevd: error: /dev/drbd1: Wrong medium type
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_demote_0: ok
(node=MDA1PFP-PCS01, call=44, rc=0, cib-update=49, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 56: notify
drbd1_post_notify_demote_0 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Operation drbd1_notify_0: ok
(node=MDA1PFP-PCS01, call=45, rc=0, cib-update=0, confirmed=true)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Initiating action 10: monitor
drbd1_monitor_60000 on MDA1PFP-PCS01 (local)
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [TOTEM ] A new membership
(192.168.121.10:988) was formed. Members left: 2
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [QUORUM] Members[1]: 1
Sep 16 14:43:31 MDA1PFP-S01 corosync[13019]: [MAIN ] Completed service
synchronization, ready to provide service.
Sep 16 14:43:31 MDA1PFP-S01 pacemakerd[13113]: notice: crm_reap_unseen_nodes:
Node MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: crm_reap_unseen_nodes: Node
MDA1PFP-PCS02[2] - state is now lost (was member)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: warning: No match for shutdown action
on 2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Stonith/shutdown of
MDA1PFP-PCS02 not matched
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition aborted: Node
failure (source=peer_update_callback:252, 0)
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: error: pcmkRegisterNode: Triggered
assert at xml.c:594 : node->type == XML_ELEMENT_NODE
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition 0 (Complete=10,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-414.bz2): Complete
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: On loss of CCM Quorum:
Ignore
Sep 16 14:43:31 MDA1PFP-S01 pengine[13129]: notice: Calculated Transition 1:
/var/lib/pacemaker/pengine/pe-input-415.bz2
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: Transition 1 (Complete=0,
Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-415.bz2): Complete
Sep 16 14:43:31 MDA1PFP-S01 crmd[13130]: notice: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]
Sep 16 14:48:48 MDA1PFP-S01 chronyd[909]: Source 62.116.162.126 replaced with
46.182.19.75
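For what it is worth, the lines that look most relevant to me are the ones where crm-fence-peer.sh places the constraint 'drbd-fence-by-handler-shared_fs-drbd1_sync' (fence-peer helper exit code 5, peer assumed dead) and the pengine decision shortly afterwards to demote drbd1 on MDA1PFP-PCS01 again. I assume that as long as this constraint is in the CIB no node can be promoted; if so, it should be possible to confirm that by listing and removing it by hand (I have not tried this yet, so this is only a guess):

  # list any left-over fence constraint
  pcs constraint --full | grep drbd-fence
  # remove it by id; as far as I understand, this is what crm-unfence-peer.sh would normally do after a resync
  pcs constraint remove drbd-fence-by-handler-shared_fs-drbd1_sync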
Any help appreciated,
Jens
--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
[email protected]