Hi
I have a basic 2-node active/passive cluster with Pacemaker (1.1.14, pcs
0.9.148) / CMAN (3.0.12.1) / Corosync (1.4.7) on RHEL 6.8.
This cluster runs NFS on top of DRBD (8.4.4).
Basically, the system works on both nodes and I can switch the resources
from one node to the other.
But the switch does not work if I move just one resource and expect the
others to follow due to the colocation constraints.
From the logged messages I see that in this "failure case" there is NO attempt
to demote/promote the DRBD clone resource.
Here is my setup:
==================================================================
Cluster Name: clst1
Corosync Nodes:
 ventsi-clst1-sync ventsi-clst2-sync
Pacemaker Nodes:
 ventsi-clst1-sync ventsi-clst2-sync

Resources:
 Resource: IPaddrNFS (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=xxx.xxx.xxx.xxx cidr_netmask=24
  Operations: start interval=0s timeout=20s (IPaddrNFS-start-interval-0s)
              stop interval=0s timeout=20s (IPaddrNFS-stop-interval-0s)
              monitor interval=5s (IPaddrNFS-monitor-interval-5s)
 Resource: NFSServer (class=ocf provider=heartbeat type=nfsserver)
  Attributes: nfs_shared_infodir=/var/lib/nfsserversettings/ nfs_ip=xxx.xxx.xxx.xxx nfsd_args="-H xxx.xxx.xxx.xxx"
  Operations: start interval=0s timeout=40 (NFSServer-start-interval-0s)
              stop interval=0s timeout=20s (NFSServer-stop-interval-0s)
              monitor interval=10s timeout=20s (NFSServer-monitor-interval-10s)
 Master: DRBDClone
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: DRBD (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=nfsdata
   Operations: start interval=0s timeout=240 (DRBD-start-interval-0s)
               promote interval=0s timeout=90 (DRBD-promote-interval-0s)
               demote interval=0s timeout=90 (DRBD-demote-interval-0s)
               stop interval=0s timeout=100 (DRBD-stop-interval-0s)
               monitor interval=1s timeout=5 (DRBD-monitor-interval-1s)
 Resource: DRBD_global_clst (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/drbdmnts/global_clst fstype=ext4
  Operations: start interval=0s timeout=60 (DRBD_global_clst-start-interval-0s)
              stop interval=0s timeout=60 (DRBD_global_clst-stop-interval-0s)
              monitor interval=20 timeout=40 (DRBD_global_clst-monitor-interval-20)

Stonith Devices:
 Resource: ipmi-fence-clst1 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot ipaddr=yyy.yyy.yyy.yyy pcmk_host_check=static-list pcmk_host_list=ventsi-clst1-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst1-monitor-interval-60s)
 Resource: ipmi-fence-clst2 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot ipaddr=zzz.zzz.zzz.zzz pcmk_host_check=static-list pcmk_host_list=ventsi-clst2-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: ipmi-fence-clst1
    Disabled on: ventsi-clst1-sync (score:-INFINITY) (id:location-ipmi-fence-clst1-ventsi-clst1-sync--INFINITY)
  Resource: ipmi-fence-clst2
    Disabled on: ventsi-clst2-sync (score:-INFINITY) (id:location-ipmi-fence-clst2-ventsi-clst2-sync--INFINITY)
Ordering Constraints:
  start IPaddrNFS then start NFSServer (kind:Mandatory) (id:order-IPaddrNFS-NFSServer-mandatory)
  promote DRBDClone then start DRBD_global_clst (kind:Mandatory) (id:order-DRBDClone-DRBD_global_clst-mandatory)
  start DRBD_global_clst then start IPaddrNFS (kind:Mandatory) (id:order-DRBD_global_clst-IPaddrNFS-mandatory)
Colocation Constraints:
  NFSServer with IPaddrNFS (score:INFINITY) (id:colocation-NFSServer-IPaddrNFS-INFINITY)
  DRBD_global_clst with DRBDClone (score:INFINITY) (id:colocation-DRBD_global_clst-DRBDClone-INFINITY)
  IPaddrNFS with DRBD_global_clst (score:INFINITY) (id:colocation-IPaddrNFS-DRBD_global_clst-INFINITY)

Resources Defaults:
 resource-stickiness: INFINITY
Operations Defaults:
 timeout: 10s

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6-70404b0
 have-watchdog: false
 last-lrm-refresh: 1478277432
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true
==================================================================
The initial state is, for example, this (all resources on node 1):
Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1 (stonith:fence_ipmilan): Started ventsi-clst2-sync
 ipmi-fence-clst2 (stonith:fence_ipmilan): Started ventsi-clst1-sync
 IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync
 NFSServer (ocf::heartbeat:nfsserver): Started ventsi-clst1-sync
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst (ocf::heartbeat:Filesystem): Started ventsi-clst1-sync
==================================================================
If I shut down the cluster on node 1 ('pcs cluster stop') or move the DRBD
clone resource ('pcs resource move DRBDClone'), all resources switch
successfully to node 2.
I.e. the demote/promote of the DRBD clone resource works in these cases.
But if I try to move any other resource (e.g. 'pcs resource move NFSServer'),
the resources NFSServer, IPaddrNFS and DRBD_global_clst are stopped on node 1,
and the very next step is an attempt to start DRBD_global_clst on node 2,
which fails because DRBD was never demoted on node 1 and promoted on node 2.
As far as I can see, a follow-up transition then partially repairs things:
the resources are started again on node 1, except for the resource that I
asked to move.
The final state looks like this:
Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1 (stonith:fence_ipmilan): Started ventsi-clst2-sync
 ipmi-fence-clst2 (stonith:fence_ipmilan): Started ventsi-clst1-sync
 IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync
 NFSServer (ocf::heartbeat:nfsserver): Stopped
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst (ocf::heartbeat:Filesystem): Started ventsi-clst1-sync

Failed Actions:
* DRBD_global_clst_start_0 on ventsi-clst2-sync 'unknown error' (1): call=778, status=complete, exitreason='none',
    last-rc-change='Fri Nov 4 19:32:56 2016', queued=0ms, exec=43ms
==================================================================
Here are the logged messages for this "failure case":
2016-11-04T19:32:55.163982+01:00 ventsi-clst1 crmd[6116]: notice: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
2016-11-04T19:32:55.168100+01:00 ventsi-clst1 pengine[6115]: notice: On loss
of CCM Quorum: Ignore
2016-11-04T19:32:55.181252+01:00 ventsi-clst1 pengine[6115]: notice: Move
IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181260+01:00 ventsi-clst1 pengine[6115]: notice: Move
NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181278+01:00 ventsi-clst1 pengine[6115]: notice: Move
DRBD_global_clst#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
<=== here no demote/promote is listed
2016-11-04T19:32:55.182385+01:00 ventsi-clst1 pengine[6115]: notice:
Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2
2016-11-04T19:32:55.182998+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 15: stop NFSServer_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:55.196265+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]:
INFO: Stopping NFS server ...
2016-11-04T19:32:55.249137+01:00 ventsi-clst1 kernel: nfsd: last server has
exited, flushing export cache
2016-11-04T19:32:55.252241+01:00 ventsi-clst1 rpc.mountd[15282]: Caught signal
15, un-registering and exiting.
2016-11-04T19:32:55.632708+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]:
INFO: Stopping sm-notify
2016-11-04T19:32:55.650552+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]:
INFO: Stopping rpc.statd
2016-11-04T19:32:55.666777+01:00 ventsi-clst1 rpc.statd[15243]: Caught signal
15, un-registering and exiting
2016-11-04T19:32:56.692819+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]:
INFO: NFS server stopped
2016-11-04T19:32:56.695523+01:00 ventsi-clst1 crmd[6116]: notice: Operation
NFSServer_stop_0: ok (node=ventsi-clst1-sync, call=1220, rc=0, cib-update=1695,
confirmed=true)
2016-11-04T19:32:56.696243+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 12: stop IPaddrNFS_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.727882+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16108]: INFO:
IP status = ok, IP_CIP=
2016-11-04T19:32:56.733383+01:00 ventsi-clst1 crmd[6116]: notice: Operation
IPaddrNFS_stop_0: ok (node=ventsi-clst1-sync, call=1222, rc=0, cib-update=1696,
confirmed=true)
2016-11-04T19:32:56.733917+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 48: stop DRBD_global_clst_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.757181+01:00 ventsi-clst1
Filesystem(DRBD_global_clst)[16163]: INFO: Running stop for /dev/drbd1 on
/drbdmnts/global_clst
2016-11-04T19:32:56.764684+01:00 ventsi-clst1
Filesystem(DRBD_global_clst)[16163]: INFO: Trying to unmount
/drbdmnts/global_clst
2016-11-04T19:32:56.771260+01:00 ventsi-clst1
Filesystem(DRBD_global_clst)[16163]: INFO: unmounted /drbdmnts/global_clst
successfully
2016-11-04T19:32:56.776640+01:00 ventsi-clst1 crmd[6116]: notice: Operation
DRBD_global_clst_stop_0: ok (node=ventsi-clst1-sync, call=1224, rc=0,
cib-update=1697, confirmed=true)
2016-11-04T19:32:56.777140+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 49: start DRBD_global_clst_start_0 on ventsi-clst2-sync
<=== here is the attempt to start the filesystem on the other node, although
DRBD has not yet been promoted there
2016-11-04T19:32:56.840137+01:00 ventsi-clst1 crmd[6116]: warning: Action 49
(DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1):
Error
2016-11-04T19:32:56.840158+01:00 ventsi-clst1 crmd[6116]: notice: Transition
aborted by DRBD_global_clst_start_0 'modify' on ventsi-clst2-sync: Event failed
(magic=0:1;49:202:0:b7941532-c74b-40cc-a8ad-27b5502b8fba, cib=0.649.4,
source=match_graph_event:381, 0)
2016-11-04T19:32:56.840232+01:00 ventsi-clst1 crmd[6116]: warning: Action 49
(DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1):
Error
2016-11-04T19:32:56.840328+01:00 ventsi-clst1 crmd[6116]: notice: Transition
202 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=5,
Source=/var/lib/pacemaker/pengine/pe-input-766.bz2): Complete
2016-11-04T19:32:56.843693+01:00 ventsi-clst1 pengine[6115]: notice: On loss
of CCM Quorum: Ignore
2016-11-04T19:32:56.844072+01:00 ventsi-clst1 pengine[6115]: warning:
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown
error (1)
2016-11-04T19:32:56.844102+01:00 ventsi-clst1 pengine[6115]: warning:
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown
error (1)
2016-11-04T19:32:56.845071+01:00 ventsi-clst1 pengine[6115]: notice: Start
IPaddrNFS#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845078+01:00 ventsi-clst1 pengine[6115]: notice: Start
NFSServer#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845081+01:00 ventsi-clst1 pengine[6115]: notice: Demote
DRBD:0#011(Master -> Slave ventsi-clst1-sync)
<=== here is the necessary demote/promote, but it comes too late; the start
of the filesystem has already failed
2016-11-04T19:32:56.845083+01:00 ventsi-clst1 pengine[6115]: notice: Promote
DRBD:1#011(Slave -> Master ventsi-clst2-sync)
2016-11-04T19:32:56.845084+01:00 ventsi-clst1 pengine[6115]: notice: Recover
DRBD_global_clst#011(Started ventsi-clst2-sync)
2016-11-04T19:32:56.847986+01:00 ventsi-clst1 pengine[6115]: notice:
Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2
<=== ... but this transition is overtaken by the following one, which only
partially repairs things
2016-11-04T19:32:56.867679+01:00 ventsi-clst1 pengine[6115]: notice: On loss
of CCM Quorum: Ignore
2016-11-04T19:32:56.868074+01:00 ventsi-clst1 pengine[6115]: warning:
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown
error (1)
2016-11-04T19:32:56.868101+01:00 ventsi-clst1 pengine[6115]: warning:
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown
error (1)
2016-11-04T19:32:56.868287+01:00 ventsi-clst1 pengine[6115]: warning: Forcing
DRBD_global_clst away from ventsi-clst2-sync after 1000000 failures
(max=1000000)
2016-11-04T19:32:56.869011+01:00 ventsi-clst1 pengine[6115]: notice: Start
IPaddrNFS#011(ventsi-clst1-sync)
2016-11-04T19:32:56.869023+01:00 ventsi-clst1 pengine[6115]: notice: Recover
DRBD_global_clst#011(Started ventsi-clst2-sync -> ventsi-clst1-sync)
2016-11-04T19:32:56.869770+01:00 ventsi-clst1 pengine[6115]: notice:
Calculated Transition 204: /var/lib/pacemaker/pengine/pe-input-768.bz2
2016-11-04T19:32:56.870065+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 3: stop DRBD_global_clst_stop_0 on ventsi-clst2-sync
2016-11-04T19:32:56.908075+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 42: start DRBD_global_clst_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.931072+01:00 ventsi-clst1
Filesystem(DRBD_global_clst)[16242]: INFO: Running start for /dev/drbd1 on
/drbdmnts/global_clst
2016-11-04T19:32:56.943250+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): warning:
maximal mount count reached, running e2fsck is recommended
2016-11-04T19:32:56.953253+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): mounted
filesystem with ordered data mode. Opts:
2016-11-04T19:32:56.964284+01:00 ventsi-clst1 crmd[6116]: notice: Operation
DRBD_global_clst_start_0: ok (node=ventsi-clst1-sync, call=1225, rc=0,
cib-update=1701, confirmed=true)
2016-11-04T19:32:56.965104+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 10: start IPaddrNFS_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.965325+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 43: monitor DRBD_global_clst_monitor_20000 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.996235+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO:
Adding inet address xxx.xxx.xxx.xxx/24 with broadcast address xxx.xxx.xxx.255
to device bond0
2016-11-04T19:32:57.002059+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO:
Bringing device bond0 up
2016-11-04T19:32:57.008128+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-xxx.xxx.xxx.xxx bond0 xxx.xxx.xxx.xxx auto
not_used not_used
2016-11-04T19:32:57.020159+01:00 ventsi-clst1 crmd[6116]: notice: Operation
IPaddrNFS_start_0: ok (node=ventsi-clst1-sync, call=1226, rc=0,
cib-update=1703, confirmed=true)
2016-11-04T19:32:57.020901+01:00 ventsi-clst1 crmd[6116]: notice: Initiating
action 11: monitor IPaddrNFS_monitor_5000 on ventsi-clst1-sync (local)
2016-11-04T19:32:57.052231+01:00 ventsi-clst1 crmd[6116]: notice: Transition
204 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-768.bz2): Complete
2016-11-04T19:32:57.052251+01:00 ventsi-clst1 crmd[6116]: notice: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
==================================================================
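In case it helps with reading the log, the pengine decisions can be grouped by the transition they belong to. This is only a minimal Python sketch, not a Pacemaker tool: the names `actions_per_transition` and `has_role_change` and both regular expressions are my own assumptions, derived from the message format shown above (one message per line, as in the syslog file before the mail client wrapped it). The sample lines are copied from the log above.

```python
import re
from collections import OrderedDict

# Matches pengine decisions such as "notice: Move IPaddrNFS#011(Started ...)".
# "#011" is the syslog escape for the tab that follows the resource name.
ACTION_RE = re.compile(
    r"pengine\[\d+\]:\s+notice:\s+"
    r"(Move|Start|Stop|Demote|Promote|Recover|Restart)\s+"
    r"(\S+?)(?:#011|\s|\()"
)
TRANSITION_RE = re.compile(r"Calculated Transition (\d+):")


def actions_per_transition(lines):
    """Collect (verb, resource) pairs under the transition logged after them."""
    result = OrderedDict()
    pending = []
    for line in lines:
        action = ACTION_RE.search(line)
        if action:
            pending.append((action.group(1), action.group(2)))
            continue
        transition = TRANSITION_RE.search(line)
        if transition:
            result[int(transition.group(1))] = pending
            pending = []
    return result


def has_role_change(actions):
    """True if the transition plans a demote or promote of any resource."""
    return any(verb in ("Demote", "Promote") for verb, _ in actions)


# Unwrapped excerpts of transitions 202 and 203 from the log above.
SAMPLE = [
    "2016-11-04T19:32:55.181252+01:00 ventsi-clst1 pengine[6115]: notice: Move "
    "IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)",
    "2016-11-04T19:32:55.181260+01:00 ventsi-clst1 pengine[6115]: notice: Move "
    "NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)",
    "2016-11-04T19:32:55.181278+01:00 ventsi-clst1 pengine[6115]: notice: Move "
    "DRBD_global_clst#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)",
    "2016-11-04T19:32:55.182385+01:00 ventsi-clst1 pengine[6115]: notice: "
    "Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2",
    "2016-11-04T19:32:56.845081+01:00 ventsi-clst1 pengine[6115]: notice: Demote "
    "DRBD:0#011(Master -> Slave ventsi-clst1-sync)",
    "2016-11-04T19:32:56.845083+01:00 ventsi-clst1 pengine[6115]: notice: Promote "
    "DRBD:1#011(Slave -> Master ventsi-clst2-sync)",
    "2016-11-04T19:32:56.845084+01:00 ventsi-clst1 pengine[6115]: notice: Recover "
    "DRBD_global_clst#011(Started ventsi-clst2-sync)",
    "2016-11-04T19:32:56.847986+01:00 ventsi-clst1 pengine[6115]: notice: "
    "Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2",
]

for number, actions in actions_per_transition(SAMPLE).items():
    print(number, actions, "DRBD role change planned:", has_role_change(actions))
```

With the excerpt above, transition 202 contains only the three Move actions, while the demote/promote first shows up in transition 203, which matches what I described.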
Any ideas what could be the reason for this behavior?
And how could this be fixed?
(I already found several articles on the internet recommending two separately
configured monitor operations for the DRBD resource, one for the master role
and another one for the slave role.
I have already tried this, to no avail.)
Regards
Andi