[ClusterLabs] DRBD demote/promote not called - Why? How to fix?

CART Andreas Fri, 04 Nov 2016 12:03:01 -0700

Hi

I have a basic two-node active/passive cluster with Pacemaker (1.1.14, pcs 
0.9.148) / CMAN (3.0.12.1) / Corosync (1.4.7) on RHEL 6.8.
This cluster runs NFS on top of DRBD (8.4.4).


Basically the system is working on both nodes, and I can switch the resources 
from one node to the other.
However, switching does not work if I move just one resource and expect the 
others to follow because of the colocation constraints.

From the logged messages I see that in this "failure case" there is NO attempt 
to demote/promote the DRBD clone resource.

Here is my setup:
==================================================================
Cluster Name: clst1
Corosync Nodes:
 ventsi-clst1-sync ventsi-clst2-sync
Pacemaker Nodes:
 ventsi-clst1-sync ventsi-clst2-sync

Resources:
 Resource: IPaddrNFS (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=xxx.xxx.xxx.xxx cidr_netmask=24
  Operations: start interval=0s timeout=20s (IPaddrNFS-start-interval-0s)
              stop interval=0s timeout=20s (IPaddrNFS-stop-interval-0s)
              monitor interval=5s (IPaddrNFS-monitor-interval-5s)
 Resource: NFSServer (class=ocf provider=heartbeat type=nfsserver)
  Attributes: nfs_shared_infodir=/var/lib/nfsserversettings/ 
nfs_ip=xxx.xxx.xxx.xxx nfsd_args="-H xxx.xxx.xxx.xxx"
  Operations: start interval=0s timeout=40 (NFSServer-start-interval-0s)
              stop interval=0s timeout=20s (NFSServer-stop-interval-0s)
              monitor interval=10s timeout=20s (NFSServer-monitor-interval-10s)
 Master: DRBDClone
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 
notify=true
  Resource: DRBD (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=nfsdata
   Operations: start interval=0s timeout=240 (DRBD-start-interval-0s)
               promote interval=0s timeout=90 (DRBD-promote-interval-0s)
               demote interval=0s timeout=90 (DRBD-demote-interval-0s)
               stop interval=0s timeout=100 (DRBD-stop-interval-0s)
               monitor interval=1s timeout=5 (DRBD-monitor-interval-1s)
 Resource: DRBD_global_clst (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/drbdmnts/global_clst fstype=ext4
  Operations: start interval=0s timeout=60 (DRBD_global_clst-start-interval-0s)
              stop interval=0s timeout=60 (DRBD_global_clst-stop-interval-0s)
              monitor interval=20 timeout=40 
(DRBD_global_clst-monitor-interval-20)

Stonith Devices:
 Resource: ipmi-fence-clst1 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot 
ipaddr=yyy.yyy.yyy.yyy pcmk_host_check=static-list 
pcmk_host_list=ventsi-clst1-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst1-monitor-interval-60s)
 Resource: ipmi-fence-clst2 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot 
ipaddr=zzz.zzz.zzz.zzz pcmk_host_check=static-list 
pcmk_host_list=ventsi-clst2-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: ipmi-fence-clst1
    Disabled on: ventsi-clst1-sync (score:-INFINITY) 
(id:location-ipmi-fence-clst1-ventsi-clst1-sync--INFINITY)
  Resource: ipmi-fence-clst2
    Disabled on: ventsi-clst2-sync (score:-INFINITY) 
(id:location-ipmi-fence-clst2-ventsi-clst2-sync--INFINITY)
Ordering Constraints:
  start IPaddrNFS then start NFSServer (kind:Mandatory) 
(id:order-IPaddrNFS-NFSServer-mandatory)
  promote DRBDClone then start DRBD_global_clst (kind:Mandatory) 
(id:order-DRBDClone-DRBD_global_clst-mandatory)
  start DRBD_global_clst then start IPaddrNFS (kind:Mandatory) 
(id:order-DRBD_global_clst-IPaddrNFS-mandatory)
Colocation Constraints:
  NFSServer with IPaddrNFS (score:INFINITY) 
(id:colocation-NFSServer-IPaddrNFS-INFINITY)
  DRBD_global_clst with DRBDClone (score:INFINITY) 
(id:colocation-DRBD_global_clst-DRBDClone-INFINITY)
  IPaddrNFS with DRBD_global_clst (score:INFINITY) 
(id:colocation-IPaddrNFS-DRBD_global_clst-INFINITY)

Resources Defaults:
 resource-stickiness: INFINITY
Operations Defaults:
 timeout: 10s

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6-70404b0
 have-watchdog: false
 last-lrm-refresh: 1478277432
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true
==================================================================

Initial state is e.g. this (all resources at node1):

Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1       (stonith:fence_ipmilan):        Started 
ventsi-clst2-sync
 ipmi-fence-clst2       (stonith:fence_ipmilan):        Started 
ventsi-clst1-sync
 IPaddrNFS      (ocf::heartbeat:IPaddr2):       Started ventsi-clst1-sync
 NFSServer      (ocf::heartbeat:nfsserver):     Started ventsi-clst1-sync
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst       (ocf::heartbeat:Filesystem):    Started 
ventsi-clst1-sync
==================================================================

If I shut down the cluster on node1 ('pcs cluster stop') or move the DRBD 
clone resource ('pcs resource move DRBDClone'), all resources switch 
successfully to node2.
That is, the demote/promote of the DRBD clone resource works in these cases.

But if I move any other resource (e.g. 'pcs resource move NFSServer'), 
the resources NFSServer, IPaddrNFS and DRBD_global_clst are stopped on node1, 
and the cluster then immediately tries to start DRBD_global_clst on node2, 
which fails because the demote/promote has not happened.
As far as I can see, a follow-up attempt then partially repairs things: 
the resources are started again on node1, except for the resource that I 
moved with my move command.

Final state is like this:

Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1       (stonith:fence_ipmilan):        Started 
ventsi-clst2-sync
 ipmi-fence-clst2       (stonith:fence_ipmilan):        Started 
ventsi-clst1-sync
 IPaddrNFS      (ocf::heartbeat:IPaddr2):       Started ventsi-clst1-sync
 NFSServer      (ocf::heartbeat:nfsserver):     Stopped
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst       (ocf::heartbeat:Filesystem):    Started 
ventsi-clst1-sync

Failed Actions:
* DRBD_global_clst_start_0 on ventsi-clst2-sync 'unknown error' (1): call=778, 
status=complete, exitreason='none',
    last-rc-change='Fri Nov  4 19:32:56 2016', queued=0ms, exec=43ms
==================================================================
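
A note on getting back to a clean state before retesting. In pcs 0.9.x, 'pcs resource move' without a destination works by adding a location constraint that bans the resource from its current node. That would explain why NFSServer stays stopped here, though this is an inference from the status above and not something the logs confirm. The sketch below prints the cleanup commands by default. The `run` helper and the `RUN` variable are illustrative names, not part of pcs.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command; set RUN=1 on a
# cluster node to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

reset_after_failed_move() {
    # Remove the location constraint left behind by 'pcs resource move NFSServer'.
    run pcs resource clear NFSServer
    # Clear the failed start of the filesystem recorded on ventsi-clst2-sync.
    run pcs resource cleanup DRBD_global_clst
    # List all remaining constraints with their ids to confirm the result.
    run pcs constraint show --full
}

reset_after_failed_move
```

Running it without RUN=1 only shows what would be executed, which is a cautious default for commands that change cluster state.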

Here are the logged messages for this "failure case":

2016-11-04T19:32:55.163982+01:00 ventsi-clst1 crmd[6116]:   notice: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
2016-11-04T19:32:55.168100+01:00 ventsi-clst1 pengine[6115]:   notice: On loss 
of CCM Quorum: Ignore
2016-11-04T19:32:55.181252+01:00 ventsi-clst1 pengine[6115]:   notice: Move    
IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181260+01:00 ventsi-clst1 pengine[6115]:   notice: Move    
NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181278+01:00 ventsi-clst1 pengine[6115]:   notice: Move    
DRBD_global_clst#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)  <=== here 
no demote/promote is listed
2016-11-04T19:32:55.182385+01:00 ventsi-clst1 pengine[6115]:   notice: 
Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2
2016-11-04T19:32:55.182998+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 15: stop NFSServer_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:55.196265+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: 
INFO: Stopping NFS server ...
2016-11-04T19:32:55.249137+01:00 ventsi-clst1 kernel: nfsd: last server has 
exited, flushing export cache
2016-11-04T19:32:55.252241+01:00 ventsi-clst1 rpc.mountd[15282]: Caught signal 
15, un-registering and exiting.
2016-11-04T19:32:55.632708+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: 
INFO: Stopping sm-notify
2016-11-04T19:32:55.650552+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: 
INFO: Stopping rpc.statd
2016-11-04T19:32:55.666777+01:00 ventsi-clst1 rpc.statd[15243]: Caught signal 
15, un-registering and exiting
2016-11-04T19:32:56.692819+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: 
INFO: NFS server stopped
2016-11-04T19:32:56.695523+01:00 ventsi-clst1 crmd[6116]:   notice: Operation 
NFSServer_stop_0: ok (node=ventsi-clst1-sync, call=1220, rc=0, cib-update=1695, 
confirmed=true)
2016-11-04T19:32:56.696243+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 12: stop IPaddrNFS_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.727882+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16108]: INFO: 
IP status = ok, IP_CIP=
2016-11-04T19:32:56.733383+01:00 ventsi-clst1 crmd[6116]:   notice: Operation 
IPaddrNFS_stop_0: ok (node=ventsi-clst1-sync, call=1222, rc=0, cib-update=1696, 
confirmed=true)
2016-11-04T19:32:56.733917+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 48: stop DRBD_global_clst_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.757181+01:00 ventsi-clst1 
Filesystem(DRBD_global_clst)[16163]: INFO: Running stop for /dev/drbd1 on 
/drbdmnts/global_clst
2016-11-04T19:32:56.764684+01:00 ventsi-clst1 
Filesystem(DRBD_global_clst)[16163]: INFO: Trying to unmount 
/drbdmnts/global_clst
2016-11-04T19:32:56.771260+01:00 ventsi-clst1 
Filesystem(DRBD_global_clst)[16163]: INFO: unmounted /drbdmnts/global_clst 
successfully
2016-11-04T19:32:56.776640+01:00 ventsi-clst1 crmd[6116]:   notice: Operation 
DRBD_global_clst_stop_0: ok (node=ventsi-clst1-sync, call=1224, rc=0, 
cib-update=1697, confirmed=true)
2016-11-04T19:32:56.777140+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 49: start DRBD_global_clst_start_0 on ventsi-clst2-sync   <=== here is 
the attempt to start the filesystem at the other node, although DRBD has not 
yet been promoted
2016-11-04T19:32:56.840137+01:00 ventsi-clst1 crmd[6116]:  warning: Action 49 
(DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1): 
Error
2016-11-04T19:32:56.840158+01:00 ventsi-clst1 crmd[6116]:   notice: Transition 
aborted by DRBD_global_clst_start_0 'modify' on ventsi-clst2-sync: Event failed 
(magic=0:1;49:202:0:b7941532-c74b-40cc-a8ad-27b5502b8fba, cib=0.649.4, 
source=match_graph_event:381, 0)
2016-11-04T19:32:56.840232+01:00 ventsi-clst1 crmd[6116]:  warning: Action 49 
(DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1): 
Error
2016-11-04T19:32:56.840328+01:00 ventsi-clst1 crmd[6116]:   notice: Transition 
202 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=5, 
Source=/var/lib/pacemaker/pengine/pe-input-766.bz2): Complete
2016-11-04T19:32:56.843693+01:00 ventsi-clst1 pengine[6115]:   notice: On loss 
of CCM Quorum: Ignore
2016-11-04T19:32:56.844072+01:00 ventsi-clst1 pengine[6115]:  warning: 
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown 
error (1)
2016-11-04T19:32:56.844102+01:00 ventsi-clst1 pengine[6115]:  warning: 
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown 
error (1)
2016-11-04T19:32:56.845071+01:00 ventsi-clst1 pengine[6115]:   notice: Start   
IPaddrNFS#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845078+01:00 ventsi-clst1 pengine[6115]:   notice: Start   
NFSServer#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845081+01:00 ventsi-clst1 pengine[6115]:   notice: Demote  
DRBD:0#011(Master -> Slave ventsi-clst1-sync)   <=== here there would be the 
necessary demote/promote ... but it's too late; the start of the filesystem 
already failed ...
2016-11-04T19:32:56.845083+01:00 ventsi-clst1 pengine[6115]:   notice: Promote 
DRBD:1#011(Slave -> Master ventsi-clst2-sync)
2016-11-04T19:32:56.845084+01:00 ventsi-clst1 pengine[6115]:   notice: Recover 
DRBD_global_clst#011(Started ventsi-clst2-sync)
2016-11-04T19:32:56.847986+01:00 ventsi-clst1 pengine[6115]:   notice: 
Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2   <=== 
... but this transition is superseded by the following attempt to repair 
things partially
2016-11-04T19:32:56.867679+01:00 ventsi-clst1 pengine[6115]:   notice: On loss 
of CCM Quorum: Ignore
2016-11-04T19:32:56.868074+01:00 ventsi-clst1 pengine[6115]:  warning: 
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown 
error (1)
2016-11-04T19:32:56.868101+01:00 ventsi-clst1 pengine[6115]:  warning: 
Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown 
error (1)
2016-11-04T19:32:56.868287+01:00 ventsi-clst1 pengine[6115]:  warning: Forcing 
DRBD_global_clst away from ventsi-clst2-sync after 1000000 failures 
(max=1000000)
2016-11-04T19:32:56.869011+01:00 ventsi-clst1 pengine[6115]:   notice: Start   
IPaddrNFS#011(ventsi-clst1-sync)
2016-11-04T19:32:56.869023+01:00 ventsi-clst1 pengine[6115]:   notice: Recover 
DRBD_global_clst#011(Started ventsi-clst2-sync -> ventsi-clst1-sync)
2016-11-04T19:32:56.869770+01:00 ventsi-clst1 pengine[6115]:   notice: 
Calculated Transition 204: /var/lib/pacemaker/pengine/pe-input-768.bz2
2016-11-04T19:32:56.870065+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 3: stop DRBD_global_clst_stop_0 on ventsi-clst2-sync
2016-11-04T19:32:56.908075+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 42: start DRBD_global_clst_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.931072+01:00 ventsi-clst1 
Filesystem(DRBD_global_clst)[16242]: INFO: Running start for /dev/drbd1 on 
/drbdmnts/global_clst
2016-11-04T19:32:56.943250+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): warning: 
maximal mount count reached, running e2fsck is recommended
2016-11-04T19:32:56.953253+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): mounted 
filesystem with ordered data mode. Opts:
2016-11-04T19:32:56.964284+01:00 ventsi-clst1 crmd[6116]:   notice: Operation 
DRBD_global_clst_start_0: ok (node=ventsi-clst1-sync, call=1225, rc=0, 
cib-update=1701, confirmed=true)
2016-11-04T19:32:56.965104+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 10: start IPaddrNFS_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.965325+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 43: monitor DRBD_global_clst_monitor_20000 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.996235+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: 
Adding inet address xxx.xxx.xxx.xxx/24 with broadcast address xxx.xxx.xxx.255 
to device bond0
2016-11-04T19:32:57.002059+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: 
Bringing device bond0 up
2016-11-04T19:32:57.008128+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-xxx.xxx.xxx.xxx bond0 xxx.xxx.xxx.xxx auto 
not_used not_used
2016-11-04T19:32:57.020159+01:00 ventsi-clst1 crmd[6116]:   notice: Operation 
IPaddrNFS_start_0: ok (node=ventsi-clst1-sync, call=1226, rc=0, 
cib-update=1703, confirmed=true)
2016-11-04T19:32:57.020901+01:00 ventsi-clst1 crmd[6116]:   notice: Initiating 
action 11: monitor IPaddrNFS_monitor_5000 on ventsi-clst1-sync (local)
2016-11-04T19:32:57.052231+01:00 ventsi-clst1 crmd[6116]:   notice: Transition 
204 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-768.bz2): Complete
2016-11-04T19:32:57.052251+01:00 ventsi-clst1 crmd[6116]:   notice: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
==================================================================
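
The log names the policy-engine input saved for each transition, so the decision can be replayed offline. A minimal sketch, assuming the pe-input file from transition 202 is still present on the DC node. `crm_simulate` ships with Pacemaker; the `run` helper and the `RUN` variable are illustrative only.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command; set RUN=1 on a
# cluster node to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

replay_transition() {
    # $1: a saved policy-engine input file, as named in the pengine log lines.
    # --simulate replays the transition, --show-scores prints allocation scores.
    run crm_simulate --simulate --show-scores --xml-file "$1"
}

# Transition 202 is the one that moved the filesystem without demote/promote.
replay_transition /var/lib/pacemaker/pengine/pe-input-766.bz2
```

The scores in the replay should show where DRBD_global_clst is allowed to run and whether a promotion of DRBDClone on node2 was considered in that transition.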

Any ideas what could be the reason for this behavior?
And how could this be fixed?
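
One thing that may be worth checking, offered as an assumption and not a confirmed diagnosis: the colocation constraint "DRBD_global_clst with DRBDClone" in the configuration above does not name a role. Without a role, the filesystem may be placed next to any DRBD instance, including the slave on node2, so the policy engine can see no need to promote there first. The sketch below replaces that constraint with one tied to the master role. The constraint id is taken from the listing above; the `run` helper and the `RUN` variable are illustrative names.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command; set RUN=1 on a
# cluster node to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

colocate_fs_with_master() {
    # Drop the role-less colocation shown in the constraint listing.
    run pcs constraint remove colocation-DRBD_global_clst-DRBDClone-INFINITY
    # Re-add it so the filesystem must run where DRBDClone is master.
    run pcs constraint colocation add DRBD_global_clst with master DRBDClone INFINITY
    # Show the result, including constraint ids and roles.
    run pcs constraint show --full
}

colocate_fs_with_master
```

With the role in place, moving NFSServer should require the master to move as well, which is the demote/promote that is missing from transition 202.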


(I already found several articles on the internet recommending two 
separately configured monitor operations for the DRBD resource, 
one for the master role and another for the slave role.
I already tried this, to no avail.)
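
For reference, the two-monitor setup mentioned above could look roughly like the following. This is a sketch only: the operation id comes from the configuration listing, while the intervals and timeout are made-up illustrative values, chosen to differ because two monitor operations on the same resource need distinct intervals. The `run` helper and the `RUN` variable are illustrative names.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command; set RUN=1 on a
# cluster node to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

add_role_monitors() {
    # Remove the single role-less monitor from the listing above.
    run pcs resource op remove DRBD-monitor-interval-1s
    # One monitor per role; the interval and timeout values are illustrative.
    run pcs resource op add DRBD monitor interval=29s role=Master timeout=20s
    run pcs resource op add DRBD monitor interval=31s role=Slave timeout=20s
}

add_role_monitors
```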

Regards
Andi

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
