I may be wrong, but shouldn't "clone-node-max" be 2 on the ms_drbd_vmfs resource?
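
For anyone wanting to check this, here is a rough, untested sketch of how the value could be inspected and changed with pcs. The resource name comes from the quoted config below; I'm assuming `pcs resource show` and `pcs resource meta` behave on pcs 0.9.x as documented. The Pacemaker docs describe clone-node-max as the number of copies allowed on a single node, so with clone-max=2 over two nodes the existing value of 1 may already be enough. Treat this as a way to test the idea, not as a confirmed fix.

```shell
#!/bin/sh
# Sketch only: inspect (and optionally change) clone-node-max on the
# master/slave set. The resource name is taken from the quoted pcs config.
RES=ms_drbd_vmfs

if command -v pcs >/dev/null 2>&1; then
    # Print the current configuration, including the Meta Attrs line.
    pcs resource show "$RES"

    # Uncomment to try the suggested value; revert with clone-node-max=1.
    # pcs resource meta "$RES" clone-node-max=2
else
    # Not on a cluster node: do nothing except say so.
    echo "pcs not found; run this on a cluster node" >&2
fi
```

If the change makes no difference, setting it back to 1 returns the resource to the configuration shown in the quoted output.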

Luke Pascoe

*E* l...@osnz.co.nz
*P* +64 (9) 296 2961
*M* +64 (27) 426 6649
*W* www.osnz.co.nz

24 Wellington St
Papakura
Auckland, 2110
New Zealand

On 18 September 2015 at 11:02, Jason Gress <jgr...@accertify.com> wrote:

> I have a simple DRBD + filesystem + NFS configuration that works properly
> when I manually start/stop DRBD, but will not start the DRBD slave resource
> properly on failover or recovery. I cannot ever get the Master/Slave set
> to say anything but 'Stopped'. I am running CentOS 7.1 with the latest
> packages as of today:
>
> [root@fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
> pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
> pacemaker-1.1.12-22.el7_1.4.x86_64
> pcs-0.9.137-13.el7_1.4.x86_64
> pacemaker-libs-1.1.12-22.el7_1.4.x86_64
> drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
> pacemaker-cli-1.1.12-22.el7_1.4.x86_64
> kmod-drbd84-8.4.6-1.el7.elrepo.x86_64
>
> Here is my pcs config output:
>
> [root@fx201-1a log]# pcs config
> Cluster Name: fx201-vmcl
> Corosync Nodes:
>  fx201-1a.ams fx201-1b.ams
> Pacemaker Nodes:
>  fx201-1a.ams fx201-1b.ams
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.XX.XX.XX cidr_netmask=24
>   Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
>               stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
>               monitor interval=15s (ClusterIP-monitor-interval-15s)
>  Master: ms_drbd_vmfs
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=vmfs
>    Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
>                promote interval=0s timeout=90 (drbd_vmfs-promote-timeout-90)
>                demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
>                stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
>                monitor interval=30s (drbd_vmfs-monitor-interval-30s)
>  Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
>   Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
>               stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
>               monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
>  Resource: nfs-server (class=systemd type=nfs-server)
>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory) (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
>   start vmfsFS then start nfs-server (kind:Mandatory) (id:order-vmfsFS-nfs-server-mandatory)
>   start ClusterIP then start nfs-server (kind:Mandatory) (id:order-ClusterIP-nfs-server-mandatory)
> Colocation Constraints:
>   ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
>   vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
>   nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: fx201-vmcl
>  dc-version: 1.1.13-a14efad
>  have-watchdog: false
>  last-lrm-refresh: 1442528181
>  stonith-enabled: false
>
> And status:
>
> [root@fx201-1a log]# pcs status --full
> Cluster name: fx201-vmcl
> Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10 2015 by root via crm_attribute on fx201-1b.ams
> Stack: corosync
> Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with quorum
> 2 nodes and 5 resources configured
>
> Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]
>
> Full list of resources:
>
>  ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
>  Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>      drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
>      drbd_vmfs (ocf::linbit:drbd): Stopped
>      Masters: [ fx201-1a.ams ]
>      Stopped: [ fx201-1b.ams ]
>  vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
>  nfs-server (systemd:nfs-server): Started fx201-1a.ams
>
> PCSD Status:
>   fx201-1a.ams: Online
>   fx201-1b.ams: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> If I do a failover, after manually confirming that the DRBD data is
> synchronized completely, it does work, but then never reconnects the
> secondary side, and in order to get the resource synchronized again, I have
> to manually correct it, ad infinitum. I have tried standby/unstandby, pcs
> resource debug-start (with undesirable results), and so on.
>
> Here are some relevant log messages from pacemaker.log:
>
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_online_status: Node fx201-1b.ams is online
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_online_status: Node fx201-1a.ams is online
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active in master mode on fx201-1b.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: short_print: Masters: [ fx201-1a.ams ]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: short_print: Stopped: [ fx201-1b.ams ]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: nfs-server (systemd:nfs-server): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_color: Resource drbd_vmfs:1 cannot run anywhere
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave ClusterIP (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave drbd_vmfs:0 (Master fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave drbd_vmfs:1 (Stopped)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave vmfsFS (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave nfs-server (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-61.bz2
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived from /var/lib/pacemaker/pengine/pe-input-61.bz2
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> Thank you all for your help,
>
> Jason
>
>
> "This message and any attachments may contain confidential information. If you
> have received this message in error, any use or distribution is prohibited.
> Please notify us by reply e-mail if you have mistakenly received this message,
> and immediately and permanently delete it and any attachments. Thank you."
>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org