On 16.03.2016 18:51, Tim Walberg wrote:
> Is there a way to make this work properly without STONITH? I forgot to
> mention that both nodes are virtual machines (QEMU/KVM), which makes
> STONITH a minor challenge. Also, since these symptoms occur even under
> "pcs cluster standby", where STONITH *shouldn't* be invoked, I'm not
> sure if that's the entire answer.
There are quite a few fence agents that work through the hypervisor
hosting the VMs, e.g. fence_pve for Proxmox VE virtual machines; VMware,
VirtualBox and Xen are also covered, and libvirt should be as well,
though I don't know for sure. See:

https://github.com/ClusterLabs/fence-agents
https://fedorahosted.org/cluster/wiki/fence-agents

For me this is a fairly easy way to set up fencing, and I use it quite
often for tests. I haven't set Pacemaker up with such an agent myself,
but I see no problem that would prevent it.

cheers,
Thomas

> On 03/16/2016 13:34 -0400, Digimer wrote:
>> On 16/03/16 01:17 PM, Tim Walberg wrote:
>>> Having an issue on a newly built CentOS 7.2.1511 NFS cluster with
>>> DRBD (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this
>>> point, the resources consist of a cluster address, a DRBD device
>>> mirroring between the two cluster nodes, the file system, and the
>>> nfs-server resource. The resources all behave properly until an
>>> extended failover or outage.
>>>
>>> I have tested failover in several ways ("pcs cluster standby", "pcs
>>> cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger",
>>> etc.) and the symptoms are that, until the killed node is brought
>>> back into the cluster, failover never seems to complete.
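To make the hypervisor-based fencing suggestion above concrete, here is a
minimal fence_xvm sketch for plain libvirt/KVM guests. It assumes
fence_virtd is already configured on the host, the libvirt domain names
match the cluster node names, and multicast passes between guest and
host -- a sketch, untested here:

```shell
# On each VM: install the agent; /etc/cluster/fence_xvm.key must be a
# copy of the key fence_virtd uses on the hypervisor.
yum install -y fence-virt

# Sanity check: the agent should be able to list the guests on the host.
fence_xvm -o list

# Create the stonith resource and turn fencing on. The host map entries
# are "cluster-node-name:libvirt-domain-name" (assumed identical here).
pcs stonith create fence_vm fence_xvm \
    pcmk_host_map="nfsnode01:nfsnode01;nfsnode02:nfsnode02" \
    key_file=/etc/cluster/fence_xvm.key
pcs property set stonith-enabled=true

# Test it before trusting it: this should power-cycle nfsnode02.
pcs stonith fence nfsnode02
```

These are cluster configuration commands, so treat them as a starting
point and verify each step against your own environment.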
>>> The DRBD device appears on the remaining node to be in a
>>> "Secondary/Unknown" state, and the resources end up looking like:
>>>
>>> # pcs status
>>> Cluster name: nfscluster
>>> Last updated: Wed Mar 16 12:05:33 2016
>>> Last change: Wed Mar 16 12:04:46 2016 by root via cibadmin on nfsnode01
>>> Stack: corosync
>>> Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
>>> 2 nodes and 5 resources configured
>>>
>>> Online:  [ nfsnode01 ]
>>> OFFLINE: [ nfsnode02 ]
>>>
>>> Full list of resources:
>>>
>>>  nfsVIP     (ocf::heartbeat:IPaddr2):    Started nfsnode01
>>>  nfs-server (systemd:nfs-server):        Stopped
>>>  Master/Slave Set: drbd_master [drbd_dev]
>>>      Slaves:  [ nfsnode01 ]
>>>      Stopped: [ nfsnode02 ]
>>>  drbd_fs    (ocf::heartbeat:Filesystem): Stopped
>>>
>>> PCSD Status:
>>>   nfsnode01: Online
>>>   nfsnode02: Online
>>>
>>> Daemon Status:
>>>   corosync: active/enabled
>>>   pacemaker: active/enabled
>>>   pcsd: active/enabled
>>>
>>> As soon as I bring the second node back online, the failover
>>> completes. But this is obviously not a good state, as an extended
>>> outage for any reason on one node essentially kills the cluster
>>> services. There's obviously something I've missed in configuring the
>>> resources, but I haven't been able to pinpoint it yet.
>>>
>>> Perusing the logs, it appears that, upon the initial failure, DRBD
>>> does in fact promote the drbd_master resource, but immediately after
>>> that, pengine calls for it to be demoted, for reasons I haven't been
>>> able to determine yet but which seem to be tied to the fencing
>>> configuration. I can see that the crm-fence-peer.sh script is called,
>>> but it almost seems like it's fencing the wrong node... Indeed, I do
>>> see that it adds a -INFINITY location constraint for the surviving
>>> node, which would explain the decision to demote the DRBD master.
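A side note on inspecting that -INFINITY constraint: the fence handler
writes it into the CIB, so you can see exactly which node it targets.
The constraint id below follows the usual drbd-fence-by-handler naming
pattern, but it is an assumption -- check your own output rather than
trusting it:

```shell
# Show all constraints with their ids; the handler's entry usually looks
# like drbd-fence-by-handler-<drbd-resource>-<master-resource-id>.
pcs constraint --full

# After the peer is back and resynced, crm-unfence-peer.sh should remove
# it; if it lingers, it can be cleared by id (id assumed per the pattern):
pcs constraint remove drbd-fence-by-handler-drbd0-drbd_master
```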
>>>
>>> My DRBD resource looks like this:
>>>
>>> # cat /etc/drbd.d/drbd0.res
>>> resource drbd0 {
>>>
>>>     protocol C;
>>>     startup { wfc-timeout 0; degr-wfc-timeout 120; }
>>>
>>>     disk {
>>>         on-io-error detach;
>>>         fencing resource-only;
>>
>> This should be 'resource-and-stonith;', but alone it won't do anything
>> until pacemaker's stonith is working.
>>
>>>     }
>>>
>>>     handlers {
>>>         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>>         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>>     }
>>>
>>>     on nfsnode01 {
>>>         device /dev/drbd0;
>>>         disk /dev/vg_nfs/lv_drbd0;
>>>         meta-disk internal;
>>>         address 10.0.0.2:7788;
>>>     }
>>>
>>>     on nfsnode02 {
>>>         device /dev/drbd0;
>>>         disk /dev/vg_nfs/lv_drbd0;
>>>         meta-disk internal;
>>>         address 10.0.0.3:7788;
>>>     }
>>> }
>>>
>>> If I comment out the three lines having to do with fencing, the
>>> failover works properly. But I'd prefer to have the fencing there on
>>> the off chance that we end up with a split brain instead of just a
>>> node outage...
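For reference, the fencing-related pieces of drbd0.res with Digimer's
suggested change applied would look roughly like this -- a sketch of the
edit, not a tested configuration:

```
disk {
    on-io-error detach;
    fencing resource-and-stonith;   # was: resource-only
}

handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```

With resource-and-stonith, DRBD freezes I/O on the surviving node until
the fence handler confirms the peer has been dealt with, which is why it
only makes sense once Pacemaker's stonith actually works.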
>>>
>>> And, here's "pcs config --full":
>>>
>>> # pcs config --full
>>> Cluster Name: nfscluster
>>> Corosync Nodes:
>>>  nfsnode01 nfsnode02
>>> Pacemaker Nodes:
>>>  nfsnode01 nfsnode02
>>>
>>> Resources:
>>>  Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: ip=10.0.0.1 cidr_netmask=24
>>>   Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
>>>               stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
>>>               monitor interval=15s (nfsVIP-monitor-interval-15s)
>>>  Resource: nfs-server (class=systemd type=nfs-server)
>>>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>>>  Master: drbd_master
>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>>   Resource: drbd_dev (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=drbd0
>>>    Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
>>>                promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
>>>                demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
>>>                stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
>>>                monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)
>>>                monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)
>>>  Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
>>>   Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
>>>   Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
>>>               stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
>>>               monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   start nfsVIP then start nfs-server (kind:Mandatory)
>>>     (id:order-nfsVIP-nfs-server-mandatory)
>>>   start drbd_fs then start nfs-server (kind:Mandatory)
>>>     (id:order-drbd_fs-nfs-server-mandatory)
>>>   promote drbd_master then start drbd_fs (kind:Mandatory)
>>>     (id:order-drbd_master-drbd_fs-mandatory)
>>> Colocation Constraints:
>>>   nfs-server with nfsVIP (score:INFINITY)
>>>     (id:colocation-nfs-server-nfsVIP-INFINITY)
>>>   nfs-server with drbd_fs (score:INFINITY)
>>>     (id:colocation-nfs-server-drbd_fs-INFINITY)
>>>   drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master)
>>>     (id:colocation-drbd_fs-drbd_master-INFINITY)
>>>
>>> Resources Defaults:
>>>  resource-stickiness: 100
>>>  failure-timeout: 60
>>> Operations Defaults:
>>>  No defaults set
>>>
>>> Cluster Properties:
>>>  cluster-infrastructure: corosync
>>>  cluster-name: nfscluster
>>>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>>  have-watchdog: false
>>>  maintenance-mode: false
>>>  stonith-enabled: false
>>
>> Configure *and test* stonith in pacemaker first, then DRBD will hook
>> into it and use it properly. DRBD simply asks pacemaker to do the
>> fence, but you currently don't have it set up.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>> _______________________________________________
>> drbd-user mailing list
>> [email protected]
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
> End of included message

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
