Re: [DRBD-user] DRBD fencing prevents resource promotion in active/passive cluster

2016-09-20 Thread Lars Ellenberg
On Tue, Sep 20, 2016 at 12:25:55PM +, Auer, Jens wrote:
> Hi,
> 
> > Don't disable fencing!
> 
> > You need to configure and test stonith in pacemaker. Once that's
> > working, then you set DRBD's fencing to 'resource-and-stonith;' and
> > configure the 'crm-{un,}fence-peer.sh' un/fence handlers.
> 
> > With this, if a node fails (and no, redundant network links is not
> > enough, nodes can die in many ways), then drbd will block when the peer
> > is lost, call the fence handler and wait for pacemaker to report back
> > that the fence action was completed. This way, you will never get a
> > split-brain and you will get reliable recovery.
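
For reference, a minimal sketch of that setup in drbd.conf, assuming DRBD 8.4
and the handler scripts shipped with drbd-utils (resource name taken from the
pacemaker configuration further down; on DRBD 9 the 'fencing' option lives in
the net section instead of disk):

    resource shared_fs {
      disk {
        fencing resource-and-stonith;    # freeze I/O and call the fence handler on peer loss
      }
      handlers {
        # adds a location constraint in the CIB that blocks promotion of the stale peer
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint again once the peer is fully resynced
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      # ... existing volume/on-host sections unchanged ...
    }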
> 
> While we will eventually configure fencing (and I know that nodes can
> fail in many ways), it should not influence the test I am doing
> because the nodes are not in any unknown state. I have three
> independent network connections: one for DRBD, one for corosync
> heartbeats, and one for data. In the test, I stop the cluster node
> manually with 'pcs cluster stop'. I don't think this should trigger
> STONITH or fencing, but DRBD permanently fails to get promoted.

The fencing constraint has been created at some point in time,
probably correctly.

But apparently it has never been removed, possibly for good reasons,
possibly by accident (not enough information to guess that).

The fencing constraint is supposed to be removed
once that drbd resource is fully synced up again.

Go over your logs, find the invocation of the "unfence",
and figure out why it did not work at that time.
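
For example (the constraint id below is only a guess based on the resource and
master/slave names in this thread; use whatever id "pcs constraint --full"
actually reports):

    # any leftover constraint set by the fence handler starts with this prefix
    pcs constraint --full | grep drbd-fence-by-handler

    # find the fence/unfence handler invocations around the time of the test
    grep -E 'crm-(un)?fence-peer' /var/log/messages

    # once both sides are UpToDate again, a stale constraint can be removed by hand
    pcs constraint remove drbd-fence-by-handler-shared_fs-drbd1_sync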

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


Re: [DRBD-user] DRBD fencing prevents resource promotion in active/passive cluster

2016-09-20 Thread Auer, Jens
Hi,

> Don't disable fencing!

> You need to configure and test stonith in pacemaker. Once that's
> working, then you set DRBD's fencing to 'resource-and-stonith;' and
> configure the 'crm-{un,}fence-peer.sh' un/fence handlers.

> With this, if a node fails (and no, redundant network links is not
> enough, nodes can die in many ways), then drbd will block when the peer
> is lost, call the fence handler and wait for pacemaker to report back
> that the fence action was completed. This way, you will never get a
> split-brain and you will get reliable recovery.

While we will eventually configure fencing (and I know that nodes can fail in
many ways), it should not influence the test I am doing because the nodes are
not in any unknown state. I have three independent network connections: one for
DRBD, one for corosync heartbeats, and one for data. In the test, I stop the
cluster node manually with 'pcs cluster stop'. I don't think this should
trigger STONITH or fencing, but DRBD permanently fails to get promoted.
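
One way to rule out a sync issue on the surviving node while the promotion is
refused (resource name as in the cluster configuration below; DRBD 8.4 style
commands):

    cat /proc/drbd              # cs:, ro: and ds: for the device
    drbdadm cstate shared_fs    # connection state
    drbdadm dstate shared_fs    # disk state; the constraint is only removed once the resource is fully synced again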

Cheers,
  Jens


Re: [DRBD-user] DRBD fencing prevents resource promotion in active/passive cluster

2016-09-20 Thread Digimer
On 20/09/16 07:07 AM, Auer, Jens wrote:
> Hi,
> 
> I am using a drbd device in an active/passive cluster setup with pacemaker.
> We have dedicated connections for corosync heartbeats, drbd and a 10GbE data
> connection:
> - A bonded 10GbE network interface for data traffic that is accessed via a
> virtual IP managed by pacemaker in 192.168.120.0/24. On this network the
> cluster nodes MDA1PFP-S01 and MDA1PFP-S02 are assigned 192.168.120.10 and
> 192.168.120.11.
> 
> - A dedicated back-to-back connection for corosync heartbeats in
> 192.168.121.0/24. MDA1PFP-PCS01 and MDA1PFP-PCS02 are assigned
> 192.168.121.10 and 192.168.121.11. When the cluster is created, we use these
> as the primary node names and use the 10GbE device as a second backup
> connection for increased reliability: pcs cluster setup --name MDA1PFP
> MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02
> 
> - A dedicated back-to-back connection for drbd in 192.168.123.0/24. Hosts
> MDA1PFP-DRBD01 and MDA1PFP-DRBD02 are assigned 192.168.123.10 and
> 192.168.123.11.
> 
> In my tests, I force a failover by:
> 1. Shutting down the cluster node running the master with pcs cluster stop
> 2. Disabling the network device for the virtual IP with ifdown and waiting
> until ping detects it
> 
> The initial state of the cluster is:
> MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status
> Cluster name: MDA1PFP
> Last updated: Fri Sep 16 14:41:18 2016    Last change: Fri Sep 16
> 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
> Stack: corosync
> Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with 
> quorum
> 2 nodes and 7 resources configured
> 
> Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
> 
> Full list of resources:
> 
>  Master/Slave Set: drbd1_sync [drbd1]
>      Masters: [ MDA1PFP-PCS02 ]
>      Slaves: [ MDA1PFP-PCS01 ]
>  mda-ip        (ocf::heartbeat:IPaddr2):       Started MDA1PFP-PCS02
>  Clone Set: ping-clone [ping]
>      Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
>  ACTIVE        (ocf::heartbeat:Dummy):         Started MDA1PFP-PCS02
>  shared_fs     (ocf::heartbeat:Filesystem):    Started MDA1PFP-PCS02
> 
> PCSD Status:
>   MDA1PFP-PCS01: Online
>   MDA1PFP-PCS02: Online
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full
>  Master: drbd1_sync
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 
> notify=true
>   Resource: drbd1 (class=ocf provider=linbit type=drbd)
>Attributes: drbd_resource=shared_fs 
>Operations: start interval=0s timeout=240 (drbd1-start-interval-0s)
>promote interval=0s timeout=90 (drbd1-promote-interval-0s)
>demote interval=0s timeout=90 (drbd1-demote-interval-0s)
>stop interval=0s timeout=100 (drbd1-stop-interval-0s)
>monitor interval=60s (drbd1-monitor-interval-60s)
>  Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=192.168.120.20 cidr_netmask=32 nic=bond0
>   Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s)
>   stop interval=0s timeout=20s (mda-ip-stop-interval-0s)
>   monitor interval=1s (mda-ip-monitor-interval-1s)
>  Clone: ping-clone
>   Resource: ping (class=ocf provider=pacemaker type=ping)
>Attributes: dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 
> attempts=3 
>Operations: start interval=0s timeout=60 (ping-start-interval-0s)
>stop interval=0s timeout=20 (ping-stop-interval-0s)
>monitor interval=1 (ping-monitor-interval-1)
>  Resource: ACTIVE (class=ocf provider=heartbeat type=Dummy)
>   Operations: start interval=0s timeout=20 (ACTIVE-start-interval-0s)
>   stop interval=0s timeout=20 (ACTIVE-stop-interval-0s)
>   monitor interval=10 timeout=20 (ACTIVE-monitor-interval-10)
>  Resource: shared_fs (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/drbd1 directory=/shared_fs fstype=xfs
>   Operations: start interval=0s timeout=60 (shared_fs-start-interval-0s)
>   stop interval=0s timeout=60 (shared_fs-stop-interval-0s)
>   monitor interval=20 timeout=40 (shared_fs-monitor-interval-20)
> 
> MDA1PFP-S01 14:41:35 1805 0 ~ # pcs constraint --full
> Location Constraints:
>   Resource: mda-ip
> Enabled on: MDA1PFP-PCS01 (score:50) (id:location-mda-ip-MDA1PFP-PCS01-50)
> Constraint: location-mda-ip
>   Rule: score=-INFINITY boolean-op=or  (id:location-mda-ip-rule)
> Expression: pingd lt 1  (id:location-mda-ip-rule-expr)
> Expression: not_defined pingd  (id:location-mda-ip-rule-expr-1) 
> Ordering Constraints:
>   start ping-clone then start mda-ip (kind:Optional) 
> (id:order-ping-clone-mda-ip-Optional)
>   promote drbd1_sync then start shared_fs (kind:Mandatory) 
> (id:order-drbd1_sync-shared_fs-mandatory)
> Colocation Constraints:
>   ACTIVE with mda-ip (score:INFINITY) (id:colocation-ACTIVE-mda-ip-INFINITY)
>  

[DRBD-user] DRBD fencing prevents resource promotion in active/passive cluster

2016-09-20 Thread Auer, Jens
Hi,

I am using a drbd device in an active/passive cluster setup with pacemaker. We
have dedicated connections for corosync heartbeats, drbd and a 10GbE data
connection:
- A bonded 10GbE network interface for data traffic that is accessed via a
virtual IP managed by pacemaker in 192.168.120.0/24. On this network the cluster
nodes MDA1PFP-S01 and MDA1PFP-S02 are assigned 192.168.120.10 and 192.168.120.11.

- A dedicated back-to-back connection for corosync heartbeats in
192.168.121.0/24. MDA1PFP-PCS01 and MDA1PFP-PCS02 are assigned 192.168.121.10
and 192.168.121.11. When the cluster is created, we use these as the primary
node names and use the 10GbE device as a second backup connection for increased
reliability: pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01
MDA1PFP-PCS02,MDA1PFP-S02

- A dedicated back-to-back connection for drbd in 192.168.123.0/24. Hosts
MDA1PFP-DRBD01 and MDA1PFP-DRBD02 are assigned 192.168.123.10 and 192.168.123.11.
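
For clarity, the resulting name/address mapping, e.g. as it would appear in
/etc/hosts (addresses as listed above, one name per host and network):

    192.168.120.10  MDA1PFP-S01      # bonded 10GbE data network
    192.168.120.11  MDA1PFP-S02
    192.168.121.10  MDA1PFP-PCS01    # corosync heartbeat link
    192.168.121.11  MDA1PFP-PCS02
    192.168.123.10  MDA1PFP-DRBD01   # drbd replication link
    192.168.123.11  MDA1PFP-DRBD02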

In my tests, I force a failover by:
1. Shutting down the cluster node running the master with pcs cluster stop
2. Disabling the network device for the virtual IP with ifdown and waiting until
ping detects it
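
Spelled out with the node and interface names used here (which node to run them
on depends on where the master and the virtual IP currently sit; see the status
output below):

    # 1. stop the cluster stack on the node that currently runs the DRBD master
    pcs cluster stop

    # 2. take down the bonded data interface carrying the virtual IP
    ifdown bond0

    # then watch the failover and the DRBD roles from the surviving node
    pcs status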

The initial state of the cluster is:
MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Fri Sep 16 14:41:18 2016    Last change: Fri Sep 16 14:39:49
2016 by root via cibadmin on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with 
quorum
2 nodes and 7 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 Master/Slave Set: drbd1_sync [drbd1]
     Masters: [ MDA1PFP-PCS02 ]
     Slaves: [ MDA1PFP-PCS01 ]
 mda-ip        (ocf::heartbeat:IPaddr2):       Started MDA1PFP-PCS02
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
 ACTIVE        (ocf::heartbeat:Dummy):         Started MDA1PFP-PCS02
 shared_fs     (ocf::heartbeat:Filesystem):    Started MDA1PFP-PCS02

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full
 Master: drbd1_sync
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 
notify=true
  Resource: drbd1 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=shared_fs 
   Operations: start interval=0s timeout=240 (drbd1-start-interval-0s)
   promote interval=0s timeout=90 (drbd1-promote-interval-0s)
   demote interval=0s timeout=90 (drbd1-demote-interval-0s)
   stop interval=0s timeout=100 (drbd1-stop-interval-0s)
   monitor interval=60s (drbd1-monitor-interval-60s)
 Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=192.168.120.20 cidr_netmask=32 nic=bond0
  Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s)
  stop interval=0s timeout=20s (mda-ip-stop-interval-0s)
  monitor interval=1s (mda-ip-monitor-interval-1s)
 Clone: ping-clone
  Resource: ping (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 
attempts=3 
   Operations: start interval=0s timeout=60 (ping-start-interval-0s)
   stop interval=0s timeout=20 (ping-stop-interval-0s)
   monitor interval=1 (ping-monitor-interval-1)
 Resource: ACTIVE (class=ocf provider=heartbeat type=Dummy)
  Operations: start interval=0s timeout=20 (ACTIVE-start-interval-0s)
  stop interval=0s timeout=20 (ACTIVE-stop-interval-0s)
  monitor interval=10 timeout=20 (ACTIVE-monitor-interval-10)
 Resource: shared_fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/shared_fs fstype=xfs
  Operations: start interval=0s timeout=60 (shared_fs-start-interval-0s)
  stop interval=0s timeout=60 (shared_fs-stop-interval-0s)
  monitor interval=20 timeout=40 (shared_fs-monitor-interval-20)

MDA1PFP-S01 14:41:35 1805 0 ~ # pcs constraint --full
Location Constraints:
  Resource: mda-ip
Enabled on: MDA1PFP-PCS01 (score:50) (id:location-mda-ip-MDA1PFP-PCS01-50)
Constraint: location-mda-ip
  Rule: score=-INFINITY boolean-op=or  (id:location-mda-ip-rule)
Expression: pingd lt 1  (id:location-mda-ip-rule-expr)
Expression: not_defined pingd  (id:location-mda-ip-rule-expr-1) 
Ordering Constraints:
  start ping-clone then start mda-ip (kind:Optional) 
(id:order-ping-clone-mda-ip-Optional)
  promote drbd1_sync then start shared_fs (kind:Mandatory) 
(id:order-drbd1_sync-shared_fs-mandatory)
Colocation Constraints:
  ACTIVE with mda-ip (score:INFINITY) (id:colocation-ACTIVE-mda-ip-INFINITY)
  drbd1_sync with mda-ip (score:INFINITY) (rsc-role:Master) 
(with-rsc-role:Started) (id:colocation-drbd1_sync-mda-ip-INFINITY)
  shared_fs with drbd1_sync (score:INFINITY) (rsc-role:Started) 
(with-rsc-role:Master) (id:colocation-shared_fs-drbd1_sync