[ClusterLabs] Need help debugging a STONITH resource

Casey & Gina Wed, 11 Jul 2018 11:56:19 -0700

I have a number of clusters in a VMware ESX environment, all set up following 
the same steps (unless I made a mistake on some of them without realizing it).


The issue I am facing is that on some of the clusters, after adding the STONITH 
resource, testing with `stonith_admin -F <node_hostname>` is failing with the 
error "Command failed: No route to host".  Executing it with --verbose adds no 
additional output.

The stonith plugin I am using is external/vcenter, which in turn uses the 
vSphere CLI package.  I'm not certain what command it is trying to run, or how 
to debug this further.  It does not appear to be an ESX issue, since the same 
command works fine on other clusters.

Here is the output of `pcs config`:

------
Cluster Name: d-gp2-dbpg35
Corosync Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
Pacemaker Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.167.158 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg35
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: false
Node Attributes:
 d-gp2-dbpg35-1: master-postgresql-10-main=1001
 d-gp2-dbpg35-2: master-postgresql-10-main=1000
 d-gp2-dbpg35-3: master-postgresql-10-main=990
------

Here is a failed fencing test on the same cluster:

------
root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
Command failed: No route to host
------

For comparison, here is the output of `pcs config` on another cluster where 
the stonith_admin commands work:

------
Cluster Name: d-gp2-dbpg64
Corosync Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2
Pacemaker Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.165.40 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg64-1;d-gp2-dbpg64-2 RESETPOWERON=1
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg64
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 last-lrm-refresh: 1527114792
 no-quorum-policy: ignore
 stonith-enabled: true
Node Attributes:
 d-gp2-dbpg64-1: master-postgresql-10-main=1001
 d-gp2-dbpg64-2: master-postgresql-10-main=1000
------
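
Comparing two long `pcs config` dumps by eye is error-prone, so one option is 
to diff the "Cluster Properties" sections mechanically.  The following is only 
a minimal Python sketch: the helper names (cluster_properties, 
diff_properties) are hypothetical, and it assumes the `pcs config` output has 
the layout shown above, with indented "key: value" lines under an unindented 
section header.  The sample strings are copied from the two outputs above.

```python
def cluster_properties(pcs_config_text):
    """Return the 'Cluster Properties' section of `pcs config` output as a dict."""
    props = {}
    in_section = False
    for line in pcs_config_text.splitlines():
        if line.startswith("Cluster Properties:"):
            in_section = True
            continue
        if in_section:
            # Entries are indented; the next unindented (or empty) line ends the section.
            if not line.startswith(" "):
                break
            key, _, value = line.strip().partition(": ")
            props[key] = value
    return props


def diff_properties(a, b):
    """Map each property that differs to its (a, b) values; None means absent."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Cluster Properties as posted for the failing cluster (d-gp2-dbpg35).
FAILING = """\
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg35
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: false
Node Attributes:
"""

# Cluster Properties as posted for the working cluster (d-gp2-dbpg64).
WORKING = """\
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg64
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 last-lrm-refresh: 1527114792
 no-quorum-policy: ignore
 stonith-enabled: true
Node Attributes:
"""

differences = diff_properties(cluster_properties(FAILING), cluster_properties(WORKING))
for name, (failing_value, working_value) in differences.items():
    print(f"{name}: failing={failing_value} working={working_value}")
```

On the two outputs above, the properties that differ are cluster-name, 
last-lrm-refresh, no-quorum-policy and stonith-enabled.  Whether any of these 
is related to the "No route to host" error is not established here; the sketch 
only narrows down what to look at.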

I have also verified that the username and password saved in 
/etc/pacemaker/vicredentials.xml are identical, and that the version of the 
vSphere CLI is the same on both clusters.  I don't know how to run a vCLI 
command directly to rule out a problem with that package, but I hope there is 
some way to find out what command stonith_admin is in turn trying to execute, 
so that I can debug further.

Thank you in advance for any help,
-- 
Casey
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
