I have a number of clusters in a VMware ESX environment, all of which were set
up following the same steps (unless I did something wrong on some of them
without realizing it).
The issue I am facing is that on some of the clusters, after adding the STONITH
resource, testing with `stonith_admin -F <node_hostname>` fails with the
error "Command failed: No route to host". Running it with --verbose produces no
additional output.
The stonith plugin I am using is external/vcenter, which in turn uses the
vSphere CLI package. I'm not certain what command it might be trying to run,
or how to debug this further. It doesn't appear to be an ESX issue, as the
same command works fine on other clusters.
Here is the output of `pcs config`:
------
Cluster Name: d-gp2-dbpg35
Corosync Nodes:
d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
Pacemaker Nodes:
d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
Resources:
Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.124.167.158 cidr_netmask=22
Operations: start interval=0s timeout=20s
(postgresql-master-vip-start-interval-0s)
stop interval=0s timeout=20s
(postgresql-master-vip-stop-interval-0s)
monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
Master: postgresql-ha
Meta Attrs: notify=true
Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
Attributes: bindir=/usr/lib/postgresql/10/bin
pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432
recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c
config_file=/etc/postgresql/10/main/postgresql.conf"
Operations: start interval=0s timeout=60s
(postgresql-10-main-start-interval-0s)
stop interval=0s timeout=60s
(postgresql-10-main-stop-interval-0s)
promote interval=0s timeout=30s
(postgresql-10-main-promote-interval-0s)
demote interval=0s timeout=120s
(postgresql-10-main-demote-interval-0s)
monitor interval=15s role=Master timeout=10s
(postgresql-10-main-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s
(postgresql-10-main-monitor-interval-16s)
notify interval=0s timeout=60s
(postgresql-10-main-notify-interval-0s)
Stonith Devices:
Resource: vfencing (class=stonith type=external/vcenter)
Attributes: VI_SERVER=vcenter.imovetv.com
VI_CREDSTORE=/etc/pacemaker/vicredentials.xml
HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
promote postgresql-ha then start postgresql-master-vip (kind:Mandatory)
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory)
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started)
(with-rsc-role:Master)
(id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
Resources Defaults:
migration-threshold: 5
resource-stickiness: 10
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: d-gp2-dbpg35
dc-version: 1.1.14-70404b0
have-watchdog: false
stonith-enabled: false
Node Attributes:
d-gp2-dbpg35-1: master-postgresql-10-main=1001
d-gp2-dbpg35-2: master-postgresql-10-main=1000
d-gp2-dbpg35-3: master-postgresql-10-main=990
------
Here is a failed fencing test on the same cluster:
------
root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
Command failed: No route to host
------
For comparison, here is the output of `pcs config` on another cluster
where the stonith_admin commands work:
------
Cluster Name: d-gp2-dbpg64
Corosync Nodes:
d-gp2-dbpg64-1 d-gp2-dbpg64-2
Pacemaker Nodes:
d-gp2-dbpg64-1 d-gp2-dbpg64-2
Resources:
Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.124.165.40 cidr_netmask=22
Operations: start interval=0s timeout=20s
(postgresql-master-vip-start-interval-0s)
stop interval=0s timeout=20s
(postgresql-master-vip-stop-interval-0s)
monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
Master: postgresql-ha
Meta Attrs: notify=true
Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
Attributes: bindir=/usr/lib/postgresql/10/bin
pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432
recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c
config_file=/etc/postgresql/10/main/postgresql.conf"
Operations: start interval=0s timeout=60s
(postgresql-10-main-start-interval-0s)
stop interval=0s timeout=60s
(postgresql-10-main-stop-interval-0s)
promote interval=0s timeout=30s
(postgresql-10-main-promote-interval-0s)
demote interval=0s timeout=120s
(postgresql-10-main-demote-interval-0s)
monitor interval=15s role=Master timeout=10s
(postgresql-10-main-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s
(postgresql-10-main-monitor-interval-16s)
notify interval=0s timeout=60s
(postgresql-10-main-notify-interval-0s)
Stonith Devices:
Resource: vfencing (class=stonith type=external/vcenter)
Attributes: VI_SERVER=vcenter.imovetv.com
VI_CREDSTORE=/etc/pacemaker/vicredentials.xml
HOSTLIST=d-gp2-dbpg64-1;d-gp2-dbpg64-2 RESETPOWERON=1
Fencing Levels:
Location Constraints:
Ordering Constraints:
promote postgresql-ha then start postgresql-master-vip (kind:Mandatory)
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory)
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started)
(with-rsc-role:Master)
(id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
Resources Defaults:
migration-threshold: 5
resource-stickiness: 10
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: d-gp2-dbpg64
dc-version: 1.1.14-70404b0
have-watchdog: false
last-lrm-refresh: 1527114792
no-quorum-policy: ignore
stonith-enabled: true
Node Attributes:
d-gp2-dbpg64-1: master-postgresql-10-main=1001
d-gp2-dbpg64-2: master-postgresql-10-main=1000
------
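The two dumps are long enough that differences are easy to miss by eye. A
minimal sketch for comparing just the "Cluster Properties" sections, assuming
each `pcs config` output has been saved to a file; the function names and file
names here are hypothetical and not part of pcs:

```shell
#!/bin/sh
# Print the "Cluster Properties:" block of a saved `pcs config` dump,
# one "name: value" per line, sorted. Assumes the block ends at the
# "Node Attributes:" header, as in the dumps above. cluster-name is
# dropped because it always differs between clusters.
cluster_props() {
    awk '
        /^Cluster Properties:/ { in_props = 1; next }
        /^Node Attributes:/    { in_props = 0 }
        in_props && NF         { sub(/^[ \t]+/, ""); print }
    ' "$1" | grep -v '^cluster-name:' | sort
}

# Show the properties that differ between two saved dumps.
diff_props() {
    a=$(mktemp)
    b=$(mktemp)
    cluster_props "$1" > "$a"
    cluster_props "$2" > "$b"
    # diff exits non-zero when the files differ; ignore that so the
    # function is safe to call under `set -e`.
    diff "$a" "$b" || true
    rm -f "$a" "$b"
}
```

For example, `diff_props dbpg35.txt dbpg64.txt`. For the two dumps above, the
properties that differ are stonith-enabled (false on d-gp2-dbpg35, true on
d-gp2-dbpg64), plus last-lrm-refresh and no-quorum-policy, which are set only
on d-gp2-dbpg64.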
I have also verified that the username and password saved in
/etc/pacemaker/vicredentials.xml are identical, and that the version of the
vSphere CLI is the same on both clusters. I don't know how to run a vCLI
command directly to rule out a problem with that package, but I hope there is
some way to find out what the stonith_admin command is in turn trying to
execute, so that I can debug further.
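One possible way to exercise the plugin outside of Pacemaker is sketched
below. It assumes the cluster-glue `stonith` command-line utility is installed
alongside the external/vcenter plugin; the wrapper function name is
hypothetical, and the parameters are the ones from the vfencing resource above:

```shell
#!/bin/sh
# Ask the external/vcenter plugin directly for its host list (-l) and
# device status (-S), bypassing Pacemaker entirely.
# Arguments: VI_SERVER, VI_CREDSTORE, HOSTLIST.
# HOSTLIST contains ';' and must be quoted by the caller.
vcenter_status() {
    stonith -t external/vcenter \
        VI_SERVER="$1" VI_CREDSTORE="$2" HOSTLIST="$3" RESETPOWERON=1 \
        -lS
}
```

Usage on the failing cluster would look like
`vcenter_status vcenter.imovetv.com /etc/pacemaker/vicredentials.xml 'd-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3'`.
It may also be worth checking whether the device is registered with the fencer
at all: `stonith_admin -L` lists registered devices, and
`stonith_admin -l <node_hostname>` lists the devices able to fence a given node.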
Thank you in advance for any help,
--
Casey