I was able to get this sorted out thanks to Ken's help on IRC. For some reason, stonith_admin -L did not list the device I'd added until I set stonith-enabled=true, even though on other clusters this was not necessary. My process had been to ensure that stonith_admin could successfully fence/reboot a node in the cluster before enabling fencing in the Pacemaker config. I'm still not sure why the device sometimes registered and sometimes didn't, but it seems that enabling stonith always registers it.
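For the archives, here is roughly the sequence that ended up working for me (just a sketch based on this cluster; the device and node names are the ones from the config quoted below, and exact pcs syntax may vary between versions):

------
# The vfencing device is already configured (see "Stonith Devices" in pcs config)

# Enable fencing first -- on my clusters the device did not show up in the
# fencer until this property was set
pcs property set stonith-enabled=true

# Confirm the device is now registered with the fencer
stonith_admin -L

# Then test that fencing actually works, e.g. reboot the third node
stonith_admin -B d-gp2-dbpg35-3
------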
> On 2018-07-11, at 12:56 PM, Casey & Gina <[email protected]> wrote:
> 
> I have a number of clusters in a vmWare ESX environment which have all been set up following the same steps, unless somehow I did something wrong on some without realizing it.
> 
> The issue I am facing is that on some of the clusters, after adding the STONITH resource, testing with `stonith_admin -F <node_hostname>` is failing with the error "Command failed: No route to host". Executing it with --verbose adds no additional output.
> 
> The stonith plugin I am using is external/vcenter, which in turn utilizes the vSphere CLI package. I'm not certain what command it might be trying to run, or how to debug this further... It's not an ESX issue, as meanwhile testing this same command on other clusters works fine.
> 
> Here is the output of `pcs config`:
> 
> ------
> Cluster Name: d-gp2-dbpg35
> Corosync Nodes:
>  d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> Pacemaker Nodes:
>  d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> 
> Resources:
>  Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.124.167.158 cidr_netmask=22
>   Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
>               stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
>               monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
>  Master: postgresql-ha
>   Meta Attrs: notify=true
>   Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
>    Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
>                stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
>                promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
>                demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
>                monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
>                notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)
> 
> Stonith Devices:
>  Resource: vfencing (class=stonith type=external/vcenter)
>   Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
>   Operations: monitor interval=60s (vfencing-monitor-60s)
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>   promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
>   demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
> Colocation Constraints:
>   postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
> 
> Resources Defaults:
>  migration-threshold: 5
>  resource-stickiness: 10
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: d-gp2-dbpg35
>  dc-version: 1.1.14-70404b0
>  have-watchdog: false
>  stonith-enabled: false
> Node Attributes:
>  d-gp2-dbpg35-1: master-postgresql-10-main=1001
>  d-gp2-dbpg35-2: master-postgresql-10-main=1000
>  d-gp2-dbpg35-3: master-postgresql-10-main=990
> ------
> 
> Here is a failure of fence testing on the same cluster:
> 
> ------
> root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
> Command failed: No route to host
> ------
> 
> For comparison sake, here is the output of `pcs config` on another cluster where the stonith_admin commands work:
> 
> ------
> Cluster Name: d-gp2-dbpg64
> Corosync Nodes:
>  d-gp2-dbpg64-1 d-gp2-dbpg64-2
> Pacemaker Nodes:
>  d-gp2-dbpg64-1 d-gp2-dbpg64-2
> 
> Resources:
>  Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.124.165.40 cidr_netmask=22
>   Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
>               stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
>               monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
>  Master: postgresql-ha
>   Meta Attrs: notify=true
>   Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
>    Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
>                stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
>                promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
>                demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
>                monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
>                notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)
> 
> Stonith Devices:
>  Resource: vfencing (class=stonith type=external/vcenter)
>   Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg64-1;d-gp2-dbpg64-2 RESETPOWERON=1
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>   promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
>   demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
> Colocation Constraints:
>   postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
> 
> Resources Defaults:
>  migration-threshold: 5
>  resource-stickiness: 10
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: d-gp2-dbpg64
>  dc-version: 1.1.14-70404b0
>  have-watchdog: false
>  last-lrm-refresh: 1527114792
>  no-quorum-policy: ignore
>  stonith-enabled: true
> Node Attributes:
>  d-gp2-dbpg64-1: master-postgresql-10-main=1001
>  d-gp2-dbpg64-2: master-postgresql-10-main=1000
> ------
> 
> I have also verified that the username and password saved in /etc/pacemaker/vicredentials.xml file is identical, and the version of the vSphere CLI is identical between clusters. I don't know how to test a vCLI command directly to rule out something related to that package, but hope that there is some way I can figure out what the stonith_admin command is in turn trying to execute to debug further.
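
(Following up inline on the debugging question above, since this is where I got stuck: as far as I understand it, "No route to host" here is not a network error at all, but the generic failure stonith_admin reports when the fencer has no registered device capable of fencing the target. Before digging into the external/vcenter plugin or the vSphere CLI, it is worth asking the fencer directly what it knows about, e.g.:

------
# List all devices currently registered with the fencer
stonith_admin -L

# List the devices the fencer believes can fence a particular node
stonith_admin -l d-gp2-dbpg35-3 -V
------

On the problem cluster, stonith_admin -L came back empty until stonith-enabled was set to true, even though the vfencing resource was configured.)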
> 
> Thank you in advance for any help,
> -- 
> Casey

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
