Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 04:31:31PM -0600, Casey & Gina wrote:
> Forgive me for interjecting, but how did you upgrade on Ubuntu?  I'm
> frustrated with limitations in 1.1.14 (particularly in PCS so not sure
> if it's relevant), and Ubuntu is ignoring my bug reports, so it would
> be great to upgrade if possible.  I'm using Ubuntu 16.04.

pcs is a single package written in Python and Ruby, so it should be possible to
try a newer version and see if it helps.
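
For example, a rough sketch of pulling a newer pcs from the upstream source
(the build and install steps are assumptions -- check the project's README for
the release you pick):

# fetch the upstream source tree
git clone https://github.com/ClusterLabs/pcs.git
cd pcs
git tag                     # look for a 0.9.x release newer than the 16.04 one
git checkout <newer-tag>    # placeholder: substitute the tag you chose
less README.md              # the exact build/install procedure is described here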

-- 
Valentin
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Casey & Gina
> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync 
> from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario 
> on Ubuntu 16.04.

Forgive me for interjecting, but how did you upgrade on Ubuntu?  I'm frustrated 
with limitations in 1.1.14 (particularly in PCS so not sure if it's relevant), 
and Ubuntu is ignoring my bug reports, so it would be great to upgrade if 
possible.  I'm using Ubuntu 16.04.

Best wishes,
-- 
Casey
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Need help debugging a STONITH resource

2018-07-11 Thread Casey & Gina
I was able to get this sorted out thanks to Ken's help on IRC.  For some 
reason, stonith_admin -L did not list the device I'd added until I set 
stonith-enabled=true, even though on other clusters this was not necessary.  My 
process was to ensure that stonith_admin could successfully fence/reboot a node 
in the cluster before enabling fencing in the pacemaker config.  I'm not sure 
why it sometimes registered and sometimes didn't, but it seems that 
enabling stonith always registers it.
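
Roughly, the order that ended up working looked like this (the device and node
names are the ones from the config quoted below; see stonith_admin(8) for the
flags):

pcs property set stonith-enabled=true    # the device only showed up in -L after this
stonith_admin -L                         # should now list vfencing
stonith_admin -B d-gp2-dbpg35-3          # test a reboot of one node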

> On 2018-07-11, at 12:56 PM, Casey & Gina  wrote:
> 
> I have a number of clusters in a vmWare ESX environment which have all been 
> set up following the same steps, unless somehow I did something wrong on some 
> without realizing it.
> 
> The issue I am facing is that on some of the clusters, after adding the 
> STONITH resource, testing with `stonith_admin -F ` is failing 
> with the error "Command failed: No route to host".  Executing it with 
> --verbose adds no additional output.
> 
> The stonith plugin I am using is external/vcenter, which in turn utilizes the 
> vSphere CLI package.  I'm not certain what command it might be trying to run, 
> or how to debug this further...  It's not an ESX issue, as meanwhile testing 
> this same command on other clusters works fine.
> 
> Here is the output of `pcs config`:
> 
> --
> Cluster Name: d-gp2-dbpg35
> Corosync Nodes:
> d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> Pacemaker Nodes:
> d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> 
> Resources:
> Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>  Attributes: ip=10.124.167.158 cidr_netmask=22
>  Operations: start interval=0s timeout=20s 
> (postgresql-master-vip-start-interval-0s)
>  stop interval=0s timeout=20s 
> (postgresql-master-vip-stop-interval-0s)
>  monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
> Master: postgresql-ha
>  Meta Attrs: notify=true 
>  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
>   Attributes: bindir=/usr/lib/postgresql/10/bin 
> pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 
> recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c 
> config_file=/etc/postgresql/10/main/postgresql.conf"
>   Operations: start interval=0s timeout=60s 
> (postgresql-10-main-start-interval-0s)
>   stop interval=0s timeout=60s 
> (postgresql-10-main-stop-interval-0s)
>   promote interval=0s timeout=30s 
> (postgresql-10-main-promote-interval-0s)
>   demote interval=0s timeout=120s 
> (postgresql-10-main-demote-interval-0s)
>   monitor interval=15s role=Master timeout=10s 
> (postgresql-10-main-monitor-interval-15s)
>   monitor interval=16s role=Slave timeout=10s 
> (postgresql-10-main-monitor-interval-16s)
>   notify interval=0s timeout=60s 
> (postgresql-10-main-notify-interval-0s)
> 
> Stonith Devices:
> Resource: vfencing (class=stonith type=external/vcenter)
>  Attributes: VI_SERVER=vcenter.imovetv.com 
> VI_CREDSTORE=/etc/pacemaker/vicredentials.xml 
> HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
>  Operations: monitor interval=60s (vfencing-monitor-60s)
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) 
> (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
>  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) 
> (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
> Colocation Constraints:
>  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) 
> (with-rsc-role:Master) 
> (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
> 
> Resources Defaults:
> migration-threshold: 5
> resource-stickiness: 10
> Operations Defaults:
> No defaults set
> 
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: d-gp2-dbpg35
> dc-version: 1.1.14-70404b0
> have-watchdog: false
> stonith-enabled: false
> Node Attributes:
> d-gp2-dbpg35-1: master-postgresql-10-main=1001
> d-gp2-dbpg35-2: master-postgresql-10-main=1000
> d-gp2-dbpg35-3: master-postgresql-10-main=990
> --
> 
> Here is a failure of fence testing on the same cluster:
> 
> --
> root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
> Command failed: No route to host
> --
> 
> For comparison sake, here is the output of `pcs config` on another cluster 
> where the stonith_admin commands work:
> 
> --
> Cluster Name: d-gp2-dbpg64
> Corosync Nodes:
> d-gp2-dbpg64-1 d-gp2-dbpg64-2
> Pacemaker Nodes:
> d-gp2-dbpg64-1 d-gp2-dbpg64-2
> 
> Resources:
> Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>  Attributes: ip=10.124.165.40 cidr_netmask=22
>  Operations: start interval=0s timeout=20s 
> (postgresql-master-vip-start-interval-0s)
>  stop interval=0s timeout=20s 
> (po

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Sorry, replied too soon. 
Since removing the update-rc.d command worked, I assume the build process creates the 
services.
The only problem is that enabling them with systemctl does not work, because it 
relies on the update-rc.d command, which works only if the LSB header contains at 
least one run level.

For the moment the only fix I see is to manipulate these init.d scripts 
myself, hoping they will be fixed in pacemaker/corosync.
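
A minimal sketch of that manual fix as a script, assuming the stock headers
only need the two Default-* lines filled in (back up the scripts first):

sed -i -e 's/^# Default-Start:.*/# Default-Start:     2 3 4 5/' \
       -e 's/^# Default-Stop:.*/# Default-Stop:      0 1 6/' \
       /etc/init.d/corosync /etc/init.d/pacemaker
update-rc.d corosync defaults
update-rc.d pacemaker defaults 80 80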

> On 11 Jul 2018, at 23:18, Salvatore D'angelo  wrote:
> 
> Hi,
> 
> I solved the issue (I am not sure to be honest) simply removing the 
> update-rc.d command.
> I noticed I can start the corosync and pacemaker services with:
> 
> service corosync start
> service pacemaker start
> 
> I am not sure if they have been enabled at book (on Docker is not easy to 
> test).
> I do not know if pacemaker build creates automatically these services and 
> then it is required extra work to make them available at book.
> 
>> On 11 Jul 2018, at 21:07, Andrei Borzenkov wrote:
>> 
>> 11.07.2018 21:01, Salvatore D'angelo пишет:
>>> Yes, but doing what you suggested the system find that sysV is installed 
>>> and try to leverage on update-rc.d scripts and the failure occurs:
>> 
>> Then you built corosync without systemd integration. systemd will prefer
>> native units.
> 
> How can I build them with system integration?
> 
>> 
>>> 
>>> root@pg1:~# systemctl enable corosync
>>> corosync.service is not a native service, redirecting to 
>>> systemd-sysv-install
>>> Executing /lib/systemd/systemd-sysv-install enable corosync
>>> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
>>> 
>>> the only fix I found was to manipulate manually the header of 
>>> /etc/init.d/corosync adding the rows mentioned below.
>>> But this is not a clean approach to solve the issue.
>>> 
>>> What pacemaker suggest for newer distributions?
>>> 
>>> If you look at corosync code the init/corosync file does not container run 
>>> levels in header.
>>> So I suspect it is a code problem. Am I wrong?
>>> 
>> 
>> Probably not. Description of special comments in LSB standard imply that
>> they must contain at least one value. Also how should service manager
>> know for which run level to enable service without it? It is amusing
>> that this problem was first found on a distribution that does not even
>> use SysV for years …
> 
> What do you suggest?
> 
>> 
>> 
>> 
 On 11 Jul 2018, at 19:29, Ken Gaillot wrote:
 
 On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
> Hi,
> 
> Yes that was clear to me, but question is pacemaker install
> /etc/init.d/pacemaker script but its header is not compatible with
> newer system that uses LSB.
> So if pacemaker creates scripts in /etc/init.d it should create them
> so that they are compatible with OS supported (not sure if Ubuntu is
> one).
> when I run “make install” anything is created for systemd env.
 
 With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
 update-rc.d.
 
 The pacemaker configure script should have detected that the OS uses
 systemd and installed the appropriate unit file.
 
> I am not a SysV vs System expert, hoping I haven’t said anything
> wrong.
> 
> >> On 11 Jul 2018, at 18:40, Andrei Borzenkov wrote:
>> 
>> 11.07.2018 18:08, Salvatore D'angelo пишет:
>>> Hi All,
>>> 
>>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
>>> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
>>> repeat the same scenario on Ubuntu 16.04.
>> 
>> 16.04 is using systemd, you need to create systemd unit. I do not
>> know
>> if there is any compatibility layer to interpret upstart
>> configuration
>> like the one for sysvinit.
>> 
>>> As my previous scenario I am using Docker for test purpose before
>>> move to Bare metal.
>>> The scenario worked properly after I downloaded the correct
>>> dependencies versions.
>>> 
>>> The only problem I experienced is that in my procedure install I
>>> set corosync and pacemaker to run at startup updating the init.d
>>> scripts with this commands:
>>> 
>>> update-rc.d corosync defaults
>>> update-rc.d pacemaker defaults 80 80
>>> 
>>> I noticed that links in /etc/rc are not created.
>>> 
>>> I have also the following errors on second update-rc.d command:
>>> insserv: Service corosync has to be enabled to start service
>>> pacemaker
>>> insserv: exiting now!
>>> 
>>> I was able to solve the issue manually replacing these lines in
>>> /etc/init.d/corosync and /etc/init.d/pacemaker:
>>> # Default-Start:
>>> # Default-Stop:
>>> 
>>> with this:
>>> # Default-Start:2 3 4 5
>>> # Default-Stop: 0 1 6
>>> 
>>> I didn

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi,

I solved the issue (I am not sure, to be honest) simply by removing the update-rc.d 
command.
I noticed I can start the corosync and pacemaker services with:

service corosync start
service pacemaker start

I am not sure if they have been enabled at boot (on Docker this is not easy to test).
I do not know if the pacemaker build creates these services automatically or 
whether extra work is required to make them available at boot.

> On 11 Jul 2018, at 21:07, Andrei Borzenkov  wrote:
> 
> 11.07.2018 21:01, Salvatore D'angelo пишет:
>> Yes, but doing what you suggested the system find that sysV is installed and 
>> try to leverage on update-rc.d scripts and the failure occurs:
> 
> Then you built corosync without systemd integration. systemd will prefer
> native units.

How can I build them with systemd integration?
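
One way to check, assuming a typical autotools build of the source versions
mentioned above (the exact configure flag names are an assumption -- take them
from the --help output of each release):

apt-get install -y pkg-config libsystemd-dev   # lets configure detect systemd
cd corosync-2.4.4
./configure --help | grep -i systemd           # look for a systemd-related switch and rebuild with it
cd ../pacemaker-1.1.18
./configure --help | grep -i systemd           # pacemaker normally auto-detects systemd via pkg-config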

> 
>> 
>> root@pg1:~# systemctl enable corosync
>> corosync.service is not a native service, redirecting to systemd-sysv-install
>> Executing /lib/systemd/systemd-sysv-install enable corosync
>> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
>> 
>> the only fix I found was to manipulate manually the header of 
>> /etc/init.d/corosync adding the rows mentioned below.
>> But this is not a clean approach to solve the issue.
>> 
>> What pacemaker suggest for newer distributions?
>> 
>> If you look at corosync code the init/corosync file does not container run 
>> levels in header.
>> So I suspect it is a code problem. Am I wrong?
>> 
> 
> Probably not. Description of special comments in LSB standard imply that
> they must contain at least one value. Also how should service manager
> know for which run level to enable service without it? It is amusing
> that this problem was first found on a distribution that does not even
> use SysV for years …

What do you suggest?

> 
> 
> 
>>> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
>>> 
>>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
 Hi,
 
 Yes that was clear to me, but question is pacemaker install
 /etc/init.d/pacemaker script but its header is not compatible with
 newer system that uses LSB.
 So if pacemaker creates scripts in /etc/init.d it should create them
 so that they are compatible with OS supported (not sure if Ubuntu is
 one).
 when I run “make install” anything is created for systemd env.
>>> 
>>> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
>>> update-rc.d.
>>> 
>>> The pacemaker configure script should have detected that the OS uses
>>> systemd and installed the appropriate unit file.
>>> 
 I am not a SysV vs System expert, hoping I haven’t said anything
 wrong.
 
> On 11 Jul 2018, at 18:40, Andrei Borzenkov 
> wrote:
> 
> 11.07.2018 18:08, Salvatore D'angelo пишет:
>> Hi All,
>> 
>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
>> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
>> repeat the same scenario on Ubuntu 16.04.
> 
> 16.04 is using systemd, you need to create systemd unit. I do not
> know
> if there is any compatibility layer to interpret upstart
> configuration
> like the one for sysvinit.
> 
>> As my previous scenario I am using Docker for test purpose before
>> move to Bare metal.
>> The scenario worked properly after I downloaded the correct
>> dependencies versions.
>> 
>> The only problem I experienced is that in my procedure install I
>> set corosync and pacemaker to run at startup updating the init.d
>> scripts with this commands:
>> 
>> update-rc.d corosync defaults
>> update-rc.d pacemaker defaults 80 80
>> 
>> I noticed that links in /etc/rc are not created.
>> 
>> I have also the following errors on second update-rc.d command:
>> insserv: Service corosync has to be enabled to start service
>> pacemaker
>> insserv: exiting now!
>> 
>> I was able to solve the issue manually replacing these lines in
>> /etc/init.d/corosync and /etc/init.d/pacemaker:
>> # Default-Start:
>> # Default-Stop:
>> 
>> with this:
>> # Default-Start:2 3 4 5
>> # Default-Stop: 0 1 6
>> 
>> I didn’t understand if this is a bug of corosync or pacemaker or
>> simply there is a dependency missing on Ubuntu 16.04 that was
>> installed by default on 14.04. I found other discussion on this
>> forum about this problem but it’s not clear the solution.
>> Thanks in advance for support.
>> 
>> 
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scra
>> tch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 
> 
> ___

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Ken Gaillot
On Wed, 2018-07-11 at 22:07 +0300, Andrei Borzenkov wrote:
> 11.07.2018 21:01, Salvatore D'angelo пишет:
> > Yes, but doing what you suggested the system find that sysV is
> > installed and try to leverage on update-rc.d scripts and the
> > failure occurs:
> 
> Then you built corosync without systemd integration. systemd will
> prefer
> native units.
> 
> > 
> > root@pg1:~# systemctl enable corosync
> > corosync.service is not a native service, redirecting to systemd-
> > sysv-install
> > Executing /lib/systemd/systemd-sysv-install enable corosync
> > update-rc.d: error: corosync Default-Start contains no runlevels,
> > aborting.
> > 
> > the only fix I found was to manipulate manually the header of
> > /etc/init.d/corosync adding the rows mentioned below.
> > But this is not a clean approach to solve the issue.
> > 
> > What pacemaker suggest for newer distributions?

Red Hat (Fedora/RHEL/CentOS) and SuSE provide enterprise support for
pacemaker, and regularly contribute code upstream, so those and their
derivatives tend to be the most stable. Debian has a good team of
volunteers that greatly improved support in the current (stretch)
release, so I suppose Ubuntu will eventually pick that up. I know
people have compiled on Arch Linux, FreeBSD, OpenBSD, and likely
others, but that usually takes some extra work.

> > 
> > If you look at corosync code the init/corosync file does not
> > container run levels in header.
> > So I suspect it is a code problem. Am I wrong?
> > 
> 
> Probably not. Description of special comments in LSB standard imply
> that
> they must contain at least one value. Also how should service manager
> know for which run level to enable service without it? It is amusing
> that this problem was first found on a distribution that does not
> even
> use SysV for years ...

I'm not sure. Pacemaker packages are intended to be installed without
enabling at boot, since so much configuration must be done first. So
maybe the idea was to always require someone to specify run levels. But
it does make more sense that they would be listed in the LSB header.
One reason it wouldn't have been an issue before is some older distros
use the init script's chkconfig header instead of the LSB header.
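
For anyone comparing the two conventions, a minimal illustration (the numbers
are placeholders, not what any particular package ships):

# chkconfig-style header, read by older Red Hat-style init tools:
# chkconfig: 2345 90 10
# description: Corosync cluster engine

# LSB-style header, read by insserv/update-rc.d:
### BEGIN INIT INFO
# Provides:          corosync
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
### END INIT INFO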


> > > On 11 Jul 2018, at 19:29, Ken Gaillot 
> > > wrote:
> > > 
> > > On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
> > > > Hi,
> > > > 
> > > > Yes that was clear to me, but question is pacemaker install
> > > > /etc/init.d/pacemaker script but its header is not compatible
> > > > with
> > > > newer system that uses LSB.
> > > > So if pacemaker creates scripts in /etc/init.d it should create
> > > > them
> > > > so that they are compatible with OS supported (not sure if
> > > > Ubuntu is
> > > > one).
> > > > when I run “make install” anything is created for systemd env.
> > > 
> > > With Ubuntu 16, you should use "systemctl enable pacemaker"
> > > instead of
> > > update-rc.d.
> > > 
> > > The pacemaker configure script should have detected that the OS
> > > uses
> > > systemd and installed the appropriate unit file.
> > > 
> > > > I am not a SysV vs System expert, hoping I haven’t said
> > > > anything
> > > > wrong.
> > > > 
> > > > > On 11 Jul 2018, at 18:40, Andrei Borzenkov wrote:
> > > > > 
> > > > > 11.07.2018 18:08, Salvatore D'angelo пишет:
> > > > > > Hi All,
> > > > > > 
> > > > > > After I successfully upgraded Pacemaker from 1.1.14 to
> > > > > > 1.1.18 and
> > > > > > corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying
> > > > > > to
> > > > > > repeat the same scenario on Ubuntu 16.04.
> > > > > 
> > > > > 16.04 is using systemd, you need to create systemd unit. I do
> > > > > not
> > > > > know
> > > > > if there is any compatibility layer to interpret upstart
> > > > > configuration
> > > > > like the one for sysvinit.
> > > > > 
> > > > > > As my previous scenario I am using Docker for test purpose
> > > > > > before
> > > > > > move to Bare metal.
> > > > > > The scenario worked properly after I downloaded the correct
> > > > > > dependencies versions.
> > > > > > 
> > > > > > The only problem I experienced is that in my procedure
> > > > > > install I
> > > > > > set corosync and pacemaker to run at startup updating the
> > > > > > init.d
> > > > > > scripts with this commands:
> > > > > > 
> > > > > > update-rc.d corosync defaults
> > > > > > update-rc.d pacemaker defaults 80 80
> > > > > > 
> > > > > > I noticed that links in /etc/rc are not created.
> > > > > > 
> > > > > > I have also the following errors on second update-rc.d
> > > > > > command:
> > > > > > insserv: Service corosync has to be enabled to start
> > > > > > service
> > > > > > pacemaker
> > > > > > insserv: exiting now!
> > > > > > 
> > > > > > I was able to solve the issue manually replacing these
> > > > > > lines in
> > > > > > /etc/init.d/corosync and /etc/init.d/pacemaker:
> > > > > > # Default-Start:
> > > > > > # Default-Stop:
> > > > > > 
> > > > > > with this:
> > > >

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 08:01:46PM +0200, Salvatore D'angelo wrote:
> Yes, but doing what you suggested the system find that sysV is
> installed and try to leverage on update-rc.d scripts and the failure
> occurs:
> 
> root@pg1:~# systemctl enable corosync
> corosync.service is not a native service, redirecting to systemd-sysv-install
> Executing /lib/systemd/systemd-sysv-install enable corosync
> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
> 
> the only fix I found was to manipulate manually the header of
> /etc/init.d/corosync adding the rows mentioned below.
> But this is not a clean approach to solve the issue.
> 
> What pacemaker suggest for newer distributions?

You can try using init scripts from the Debian/Ubuntu packages for
corosync and pacemaker as they have the runlevel info included.

Another option is to get the systemd service files working and then
remove the init scripts.
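
A sketch of that second option, assuming corosync.service and pacemaker.service
have already been put under /lib/systemd/system (e.g. by make install once
systemd support is built in, or lifted from the Debian packages):

systemctl daemon-reload
systemctl enable corosync pacemaker
systemctl start corosync pacemaker
# once the native units are confirmed working, the SysV scripts can go
rm /etc/init.d/corosync /etc/init.d/pacemaker
systemctl daemon-reload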

-- 
Valentin
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Andrei Borzenkov
11.07.2018 21:01, Salvatore D'angelo wrote:
> Yes, but doing what you suggested the system find that sysV is installed and 
> try to leverage on update-rc.d scripts and the failure occurs:

Then you built corosync without systemd integration. systemd will prefer
native units.

> 
> root@pg1:~# systemctl enable corosync
> corosync.service is not a native service, redirecting to systemd-sysv-install
> Executing /lib/systemd/systemd-sysv-install enable corosync
> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
> 
> the only fix I found was to manipulate manually the header of 
> /etc/init.d/corosync adding the rows mentioned below.
> But this is not a clean approach to solve the issue.
> 
> What pacemaker suggest for newer distributions?
> 
> If you look at corosync code the init/corosync file does not container run 
> levels in header.
> So I suspect it is a code problem. Am I wrong?
> 

Probably not. The description of these special comments in the LSB standard implies
that they must contain at least one value. Also, how should the service manager
know which run levels to enable the service for without it? It is amusing
that this problem was first found on a distribution that has not even
used SysV for years ...



>> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
>>
>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Yes that was clear to me, but question is pacemaker install
>>> /etc/init.d/pacemaker script but its header is not compatible with
>>> newer system that uses LSB.
>>> So if pacemaker creates scripts in /etc/init.d it should create them
>>> so that they are compatible with OS supported (not sure if Ubuntu is
>>> one).
>>> when I run “make install” anything is created for systemd env.
>>
>> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
>> update-rc.d.
>>
>> The pacemaker configure script should have detected that the OS uses
>> systemd and installed the appropriate unit file.
>>
>>> I am not a SysV vs System expert, hoping I haven’t said anything
>>> wrong.
>>>
 On 11 Jul 2018, at 18:40, Andrei Borzenkov 
 wrote:

 11.07.2018 18:08, Salvatore D'angelo пишет:
> Hi All,
>
> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
> repeat the same scenario on Ubuntu 16.04.

 16.04 is using systemd, you need to create systemd unit. I do not
 know
 if there is any compatibility layer to interpret upstart
 configuration
 like the one for sysvinit.

> As my previous scenario I am using Docker for test purpose before
> move to Bare metal.
> The scenario worked properly after I downloaded the correct
> dependencies versions.
>
> The only problem I experienced is that in my procedure install I
> set corosync and pacemaker to run at startup updating the init.d
> scripts with this commands:
>
> update-rc.d corosync defaults
> update-rc.d pacemaker defaults 80 80
>
> I noticed that links in /etc/rc are not created.
>
> I have also the following errors on second update-rc.d command:
> insserv: Service corosync has to be enabled to start service
> pacemaker
> insserv: exiting now!
>
> I was able to solve the issue manually replacing these lines in
> /etc/init.d/corosync and /etc/init.d/pacemaker:
> # Default-Start:
> # Default-Stop:
>
> with this:
> # Default-Start:2 3 4 5
> # Default-Stop: 0 1 6
>
> I didn’t understand if this is a bug of corosync or pacemaker or
> simply there is a dependency missing on Ubuntu 16.04 that was
> installed by default on 14.04. I found other discussion on this
> forum about this problem but it’s not clear the solution.
> Thanks in advance for support.
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scra
> tch.pdf
> Bugs: http://bugs.clusterlabs.org
>

 ___
 Users mailing list: Users@clusterlabs.org
 https://lists.clusterlabs.org/mailman/listinfo/users

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
 h.pdf
 Bugs: http://bugs.clusterlabs.org
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>>> pdf
>>> Bugs: http://bugs.clusterlabs.org
>> -- 
>> Ken Gaillot 
>> ___
>> User

[ClusterLabs] Need help debugging a STONITH resource

2018-07-11 Thread Casey & Gina
I have a number of clusters in a vmWare ESX environment which have all been set 
up following the same steps, unless somehow I did something wrong on some 
without realizing it.

The issue I am facing is that on some of the clusters, after adding the STONITH 
resource, testing with `stonith_admin -F ` is failing with the 
error "Command failed: No route to host".  Executing it with --verbose adds no 
additional output.

The stonith plugin I am using is external/vcenter, which in turn utilizes the 
vSphere CLI package.  I'm not certain what command it might be trying to run, 
or how to debug this further...  It's not an ESX issue, since testing 
this same command on other clusters works fine.
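
One way to probe the plugin directly would be to call it by hand, assuming the
usual cluster-glue layout where external plugins live under
/usr/lib/stonith/plugins/external/ and read their parameters from the
environment (that convention is an assumption -- adjust the path to your
install; the parameter values are taken from the pcs config below):

export VI_SERVER=vcenter.imovetv.com
export VI_CREDSTORE=/etc/pacemaker/vicredentials.xml
export HOSTLIST="d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3"
export RESETPOWERON=1
/usr/lib/stonith/plugins/external/vcenter gethosts   # should print the node names
/usr/lib/stonith/plugins/external/vcenter status     # checks connectivity to vCenter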

Here is the output of `pcs config`:

--
Cluster Name: d-gp2-dbpg35
Corosync Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
Pacemaker Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.167.158 cidr_netmask=22
  Operations: start interval=0s timeout=20s 
(postgresql-master-vip-start-interval-0s)
  stop interval=0s timeout=20s 
(postgresql-master-vip-stop-interval-0s)
  monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true 
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin 
pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 
recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c 
config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s 
(postgresql-10-main-start-interval-0s)
   stop interval=0s timeout=60s 
(postgresql-10-main-stop-interval-0s)
   promote interval=0s timeout=30s 
(postgresql-10-main-promote-interval-0s)
   demote interval=0s timeout=120s 
(postgresql-10-main-demote-interval-0s)
   monitor interval=15s role=Master timeout=10s 
(postgresql-10-main-monitor-interval-15s)
   monitor interval=16s role=Slave timeout=10s 
(postgresql-10-main-monitor-interval-16s)
   notify interval=0s timeout=60s 
(postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=vcenter.imovetv.com 
VI_CREDSTORE=/etc/pacemaker/vicredentials.xml 
HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) 
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) 
(non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) 
(with-rsc-role:Master) 
(id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg35
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: false
Node Attributes:
 d-gp2-dbpg35-1: master-postgresql-10-main=1001
 d-gp2-dbpg35-2: master-postgresql-10-main=1000
 d-gp2-dbpg35-3: master-postgresql-10-main=990
--

Here is a failure of fence testing on the same cluster:

--
root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
Command failed: No route to host
--

For comparison's sake, here is the output of `pcs config` on another cluster 
where the stonith_admin commands work:

--
Cluster Name: d-gp2-dbpg64
Corosync Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2
Pacemaker Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.165.40 cidr_netmask=22
  Operations: start interval=0s timeout=20s 
(postgresql-master-vip-start-interval-0s)
  stop interval=0s timeout=20s 
(postgresql-master-vip-stop-interval-0s)
  monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true 
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin 
pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 
recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c 
config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s 
(postgresql-10-main-start-interval-0s)
   stop interval=0s timeout=60s 
(postgresql-10-main-stop-interval-0s)
   promote interval=0s timeout=30s 
(postgresql-10-main-promote-in

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Yes, but doing what you suggested, the system finds that SysV is installed and 
falls back to the update-rc.d scripts, and the failure occurs:

root@pg1:~# systemctl enable corosync
corosync.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install enable corosync
update-rc.d: error: corosync Default-Start contains no runlevels, aborting.

The only fix I found was to manually manipulate the header of 
/etc/init.d/corosync, adding the rows mentioned below.
But this is not a clean approach to solve the issue.

What does pacemaker suggest for newer distributions?

If you look at the corosync code, the init/corosync file does not contain run 
levels in its header.
So I suspect it is a code problem. Am I wrong?

> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
> 
> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Yes that was clear to me, but question is pacemaker install
>> /etc/init.d/pacemaker script but its header is not compatible with
>> newer system that uses LSB.
>> So if pacemaker creates scripts in /etc/init.d it should create them
>> so that they are compatible with OS supported (not sure if Ubuntu is
>> one).
>> when I run “make install” anything is created for systemd env.
> 
> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
> update-rc.d.
> 
> The pacemaker configure script should have detected that the OS uses
> systemd and installed the appropriate unit file.
> 
>> I am not a SysV vs System expert, hoping I haven’t said anything
>> wrong.
>> 
>>> On 11 Jul 2018, at 18:40, Andrei Borzenkov 
>>> wrote:
>>> 
>>> 11.07.2018 18:08, Salvatore D'angelo пишет:
 Hi All,
 
 After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
 corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
 repeat the same scenario on Ubuntu 16.04.
>>> 
>>> 16.04 is using systemd, you need to create systemd unit. I do not
>>> know
>>> if there is any compatibility layer to interpret upstart
>>> configuration
>>> like the one for sysvinit.
>>> 
 As my previous scenario I am using Docker for test purpose before
 move to Bare metal.
 The scenario worked properly after I downloaded the correct
 dependencies versions.
 
 The only problem I experienced is that in my procedure install I
 set corosync and pacemaker to run at startup updating the init.d
 scripts with this commands:
 
 update-rc.d corosync defaults
 update-rc.d pacemaker defaults 80 80
 
 I noticed that links in /etc/rc are not created.
 
 I have also the following errors on second update-rc.d command:
 insserv: Service corosync has to be enabled to start service
 pacemaker
 insserv: exiting now!
 
 I was able to solve the issue manually replacing these lines in
 /etc/init.d/corosync and /etc/init.d/pacemaker:
 # Default-Start:
 # Default-Stop:
 
 with this:
 # Default-Start:2 3 4 5
 # Default-Stop: 0 1 6
 
 I didn’t understand if this is a bug of corosync or pacemaker or
 simply there is a dependency missing on Ubuntu 16.04 that was
 installed by default on 14.04. I found other discussion on this
 forum about this problem but it’s not clear the solution.
 Thanks in advance for support.
 
 
 
 
 ___
 Users mailing list: Users@clusterlabs.org
 https://lists.clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scra
 tch.pdf
 Bugs: http://bugs.clusterlabs.org
 
>>> 
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
>>> h.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>> pdf
>> Bugs: http://bugs.clusterlabs.org
> -- 
> Ken Gaillot 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Ken Gaillot
On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
> Hi,
> 
> Yes that was clear to me, but question is pacemaker install
> /etc/init.d/pacemaker script but its header is not compatible with
> newer system that uses LSB.
> So if pacemaker creates scripts in /etc/init.d it should create them
> so that they are compatible with OS supported (not sure if Ubuntu is
> one).
> when I run “make install” anything is created for systemd env.

With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
update-rc.d.

The pacemaker configure script should have detected that the OS uses
systemd and installed the appropriate unit file.
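
A quick way to check whether that happened on the upgraded box (unit paths
differ between distributions, so both common locations are tried here):

ls -l /lib/systemd/system/pacemaker.service /usr/lib/systemd/system/pacemaker.service 2>/dev/null
systemctl cat pacemaker    # shows the unit systemd would actually use, if any
# only run the enable once a native unit shows up above:
systemctl enable corosync pacemaker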

> I am not a SysV vs System expert, hoping I haven’t said anything
> wrong.
> 
> > On 11 Jul 2018, at 18:40, Andrei Borzenkov 
> > wrote:
> > 
> > 11.07.2018 18:08, Salvatore D'angelo пишет:
> > > Hi All,
> > > 
> > > After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
> > > corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
> > > repeat the same scenario on Ubuntu 16.04.
> > 
> > 16.04 is using systemd, you need to create systemd unit. I do not
> > know
> > if there is any compatibility layer to interpret upstart
> > configuration
> > like the one for sysvinit.
> > 
> > > As my previous scenario I am using Docker for test purpose before
> > > move to Bare metal.
> > > The scenario worked properly after I downloaded the correct
> > > dependencies versions.
> > > 
> > > The only problem I experienced is that in my procedure install I
> > > set corosync and pacemaker to run at startup updating the init.d
> > > scripts with this commands:
> > > 
> > > update-rc.d corosync defaults
> > > update-rc.d pacemaker defaults 80 80
> > > 
> > > I noticed that links in /etc/rc are not created.
> > > 
> > > I have also the following errors on second update-rc.d command:
> > > insserv: Service corosync has to be enabled to start service
> > > pacemaker
> > > insserv: exiting now!
> > > 
> > > I was able to solve the issue manually replacing these lines in
> > > /etc/init.d/corosync and /etc/init.d/pacemaker:
> > > # Default-Start:
> > > # Default-Stop:
> > > 
> > > with this:
> > > # Default-Start:    2 3 4 5
> > > # Default-Stop: 0 1 6
> > > 
> > > I didn’t understand if this is a bug of corosync or pacemaker or
> > > simply there is a dependency missing on Ubuntu 16.04 that was
> > > installed by default on 14.04. I found other discussion on this
> > > forum about this problem but it’s not clear the solution.
> > > Thanks in advance for support.
> > > 
> > > 
> > > 
> > > 
> > > ___
> > > Users mailing list: Users@clusterlabs.org
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > 
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scra
> > > tch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> > > 
> > 
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
> > h.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Thank you. It's clear now.

On Wed, 11 Jul 2018 at 7:18 PM, Andrei Borzenkov wrote:

> 11.07.2018 20:12, Salvatore D'angelo пишет:
> > Does this mean that if STONITH resource p_ston_pg1 even if it runs on
> node pg2 if pacemaker send a signal to it pg1 is powered of and not pg2.
> > Am I correct?
>
> Yes. Resource will be used to power off whatever hosts are listed in its
> pcmk_host_list. It is totally unrelated to where it is active currently.
>
> >
> >> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> >>
> >> 11.07.2018 19:44, Salvatore D'angelo пишет:
> >>> Hi all,
> >>>
> >>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are
> not correctly located:
> >>
> >> Actual location of stonith resources does not really matter in up to
> >> date pacemaker. It only determines where resource will be monitored;
> >> resource will be used by whatever node will be selected to perform
> stonith.
> >>
> >> The only requirement is that stonith resource is not prohibited from
> >> running on node by constraints.
> >>
> >>> p_ston_pg1  (stonith:external/ipmi):Started pg2
> >>> p_ston_pg2  (stonith:external/ipmi):Started pg1
> >>> p_ston_pg3  (stonith:external/ipmi):Started pg1
> >>>
> >>> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3
> (10.0.0.13). I expected p_ston_pg3 was running on pg3, but I see it on pg1.
> >>>
> >>> Here my configuration:
> >>> primitive p_ston_pg1 stonith:external/ipmi \\
> >>> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg2 stonith:external/ipmi \\
> >>> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg3 stonith:external/ipmi \\
> >>> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>>
> >>> location l_ston_pg1 p_ston_pg1 -inf: pg1
> >>> location l_ston_pg2 p_ston_pg2 -inf: pg2
> >>> location l_ston_pg3 p_ston_pg3 -inf: pg3
> >>>
> >>> this seems work fine on bare metal.
> >>> Any suggestion what could be root cause?
> >>>
> >>
> >> Root cause of what? Locations match your constraints.
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Andrei Borzenkov
11.07.2018 20:12, Salvatore D'angelo wrote:
> Does this mean that if STONITH resource p_ston_pg1 even if it runs on node 
> pg2 if pacemaker send a signal to it pg1 is powered of and not pg2.
> Am I correct?

Yes. Resource will be used to power off whatever hosts are listed in its
pcmk_host_list. It is totally unrelated to where it is active currently.
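
In other words, a fencing request is routed by the host list, not by where the
device resource happens to be running. Using the node names from the original
post:

stonith_admin -l pg1         # which devices the cluster would use to fence pg1
stonith_admin --reboot pg1   # handled by p_ston_pg1 (pcmk_host_list=pg1), wherever it is started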

> 
>> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
>>
>> 11.07.2018 19:44, Salvatore D'angelo пишет:
>>> Hi all,
>>>
>>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
>>> correctly located:
>>
>> Actual location of stonith resources does not really matter in up to
>> date pacemaker. It only determines where resource will be monitored;
>> resource will be used by whatever node will be selected to perform stonith.
>>
>> The only requirement is that stonith resource is not prohibited from
>> running on node by constraints.
>>
>>> p_ston_pg1  (stonith:external/ipmi):Started pg2
>>> p_ston_pg2  (stonith:external/ipmi):Started pg1
>>> p_ston_pg3  (stonith:external/ipmi):Started pg1
>>>
>>> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
>>> expected p_ston_pg3 was running on pg3, but I see it on pg1.
>>>
>>> Here my configuration:
>>> primitive p_ston_pg1 stonith:external/ipmi \\
>>> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>> primitive p_ston_pg2 stonith:external/ipmi \\
>>> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>> primitive p_ston_pg3 stonith:external/ipmi \\
>>> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
>>> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
>>> passwd_method=file interface=lan priv=OPERATOR
>>>
>>> location l_ston_pg1 p_ston_pg1 -inf: pg1
>>> location l_ston_pg2 p_ston_pg2 -inf: pg2
>>> location l_ston_pg3 p_ston_pg3 -inf: pg3
>>>
>>> this seems work fine on bare metal.
>>> Any suggestion what could be root cause?
>>>
>>
>> Root cause of what? Locations match your constraints.
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Does this mean that even if the STONITH resource p_ston_pg1 runs on node pg2, 
when pacemaker sends a signal to it, pg1 is powered off and not pg2?
Am I correct?

> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> 
> 11.07.2018 19:44, Salvatore D'angelo пишет:
>> Hi all,
>> 
>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
>> correctly located:
> 
> Actual location of stonith resources does not really matter in up to
> date pacemaker. It only determines where resource will be monitored;
> resource will be used by whatever node will be selected to perform stonith.
> 
> The only requirement is that stonith resource is not prohibited from
> running on node by constraints.
> 
>> p_ston_pg1   (stonith:external/ipmi):Started pg2
>> p_ston_pg2   (stonith:external/ipmi):Started pg1
>> p_ston_pg3   (stonith:external/ipmi):Started pg1
>> 
>> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
>> expected p_ston_pg3 was running on pg3, but I see it on pg1.
>> 
>> Here my configuration:
>> primitive p_ston_pg1 stonith:external/ipmi \\
>>  params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
>> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg2 stonith:external/ipmi \\
>>  params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
>> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg3 stonith:external/ipmi \\
>>  params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
>> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> 
>> location l_ston_pg1 p_ston_pg1 -inf: pg1
>> location l_ston_pg2 p_ston_pg2 -inf: pg2
>> location l_ston_pg3 p_ston_pg3 -inf: pg3
>> 
>> this seems work fine on bare metal.
>> Any suggestion what could be root cause?
>> 
> 
> Root cause of what? Locations match your constraints.
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Andrei Borzenkov
11.07.2018 19:44, Salvatore D'angelo wrote:
> Hi all,
> 
> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
> correctly located:

Actual location of stonith resources does not really matter in up to
date pacemaker. It only determines where resource will be monitored;
resource will be used by whatever node will be selected to perform stonith.

The only requirement is that stonith resource is not prohibited from
running on node by constraints.

> p_ston_pg1(stonith:external/ipmi):Started pg2
> p_ston_pg2(stonith:external/ipmi):Started pg1
> p_ston_pg3(stonith:external/ipmi):Started pg1
> 
> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
> expected p_ston_pg3 was running on pg3, but I see it on pg1.
> 
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
>   params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
>   params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
>   params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> 
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
> 

Root cause of what? Locations match your constraints.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Suppose I do the following:

crm configure delete l_ston_pg1
crm configure delete l_ston_pg2
crm configure delete l_ston_pg3
crm configure location l_ston_pg1 p_ston_pg1 inf: pg1
crm configure location l_ston_pg2 p_ston_pg2 inf: pg2
crm configure location l_ston_pg3 p_ston_pg3 inf: pg3

How long should I wait to see each STONITH resource on the correct node? 
Should I do something to adjust things on the fly?
Thanks for support.

> On 11 Jul 2018, at 18:47, Emmanuel Gelati  wrote:
> 
> You need to use location l_ston_pg3 p_ston_pg3 inf: pg3, because -inf is 
> negative.
> 
> 2018-07-11 18:44 GMT+02:00 Salvatore D'angelo :
> Hi all,
> 
> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
> correctly located:
> p_ston_pg1(stonith:external/ipmi):Started pg2
> p_ston_pg2(stonith:external/ipmi):Started pg1
> p_ston_pg3(stonith:external/ipmi):Started pg1
> 
> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
> expected p_ston_pg3 was running on pg3, but I see it on pg1.
> 
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
>   params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
>   params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
>   params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> 
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> 
> Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> 
> -- 
>   .~.
>   /V\
>  //  \\
> /(   )\
> ^`~'^
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Emmanuel Gelati
You need to use location l_ston_pg3 p_ston_pg3 inf: pg3, because -inf is
negative.

2018-07-11 18:44 GMT+02:00 Salvatore D'angelo :

> Hi all,
>
> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not
> correctly located:
> p_ston_pg1 (stonith:external/ipmi): Started pg2
> p_ston_pg2 (stonith:external/ipmi): Started pg1
> p_ston_pg3 (stonith:external/ipmi): Started pg1
>
> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13).
> I expected p_ston_pg3 was running on pg3, but I see it on pg1.
>
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list ipaddr=
> 10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" passwd_method=file
> interface=lan priv=OPERATOR
>
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
>
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Hi all,

in my cluster, running crm_mon -1ARrf, I noticed my STONITH resources are not 
correctly located:
p_ston_pg1  (stonith:external/ipmi):Started pg2
p_ston_pg2  (stonith:external/ipmi):Started pg1
p_ston_pg3  (stonith:external/ipmi):Started pg1

I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
expected p_ston_pg3 to be running on pg3, but I see it on pg1.

Here is my configuration:
primitive p_ston_pg1 stonith:external/ipmi \\
params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR
primitive p_ston_pg2 stonith:external/ipmi \\
params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR
primitive p_ston_pg3 stonith:external/ipmi \\
params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR

location l_ston_pg1 p_ston_pg1 -inf: pg1
location l_ston_pg2 p_ston_pg2 -inf: pg2
location l_ston_pg3 p_ston_pg3 -inf: pg3

this seems to work fine on bare metal.
Any suggestion as to what the root cause could be?


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.19 released

2018-07-11 Thread Ken Gaillot
Source code for the final release of Pacemaker version 1.1.19 is
available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.19

This is a maintenance release that backports selected fixes and
features from the 2.0.0 version. The 1.1 series is no longer actively
maintained, but is still supported for users (and distributions) who
want to keep support for features dropped by the 2.0 series (such as
CMAN or heartbeat as the cluster layer).

The most significant changes in this release are:

* stonith_admin has a new --validate option to validate device
configurations

* .mount, .path, and .timer unit files are now supported as "systemd:"-class
resources (a brief sketch follows this list)

* 5 regressions in 1.1.17 and 1.1.18 have been fixed
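
As an illustration of the unit-file support, a .timer unit can be managed as a
cluster resource roughly like this (crm shell syntax with a hypothetical unit
name, so treat it as a sketch only):

primitive backup-timer systemd:backup.timer \
  op monitor interval=60s

The ".timer" suffix selects the unit type; without a suffix, "systemd:"
resources keep referring to .service units as before.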

For a more detailed list of bug fixes and other changes, see the change
log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Gao,Yan, Hideo Yamauchi, Jan Pokorný, Ken
Gaillot, and Klaus Wenninger.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi,

Yes, that was clear to me, but the question is: pacemaker installs an
/etc/init.d/pacemaker script, yet its LSB header is not compatible with newer
systems.
So if pacemaker creates scripts in /etc/init.d, it should create them so that
they work on the supported OSes (I am not sure if Ubuntu is one of them).
When I run “make install”, nothing is created for the systemd environment.

I am not a SysV vs. systemd expert, so I hope I haven’t said anything wrong.

> On 11 Jul 2018, at 18:40, Andrei Borzenkov  wrote:
> 
> 11.07.2018 18:08, Salvatore D'angelo пишет:
>> Hi All,
>> 
>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync 
>> from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario 
>> on Ubuntu 16.04.
> 
> 16.04 is using systemd, you need to create systemd unit. I do not know
> if there is any compatibility layer to interpret upstart configuration
> like the one for sysvinit.
> 
>> As my previous scenario I am using Docker for test purpose before move to 
>> Bare metal.
>> The scenario worked properly after I downloaded the correct dependencies 
>> versions.
>> 
>> The only problem I experienced is that in my procedure install I set 
>> corosync and pacemaker to run at startup updating the init.d scripts with 
>> this commands:
>> 
>> update-rc.d corosync defaults
>> update-rc.d pacemaker defaults 80 80
>> 
>> I noticed that links in /etc/rc are not created.
>> 
>> I have also the following errors on second update-rc.d command:
>> insserv: Service corosync has to be enabled to start service pacemaker
>> insserv: exiting now!
>> 
>> I was able to solve the issue manually replacing these lines in 
>> /etc/init.d/corosync and /etc/init.d/pacemaker:
>> # Default-Start:
>> # Default-Stop:
>> 
>> with this:
>> # Default-Start:2 3 4 5
>> # Default-Stop: 0 1 6
>> 
>> I didn’t understand if this is a bug of corosync or pacemaker or simply 
>> there is a dependency missing on Ubuntu 16.04 that was installed by default 
>> on 14.04. I found other discussion on this forum about this problem but it’s 
>> not clear the solution.
>> Thanks in advance for support.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Andrei Borzenkov
11.07.2018 18:08, Salvatore D'angelo writes:
> Hi All,
> 
> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync 
> from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario 
> on Ubuntu 16.04.

16.04 uses systemd, so you need to create a systemd unit. I do not know
whether there is a compatibility layer that interprets upstart configuration
like the one that exists for sysvinit.
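
If the build did not install a unit (for example because systemd support was
not detected at configure time), a minimal hand-written unit is usually enough
to get going on 16.04. The following is only a rough, untested sketch (paths
and the EnvironmentFile location are assumptions); the templates shipped in the
pacemaker source tree (e.g. mcp/pacemaker.service.in) are the better starting
point:

# /etc/systemd/system/pacemaker.service (sketch)
[Unit]
Description=Pacemaker High Availability Cluster Manager
Requires=corosync.service
After=corosync.service

[Service]
Type=simple
# pacemakerd normally stays in the foreground, hence Type=simple
ExecStart=/usr/sbin/pacemakerd
EnvironmentFile=-/etc/default/pacemaker
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then "systemctl daemon-reload" followed by "systemctl enable corosync pacemaker"
(assuming corosync has a unit as well) replaces the update-rc.d calls.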

> As my previous scenario I am using Docker for test purpose before move to 
> Bare metal.
> The scenario worked properly after I downloaded the correct dependencies 
> versions.
> 
> The only problem I experienced is that in my procedure install I set corosync 
> and pacemaker to run at startup updating the init.d scripts with this 
> commands:
> 
> update-rc.d corosync defaults
> update-rc.d pacemaker defaults 80 80
> 
> I noticed that links in /etc/rc are not created.
> 
> I have also the following errors on second update-rc.d command:
> insserv: Service corosync has to be enabled to start service pacemaker
> insserv: exiting now!
> 
> I was able to solve the issue manually replacing these lines in 
> /etc/init.d/corosync and /etc/init.d/pacemaker:
> # Default-Start:
> # Default-Stop:
> 
> with this:
> # Default-Start:2 3 4 5
> # Default-Stop: 0 1 6
> 
> I didn’t understand if this is a bug of corosync or pacemaker or simply there 
> is a dependency missing on Ubuntu 16.04 that was installed by default on 
> 14.04. I found other discussion on this forum about this problem but it’s not 
> clear the solution.
> Thanks in advance for support.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi All,

After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync from 
2.3.35 to 2.4.4 on Ubuntu 14.04, I am trying to repeat the same scenario on 
Ubuntu 16.04.
As in my previous scenario, I am using Docker for testing purposes before moving 
to bare metal.
The scenario worked properly after I downloaded the correct dependency 
versions.

The only problem I experienced is that my install procedure sets corosync and 
pacemaker to run at startup by updating the init.d scripts with these commands:

update-rc.d corosync defaults
update-rc.d pacemaker defaults 80 80

I noticed that the links in /etc/rc*.d are not created.

I also get the following error from the second update-rc.d command:
insserv: Service corosync has to be enabled to start service pacemaker
insserv: exiting now!

I was able to solve the issue by manually replacing these lines in 
/etc/init.d/corosync and /etc/init.d/pacemaker:
# Default-Start:
# Default-Stop:

with this:
# Default-Start:2 3 4 5
# Default-Stop: 0 1 6

I don’t understand whether this is a bug in corosync or pacemaker, or simply a 
dependency missing on Ubuntu 16.04 that was installed by default on 14.04. I 
found other discussions of this problem on this forum, but the solution is not 
clear to me.
Thanks in advance for your support.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] What triggers fencing?

2018-07-11 Thread Klaus Wenninger
On 07/11/2018 04:11 PM, Ken Gaillot wrote:
> On Wed, 2018-07-11 at 11:06 +0200, Klaus Wenninger wrote:
>> On 07/11/2018 05:48 AM, Andrei Borzenkov wrote:
>>> 11.07.2018 05:45, Confidential Company пишет:
 Not true, the faster node will kill the slower node first. It is
 possible that through misconfiguration, both could die, but it's
 rare
 and easily avoided with a 'delay="15"' set on the fence config
 for the
 node you want to win.

 Don't use a delay on the other node, just the node you want to
 live in
 such a case.

 **
 1. Given Active/Passive setup, resources are
 active on Node1
 2. fence1(prefers to Node1, delay=15) and
 fence2(prefers to
 Node2, delay=30)
 3. Node2 goes down
> What do you mean by "down" in this case?
>
> If you mean the host itself has crashed, then it will not do anything,
> and node1 will fence it.
>
> If you mean node2's network goes out, so it's still functioning but no
> one can reach the managed service on it, then you are correct, the
> "wrong" node can get shot -- because you didn't specify anything about
> what the right node would be. This is a somewhat tricky area, but it
> can be done with a quorum-only node, qdisk, or fence_heuristics_ping,
> all of which are different ways of "preferring" the node that can reach
> a certain host.
Or, in other words: why would I - as a cluster node - shoot the
peer in order to start the services locally if I can somehow
tell beforehand that my services wouldn't be reachable by
anybody anyway (e.g. the network is disconnected)?
Then it might make more sense to sit still and wait to be shot by
the other side, in case that node is luckier and, for example,
still has access to the network.
>
> If you mean the cluster-managed resource crashes on node2, but node2
> itself is still functioning properly, then what happens depends on how
> you've configured failure recovery. By default, there is no fencing,
> and the cluster tries to restart the resource.
>
 4. Node1 thinks Node2 goes down / Node2 thinks
 Node1 goes
 down
>>> If node2 is down, it cannot think anything.
>> True. Assuming it is not really down but just somehow disconnected
>> for my answer below.
>>
 5. fence1 counts 15 seconds before he fence Node1
 while
 fence2 counts 30 seconds before he fence Node2
 6. Since fence1 do have shorter time than fence2,
 fence1
 executes and shutdown Node1.
 7. fence1(action: shutdown Node1)  will trigger
 first
 always because it has shorter delay than fence2.

 ** Okay what's important is that they should be different. But in
 the case
 above, even though Node2 goes down but Node1 has shorter delay,
 Node1 gets
 fenced/shutdown. This is a sample scenario. I don't get the
 point. Can you
 comment on this?
>> You didn't send the actual config but from your description
>> I get the scenario that way:
>>
>> fencing-resource fence1 is running on Node2 and it is there
>> to fence Node1 and it has a delay of 15s.
>> fencing-resource fence2 is running on Node1 and it is there
>> to fence Node2 and it has a delay of 30s.
>> If they now begin to fence each other at the same time the
>> node actually fenced would be Node1 of course as the
>> fencing-resource fence1 is gonna shoot 15s earlier that the
>> fence2.
>> Looks consistent to me ...
>>
>> Regards,
>> Klaus
>>
 Thanks

 On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger >>> t.com>
 wrote:

> On 07/09/2018 05:53 PM, Digimer wrote:
>> On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
>>> On 07/09/2018 05:33 PM, Digimer wrote:
 On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
> On 07/09/2018 03:49 PM, Digimer wrote:
>> On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
>>> On 07/09/2018 02:04 PM, Confidential Company wrote:
 Hi,

 Any ideas what triggers fencing script or
 stonith?

 Given the setup below:
 1. I have two nodes
 2. Configured fencing on both nodes
 3. Configured delay=15 and delay=30 on fence1(for
 Node1) and
 fence2(for Node2) respectively

 *What does it mean to configured delay in
 stonith? wait for 15
> seconds
 before it fence the node?
>>> Given that on a 2-node-cluster you don't have real
>>> quorum to make
> one
>>> partial cluster fence the rest of the nodes the
>>> different delays
> are meant
>>> to prevent a fencing-race.
>>> Without different delays that would lead to both
>>> nodes fencing each
>>> other at the same time - finally both being down.
>> Not true, the faster node will kill the slower node
>> fi

Re: [ClusterLabs] Antw: OCF Return codes OCF_NOT_RUNNING

2018-07-11 Thread Ken Gaillot
On Wed, 2018-07-11 at 13:44 +0200, Ulrich Windl wrote:
> > > > Ian Underhill  schrieb am 11.07.2018
> > > > um 13:27 in
> 
> Nachricht
> :
> > im trying to understand the behaviour of pacemaker when a resource
> > monitor
> > returns OCF_NOT_RUNNING instead of OCF_ERR_GENERIC, and does
> > pacemaker
> > really care.
> > 
> > The documentation states that a return code OCF_NOT_RUNNING from a
> > monitor
> > will not result in a stop being called on that resource, as it
> > believes the
> > node is still clean.
> > 
> > https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/
> > Pacemaker 
> > _Explained/s-ocf-return-codes.html
> > 
> > This makes sense, however in practice is not what happens (unless
> > im doing
> > something wrong :) )
> > 
> > When my resource returns OCF_NOT_RUNNING for a monitor call (after
> > a start
> > has been performed) a stop is called.
> 
> Well: it depends: If your start was successful, pacemaker believes
> the resource is running. If the monitor says it's stopped, pacemaker
> seems to try a "clean stop" by calling the stop method (possibly
> before trying to start it again). Am I right?

Yes, I think the documentation is wrong. It depends on what state the
cluster thinks the resource is supposed to be in. If the cluster
expects the resource is already stopped (for example, when doing a
probe), then "not running" will not result in a stop. If the cluster
expects the resource to be running (for example, when doing a normal
recurring monitor), then the documentation is incorrect: recovery
includes a stop followed by a start.
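
To make the distinction concrete, here is roughly what an agent's monitor is
expected to return (a sketch only, assuming a hypothetical pid-file-based
daemon; the OCF_* constants come from ocf-shellfuncs):

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

my_monitor() {
    # PIDFILE is assumed to be set elsewhere in the agent
    if [ ! -e "$PIDFILE" ]; then
        # cleanly stopped: no process and nothing left behind
        return $OCF_NOT_RUNNING
    fi
    if kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        # process is alive (a real agent would also check health here)
        return $OCF_SUCCESS
    fi
    # pid file exists but the process is gone: failed, unclean state
    return $OCF_ERR_GENERIC
}

On a recurring monitor either non-zero result triggers recovery; the difference
mainly matters for probes and for how the failure is reported.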

> > if I have a resource threshold set >1,  i get start->monitor->stop
> > cycle
> > until the threshold is consumed
> 
> Then either your start is broken, or your monitor is broken. Try to
> validate your RA using ocf-tester before using it.
> 
> Regards,
> Ulrich
> 
> > 
> > /Ian.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] What triggers fencing?

2018-07-11 Thread Confidential Company
Message: 1
Date: Wed, 11 Jul 2018 11:06:56 +0200
From: Klaus Wenninger 
To: Cluster Labs - All topics related to open-source clustering
welcomed , Andrei Borzenkov

Subject: Re: [ClusterLabs] What triggers fencing?
Message-ID: 
Content-Type: text/plain; charset=utf-8

On 07/11/2018 05:48 AM, Andrei Borzenkov wrote:
> 11.07.2018 05:45, Confidential Company writes:
>> Not true, the faster node will kill the slower node first. It is
>> possible that through misconfiguration, both could die, but it's rare
>> and easily avoided with a 'delay="15"' set on the fence config for the
>> node you want to win.
>>
>> Don't use a delay on the other node, just the node you want to live in
>> such a case.
>>
>> **
>> 1. Given Active/Passive setup, resources are active on
Node1
>> 2. fence1(prefers to Node1, delay=15) and fence2(prefers
to
>> Node2, delay=30)
>> 3. Node2 goes down
>> 4. Node1 thinks Node2 goes down / Node2 thinks Node1 goes
>> down
> If node2 is down, it cannot think anything.

True. Assuming it is not really down but just somehow disconnected
for my answer below.

>
>> 5. fence1 counts 15 seconds before he fence Node1 while
>> fence2 counts 30 seconds before he fence Node2
>> 6. Since fence1 do have shorter time than fence2, fence1
>> executes and shutdown Node1.
>> 7. fence1(action: shutdown Node1)  will trigger first
>> always because it has shorter delay than fence2.
>>
>> ** Okay what's important is that they should be different. But in the
case
>> above, even though Node2 goes down but Node1 has shorter delay, Node1
gets
>> fenced/shutdown. This is a sample scenario. I don't get the point. Can
you
>> comment on this?

You didn't send the actual config but from your description
I get the scenario that way:

fencing-resource fence1 is running on Node2 and it is there
to fence Node1 and it has a delay of 15s.
fencing-resource fence2 is running on Node1 and it is there
to fence Node2 and it has a delay of 30s.
If they now begin to fence each other at the same time the
node actually fenced would be Node1 of course as the
fencing-resource fence1 is gonna shoot 15s earlier than
fence2.
Looks consistent to me ...

Regards,
Klaus



***
Yes, that is right, Klaus. fence1, running on Node2, will fence Node1, and fence1
will execute first whichever node goes down, because it has the shorter delay.
But if Node2 goes down or is disconnected, how can it be fenced by Node1 using
fence2, if fence2 cannot be triggered because fence1 always comes first?

My point here is that putting a delay on fencing resolves the issue of double
fencing, but it does not decide which node should be fenced.
Even though Node2 is the one that got disconnected, Node1 will be fenced and the
whole service goes down.

**Let me share my actual config:

I have two ESXi hosts, 2 virtual machines, and 2 interfaces on each (1 = corosync
interface, 1 = interface for the VM to contact its ESXi host)

Pacemaker Nodes:
 ArcosRhel1 ArcosRhel2

Resources:
 Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=32 ip=172.16.10.243
  Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
  start interval=0s timeout=20s (ClusterIP-start-interval-0s)
  stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)

Stonith Devices:
 Resource: Fence1 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.201 login=test passwd=testing
pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1
ssl_insecure=1
  Operations: monitor interval=60s (Fence1-monitor-interval-60s)
 Resource: fence2 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.202 login=test passwd=testing
pcmk_delay_max=10s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
port=ArcosRhel2 ssl_insecure=1
  Operations: monitor interval=60s (fence2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: Fence1
Enabled on: ArcosRhel2 (score:INFINITY)
(id:location-Fence1-ArcosRhel2-INFINITY)
  Resource: fence2
Enabled on: ArcosRhel1 (score:INFINITY)
(id:location-fence2-ArcosRhel1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ARCOSCLUSTER
 dc-version: 1.1.16-12.el7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1531300540
 stonith-enabled: true

*
>>
>> Thanks
>>
>> On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger 
>> wrote:
>>
>>> On 07/09/2018 05:53 PM, Digimer wrote:
 On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
> On 07/09/2018 05:33 PM, Digimer wrote:
>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>> On 07/09/2018 03:49 PM, Digimer wrote:
 On 2018-07-09 08:31 AM, Klaus Wenninger wrote:

Re: [ClusterLabs] What triggers fencing?

2018-07-11 Thread Ken Gaillot
On Wed, 2018-07-11 at 11:06 +0200, Klaus Wenninger wrote:
> On 07/11/2018 05:48 AM, Andrei Borzenkov wrote:
> > 11.07.2018 05:45, Confidential Company пишет:
> > > Not true, the faster node will kill the slower node first. It is
> > > possible that through misconfiguration, both could die, but it's
> > > rare
> > > and easily avoided with a 'delay="15"' set on the fence config
> > > for the
> > > node you want to win.
> > > 
> > > Don't use a delay on the other node, just the node you want to
> > > live in
> > > such a case.
> > > 
> > > **
> > > 1. Given Active/Passive setup, resources are
> > > active on Node1
> > > 2. fence1(prefers to Node1, delay=15) and
> > > fence2(prefers to
> > > Node2, delay=30)
> > > 3. Node2 goes down

What do you mean by "down" in this case?

If you mean the host itself has crashed, then it will not do anything,
and node1 will fence it.

If you mean node2's network goes out, so it's still functioning but no
one can reach the managed service on it, then you are correct, the
"wrong" node can get shot -- because you didn't specify anything about
what the right node would be. This is a somewhat tricky area, but it
can be done with a quorum-only node, qdisk, or fence_heuristics_ping,
all of which are different ways of "preferring" the node that can reach
a certain host.
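
A rough pcs sketch of the fence_heuristics_ping variant (device and node names
are hypothetical, and the parameter names should be double-checked with
"pcs stonith describe fence_heuristics_ping"):

pcs stonith create fence-heuristic fence_heuristics_ping ping_targets="192.168.1.1"
pcs stonith level add 1 node1 fence-heuristic,fence-node1
pcs stonith level add 1 node2 fence-heuristic,fence-node2

The heuristic agent never powers anything off; it simply fails when the local
node cannot reach the ping target, and since every device in a topology level
must succeed, the isolated node never gets as far as shooting its healthy peer.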

If you mean the cluster-managed resource crashes on node2, but node2
itself is still functioning properly, then what happens depends on how
you've configured failure recovery. By default, there is no fencing,
and the cluster tries to restart the resource.
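
For that last case the behaviour is set per resource/operation; with pcs,
something along these lines (hypothetical resource name, shown only to
illustrate the knobs involved):

pcs resource update my-rsc op monitor interval=10s on-fail=fence
pcs resource update my-rsc meta migration-threshold=3

The first makes a failed monitor escalate to fencing; the second, with the
default on-fail=restart, limits how often the resource is restarted locally
before being moved away.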

> > > 4. Node1 thinks Node2 goes down / Node2 thinks
> > > Node1 goes
> > > down
> > 
> > If node2 is down, it cannot think anything.
> 
> True. Assuming it is not really down but just somehow disconnected
> for my answer below.
> 
> > 
> > > 5. fence1 counts 15 seconds before he fence Node1
> > > while
> > > fence2 counts 30 seconds before he fence Node2
> > > 6. Since fence1 do have shorter time than fence2,
> > > fence1
> > > executes and shutdown Node1.
> > > 7. fence1(action: shutdown Node1)  will trigger
> > > first
> > > always because it has shorter delay than fence2.
> > > 
> > > ** Okay what's important is that they should be different. But in
> > > the case
> > > above, even though Node2 goes down but Node1 has shorter delay,
> > > Node1 gets
> > > fenced/shutdown. This is a sample scenario. I don't get the
> > > point. Can you
> > > comment on this?
> 
> You didn't send the actual config but from your description
> I get the scenario that way:
> 
> fencing-resource fence1 is running on Node2 and it is there
> to fence Node1 and it has a delay of 15s.
> fencing-resource fence2 is running on Node1 and it is there
> to fence Node2 and it has a delay of 30s.
> If they now begin to fence each other at the same time the
> node actually fenced would be Node1 of course as the
> fencing-resource fence1 is gonna shoot 15s earlier that the
> fence2.
> Looks consistent to me ...
> 
> Regards,
> Klaus
> 
> > > 
> > > Thanks
> > > 
> > > On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger  > > t.com>
> > > wrote:
> > > 
> > > > On 07/09/2018 05:53 PM, Digimer wrote:
> > > > > On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
> > > > > > On 07/09/2018 05:33 PM, Digimer wrote:
> > > > > > > On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
> > > > > > > > On 07/09/2018 03:49 PM, Digimer wrote:
> > > > > > > > > On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
> > > > > > > > > > On 07/09/2018 02:04 PM, Confidential Company wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > Any ideas what triggers fencing script or
> > > > > > > > > > > stonith?
> > > > > > > > > > > 
> > > > > > > > > > > Given the setup below:
> > > > > > > > > > > 1. I have two nodes
> > > > > > > > > > > 2. Configured fencing on both nodes
> > > > > > > > > > > 3. Configured delay=15 and delay=30 on fence1(for
> > > > > > > > > > > Node1) and
> > > > > > > > > > > fence2(for Node2) respectively
> > > > > > > > > > > 
> > > > > > > > > > > *What does it mean to configured delay in
> > > > > > > > > > > stonith? wait for 15
> > > > 
> > > > seconds
> > > > > > > > > > > before it fence the node?
> > > > > > > > > > 
> > > > > > > > > > Given that on a 2-node-cluster you don't have real
> > > > > > > > > > quorum to make
> > > > 
> > > > one
> > > > > > > > > > partial cluster fence the rest of the nodes the
> > > > > > > > > > different delays
> > > > 
> > > > are meant
> > > > > > > > > > to prevent a fencing-race.
> > > > > > > > > > Without different delays that would lead to both
> > > > > > > > > > nodes fencing each
> > > > > > > > > > other at the same time - finally both being down.
> > > > > > > > > 
> > > > > > > > > Not true, the faster node will kill the slower node
> > > > > > > > > first. It is
> > > > > > > > > possible that through misconfiguration, both co

[ClusterLabs] Antw: OCF Return codes OCF_NOT_RUNNING

2018-07-11 Thread Ulrich Windl
>>> Ian Underhill  schrieb am 11.07.2018 um 13:27 in
Nachricht
:
> im trying to understand the behaviour of pacemaker when a resource monitor
> returns OCF_NOT_RUNNING instead of OCF_ERR_GENERIC, and does pacemaker
> really care.
> 
> The documentation states that a return code OCF_NOT_RUNNING from a monitor
> will not result in a stop being called on that resource, as it believes the
> node is still clean.
> 
> https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker 
> _Explained/s-ocf-return-codes.html
> 
> This makes sense, however in practice is not what happens (unless im doing
> something wrong :) )
> 
> When my resource returns OCF_NOT_RUNNING for a monitor call (after a start
> has been performed) a stop is called.

Well: it depends: If your start was successful, pacemaker believes the resource 
is running. If the monitor says it's stopped, pacemaker seems to try a "clean 
stop" by calling the stop method (possibly before trying to start it again). Am 
I right?

> 
> if I have a resource threshold set >1,  i get start->monitor->stop cycle
> until the threshold is consumed

Then either your start is broken, or your monitor is broken. Try to validate 
your RA using ocf-tester before using it.
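
A minimal invocation looks something like this (hypothetical resource name,
parameter and agent path):

ocf-tester -n test-rsc -o myparam=value /usr/lib/ocf/resource.d/myprovider/myagent

ocf-tester (shipped with resource-agents) drives the agent through its
meta-data, start, monitor and stop actions and complains when the return codes
are inconsistent.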

Regards,
Ulrich

> 
> /Ian.




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] OCF Return codes OCF_NOT_RUNNING

2018-07-11 Thread Ian Underhill
I'm trying to understand the behaviour of pacemaker when a resource monitor
returns OCF_NOT_RUNNING instead of OCF_ERR_GENERIC, and whether pacemaker
really cares.

The documentation states that a return code OCF_NOT_RUNNING from a monitor
will not result in a stop being called on that resource, as it believes the
node is still clean.

https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-ocf-return-codes.html

This makes sense; however, in practice it is not what happens (unless I'm doing
something wrong :) )

When my resource returns OCF_NOT_RUNNING for a monitor call (after a start
has been performed) a stop is called.

If I have a resource threshold set >1, I get a start->monitor->stop cycle
until the threshold is consumed.

/Ian.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] What triggers fencing?

2018-07-11 Thread Klaus Wenninger
On 07/11/2018 05:48 AM, Andrei Borzenkov wrote:
> 11.07.2018 05:45, Confidential Company пишет:
>> Not true, the faster node will kill the slower node first. It is
>> possible that through misconfiguration, both could die, but it's rare
>> and easily avoided with a 'delay="15"' set on the fence config for the
>> node you want to win.
>>
>> Don't use a delay on the other node, just the node you want to live in
>> such a case.
>>
>> **
>> 1. Given Active/Passive setup, resources are active on Node1
>> 2. fence1(prefers to Node1, delay=15) and fence2(prefers to
>> Node2, delay=30)
>> 3. Node2 goes down
>> 4. Node1 thinks Node2 goes down / Node2 thinks Node1 goes
>> down
> If node2 is down, it cannot think anything.

True. Assuming it is not really down but just somehow disconnected
for my answer below.

>
>> 5. fence1 counts 15 seconds before he fence Node1 while
>> fence2 counts 30 seconds before he fence Node2
>> 6. Since fence1 do have shorter time than fence2, fence1
>> executes and shutdown Node1.
>> 7. fence1(action: shutdown Node1)  will trigger first
>> always because it has shorter delay than fence2.
>>
>> ** Okay what's important is that they should be different. But in the case
>> above, even though Node2 goes down but Node1 has shorter delay, Node1 gets
>> fenced/shutdown. This is a sample scenario. I don't get the point. Can you
>> comment on this?

You didn't send the actual config but from your description
I get the scenario that way:

fencing-resource fence1 is running on Node2 and it is there
to fence Node1 and it has a delay of 15s.
fencing-resource fence2 is running on Node1 and it is there
to fence Node2 and it has a delay of 30s.
If they now begin to fence each other at the same time the
node actually fenced would be Node1 of course as the
fencing-resource fence1 is gonna shoot 15s earlier than
fence2.
Looks consistent to me ...

Regards,
Klaus

>>
>> Thanks
>>
>> On Tue, Jul 10, 2018 at 12:18 AM, Klaus Wenninger 
>> wrote:
>>
>>> On 07/09/2018 05:53 PM, Digimer wrote:
 On 2018-07-09 11:45 AM, Klaus Wenninger wrote:
> On 07/09/2018 05:33 PM, Digimer wrote:
>> On 2018-07-09 09:56 AM, Klaus Wenninger wrote:
>>> On 07/09/2018 03:49 PM, Digimer wrote:
 On 2018-07-09 08:31 AM, Klaus Wenninger wrote:
> On 07/09/2018 02:04 PM, Confidential Company wrote:
>> Hi,
>>
>> Any ideas what triggers fencing script or stonith?
>>
>> Given the setup below:
>> 1. I have two nodes
>> 2. Configured fencing on both nodes
>> 3. Configured delay=15 and delay=30 on fence1(for Node1) and
>> fence2(for Node2) respectively
>>
>> *What does it mean to configured delay in stonith? wait for 15
>>> seconds
>> before it fence the node?
> Given that on a 2-node-cluster you don't have real quorum to make
>>> one
> partial cluster fence the rest of the nodes the different delays
>>> are meant
> to prevent a fencing-race.
> Without different delays that would lead to both nodes fencing each
> other at the same time - finally both being down.
 Not true, the faster node will kill the slower node first. It is
 possible that through misconfiguration, both could die, but it's rare
 and easily avoided with a 'delay="15"' set on the fence config for
>>> the
 node you want to win.
>>> What exactly is not true? Aren't we saying the same?
>>> Of course one of the delays can be 0 (most important is that
>>> they are different).
>> Perhaps I misunderstood your message. It seemed to me that the
>> implication was that fencing in 2-node without a delay always ends up
>> with both nodes being down, which isn't the case. It can happen if the
>> fence methods are not setup right (ie: the node isn't set to
>>> immediately
>> power off on ACPI power button event).
> Yes, a misunderstanding I guess.
>
> Should have been more verbose in saying that due to the
> time between the fencing-command fired off to the fencing
> device and the actual fencing taking place (as you state
> dependent on how it is configured in detail - but a measurable
> time in all cases) there is a certain probability that when
> both nodes start fencing at roughly the same time we will
> end up with 2 nodes down.
>
> Everybody has to find his own tradeoff between reliability
> fence-races are prevented and fencing delay I guess.
 We've used this;

 1. IPMI (with the guest OS set to immediately power off) as primary,
 with a 15 second delay on the active node.

 2. Two Switched PDUs (two power circuits, two PSUs) as backup fencing
 for when IPMI fails, with no delay.

 In ~8 years, across dozens and dozens of clusters and countless fence
 action