On Fri, 2021-11-12 at 17:31 +0000, S Rogers wrote:
> Hi, I'm hoping someone will be able to point me in the right
> direction.
>
> I am configuring a two-node active/passive cluster that utilises the
> PostgreSQL PAF resource agent. Each node has two NICs, therefore the
> cluster is configured with two corosync links - one on each network
> (one network is the public network, the other is effectively private
> and just used for cluster communication). The cluster has a virtual
> IP resource, which has a colocation constraint to keep it together
> with the primary Postgres instance.
>
> I am trying to protect against the scenario where the public network
> interface on the active node goes down, in which case I want a
> failover to occur and the other node to take over and host the
> primary Postgres instance and the public virtual IP. My current
> approach is to use ocf:heartbeat:ethmonitor to monitor the public
> interface along with a location constraint to ensure that the virtual
> IP must be on a node where the public interface is UP.
>
> With this configuration, if I disconnect the active node from the
> public network, Pacemaker attempts to move the primary PostgreSQL and
> virtual IP to the other node. The problem is that it attempts to stop
> the resources gracefully, which causes the pgsql resource to error
> with "Switchover has been canceled from pre-promote action" (which I
> believe is because PostgreSQL shuts down, but can't communicate with
> the standby during the shutdown - a similar situation to what is
> described here: https://github.com/ClusterLabs/PAF/issues/149)
>
> Ideally, if the public network interface on the active node goes down
> I would want to take that node offline (either fence it or put it in
> standby mode, so that no resources can run on it), leaving just the
> other node in the cluster as the active node. Then the old primary
> can be rebuilt from the new primary in order to join the cluster
> again. However, I can't figure out a way to cause the active node to
> be fenced as a result of ocf:heartbeat:ethmonitor detecting that the
> interface has gone down.
>
> Does anyone have any ideas/pointers on how I could achieve this, or
> an alternative approach?
>
> Hopefully that makes sense. Any help is appreciated!
>
> Thanks.
Failure handling is configurable via the on-fail meta-attribute. You can
set on-fail=fence for the ethmonitor resource's monitor action to fence
the node if the monitor fails.

There's also on-fail=standby, but that will still try to stop any active
resources gracefully, so it doesn't help in this case.
--
Ken Gaillot <[email protected]>
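
For concreteness, a rough sketch of the on-fail=fence arrangement
described above, in crm shell syntax (pcs has equivalents). The
interface name eth0 and the resource names public-nic-check,
cl-public-nic-check and vip are placeholders for whatever the existing
configuration uses, and the location rule simply restates the
constraint already described in the question:

  # Clone ethmonitor so the public NIC is watched on every node.
  # on-fail=fence asks Pacemaker to fence the node if the monitor
  # operation itself fails.
  primitive public-nic-check ocf:heartbeat:ethmonitor \
      params interface=eth0 \
      op monitor interval=10s timeout=60s on-fail=fence
  clone cl-public-nic-check public-nic-check

  # Keep the virtual IP off any node where ethmonitor has not set the
  # node attribute it publishes (ethmonitor-eth0 by default, 1 = up).
  location vip-needs-public-nic vip \
      rule -inf: not_defined ethmonitor-eth0 or ethmonitor-eth0 ne 1

Note that on-fail applies to failures of the monitor operation itself,
so it is worth verifying (e.g. by unplugging the passive node first)
that a link-down is actually reported as a monitor failure in your
setup before relying on it.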
