Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Andrei Borzenkov
On Mon, Nov 15, 2021 at 3:32 PM S Rogers  wrote:
>>
>> The only solution here - as long as fencing node on external
>> connectivity loss is acceptable - is modifying ethmonitor RA to fail
>> monitor operation in this case.
>
> I was hoping to find a way to achieve the desired outcome without resorting 
> to a custom RA, but it does appear to be the only solution.
>

Well, looking at it from a different angle - you could use the knet
nozzle interface for replication, which means your postgres
connectivity is guaranteed to be the same as pacemaker/corosync
connectivity.
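
Something along these lines in corosync.conf on every node (untested
sketch; the nozzle directive is described in corosync.conf(5), and the
network here is picked arbitrarily):

nozzle {
    name: nozzle0          # TAP interface that appears on each node
    ipaddr: 192.168.70.0   # base network; per-node address derived from nodeid
    ipprefix: 24
}

PostgreSQL replication (primary_conninfo, pg_hba.conf) would then be
pointed at the nozzle addresses, so it can only talk while knet can.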


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread S Rogers


On 15/11/2021 12:03, Klaus Wenninger wrote:
> On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov wrote:
>> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
>>> On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
>>>> I had thought about doing that, but the cluster is then dependent on the
>>>> external system, and if that external system was to go down or become
>>>> unreachable for any reason then it would falsely cause the cluster to
>>>> failover or worse it could even take the cluster down completely, if the
>>>> external system goes down and both nodes cannot ping it.
>>> You wouldn't necessarily have to ban resources from nodes that can't
>>> reach the external network. It would be enough to make them prefer
>>> the location that has connection. So if both lose connection one side
>>> would still stay up.
>>> Not to depend on something really external you might use the
>>> router to your external network as ping target.
>>> In case of fencing - triggered by whatever - and a potential fence-race
>> The problem here is that nothing really triggers fencing. What happens, is
> Got that! Which is why I first gave the hint on how to prevent shutting
> down services using ping.
> Taking care of what happens when nodes are fenced still makes sense.
> Imagine a fence-race where the node running services loses, only to have
> the services moved back when it comes up again.
>
> Klaus
Thanks, I wasn't aware of priority-fencing-delay. While it doesn't solve 
this problem, I can still use it to improve the fencing behaviour of the 
cluster in general.
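
For the archive, it looks like this only needs a priority on the
important resource plus one cluster property, e.g. (pcs syntax, values
illustrative, untested):

# give the node running the promoted pgsql instance a head start in a fence race
pcs resource meta pgsqld-clone priority=10
pcs property set priority-fencing-delay=60s

If I read the documentation right, the delay wants to be significantly
larger than any pcmk_delay_max/pcmk_delay_base on the fence devices, so
the static pcmk_delay_base=15 on node1_fence_agent would probably be
dropped in favour of it.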


Unfortunately, in some situations this cluster will be deployed in a 
completely isolated network so there may not even be a router that we 
can use as a ping target, and we can't guarantee the presence of any 
other system on the network that we could reliably use as a ping target.




>> - two postgres lose connection over external network, but cluster
>> nodes retain connectivity over another network
>> - postgres RA compares "latest timestamp" when selecting the best node
>> to fail over to
>> - primary postgres has better timestamp, so RA simply does not
>> consider secondary as suitable for (automatic) failover
>>
>> The only solution here - as long as fencing node on external
>> connectivity loss is acceptable - is modifying ethmonitor RA to fail
>> monitor operation in this case.

I was hoping to find a way to achieve the desired outcome without 
resorting to a custom RA, but it does appear to be the only solution.


This may not be the right audience, but does anyone know if it would be
a viable change to add an additional parameter to the ethmonitor RA that
lets users change the behaviour of the monitor operation when the
interface is down? (i.e. a 'monitor_force_fail' parameter that, when set
to true, will cause the monitor operation to fail if it determines the
interface is down)
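
Roughly what I have in mind, as an untested sketch of the monitor path
('monitor_force_fail' is the hypothetical new parameter, and
link_is_up() stands in for the interface check the RA already performs):

# hypothetical addition near the end of ethmonitor's monitor action
if ocf_is_true "${OCF_RESKEY_monitor_force_fail}" && ! link_is_up; then
    ocf_exit_reason "Interface ${OCF_RESKEY_interface} is down"
    return $OCF_ERR_GENERIC
fi
# current behaviour: always succeed, only recording the link state in the CIB
return $OCF_SUCCESS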


Being relatively new to pacemaker, I don't know whether this goes 
against RA conventions/practices.




>>> you might use the rather new feature priority-fencing-delay (give the node
>>> that is running valuable resources a benefit in the race) or go for
>>> fence_heuristics_ping (pseudo fence-resource that together with a
>>> fencing-topology prevents the node without access to a certain IP
>>> from fencing the other node).
>>> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>>> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>>>
>>> Klaus


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Klaus Wenninger
On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov wrote:

> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
> >
> > On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
> >>
> >> I had thought about doing that, but the cluster is then dependent on the
> >> external system, and if that external system was to go down or become
> >> unreachable for any reason then it would falsely cause the cluster to
> >> failover or worse it could even take the cluster down completely, if the
> >> external system goes down and both nodes cannot ping it.
> >
> > You wouldn't necessarily have to ban resources from nodes that can't
> > reach the external network. It would be enough to make them prefer
> > the location that has connection. So if both lose connection  one side
> > would still stay up.
> > Not to depend on something really external you might use the
> > router to your external network as ping target.
> > In case of fencing - triggered by whatever - and a potential fence-race
>
> The problem here is that nothing really triggers fencing. What happens, is
>

Got that! Which is why I first gave the hint on how to prevent shutting
down services using ping.
Taking care of what happens when nodes are fenced still makes sense.
Imagine a fence-race where the node running services loses, only to have
the services moved back when it comes up again.

Klaus


>
> - two postgres lose connection over external network, but cluster
> nodes retain connectivity over another network
> - postgres RA compares "latest timestamp" when selecting the best node
> to fail over to
> - primary postgres has better timestamp, so RA simply does not
> consider secondary as suitable for (automatic) failover
>
> The only solution here - as long as fencing node on external
> connectivity loss is acceptable - is modifying ethmonitor RA to fail
> monitor operation in this case.
>
> > you might use the rather new feature priority-fencing-delay (give the
> node
> > that is running valuable resources a benefit in the race) or go for
> > fence_heuristics_ping (pseudo fence-resource that together with a
> > fencing-topology prevents the node without access to a certain IP
> > from fencing the other node).
> >
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> >
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> >
> > Klaus


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Andrei Borzenkov
On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger  wrote:
>
>
>
> On Mon, Nov 15, 2021 at 10:37 AM S Rogers  wrote:
>>
>> I had thought about doing that, but the cluster is then dependent on the
>> external system, and if that external system was to go down or become
>> unreachable for any reason then it would falsely cause the cluster to
>> failover or worse it could even take the cluster down completely, if the
>> external system goes down and both nodes cannot ping it.
>
> You wouldn't necessarily have to ban resources from nodes that can't
> reach the external network. It would be enough to make them prefer
> the location that has connection. So if both lose connection  one side
> would still stay up.
> Not to depend on something really external you might use the
> router to your external network as ping target.
> In case of fencing - triggered by whatever - and a potential fence-race

The problem here is that nothing really triggers fencing. What happens, is

- two postgres lose connection over external network, but cluster
nodes retain connectivity over another network
- postgres RA compares "latest timestamp" when selecting the best node
to fail over to
- primary postgres has better timestamp, so RA simply does not
consider secondary as suitable for (automatic) failover

The only solution here - as long as fencing node on external
connectivity loss is acceptable - is modifying ethmonitor RA to fail
monitor operation in this case.

> you might use the rather new feature priority-fencing-delay (give the node
> that is running valuable resources a benefit in the race) or go for
> fence_heuristics_ping (pseudo fence-resource that together with a
> fencing-topology prevents the node without access to a certain IP
> from fencing the other node).
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>
> Klaus


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Klaus Wenninger
On Mon, Nov 15, 2021 at 10:37 AM S Rogers  wrote:

> I had thought about doing that, but the cluster is then dependent on the
> external system, and if that external system was to go down or become
> unreachable for any reason then it would falsely cause the cluster to
> failover or worse it could even take the cluster down completely, if the
> external system goes down and both nodes cannot ping it.
>
You wouldn't necessarily have to ban resources from nodes that can't
reach the external network. It would be enough to make them prefer
the location that has connection. So if both lose connection  one side
would still stay up.
Not to depend on something really external you might use the
router to your external network as ping target.
In case of fencing - triggered by whatever - and a potential fence-race
you might use the rather new feature priority-fencing-delay (give the node
that is running valuable resources a benefit in the race) or go for
fence_heuristics_ping (pseudo fence-resource that together with a
fencing-topology prevents the node without access to a certain IP
from fencing the other node).
https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
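
Untested sketches of both ideas in pcs syntax (the ping target, scores,
and reuse of the fence agents from the posted config are only
illustrative):

# prefer - rather than require - a node that can reach the router
pcs resource create ping ocf:pacemaker:ping host_list=192.168.50.254 dampen=10s clone
pcs constraint location public_virtual_ip rule score=1000 pingd gt 0

# and/or: pair a ping heuristic with the real agent in one topology level,
# so a node that cannot reach the target never wins the fence race
pcs stonith create fence_ping fence_heuristics_ping ping_targets=192.168.50.254
pcs stonith level add 1 node1.local fence_ping,node1_fence_agent
pcs stonith level add 1 node2.local fence_ping,node2_fence_agent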

Klaus


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread S Rogers
I had thought about doing that, but the cluster is then dependent on the 
external system, and if that external system was to go down or become 
unreachable for any reason then it would falsely cause the cluster to 
failover or worse it could even take the cluster down completely, if the 
external system goes down and both nodes cannot ping it.



Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Strahil Nikolov via Users
Have you tried with ping and a location constraint for avoiding hosts
that cannot ping an external system?
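
E.g. something like this (untested, pcs syntax, ping target
illustrative):

pcs resource create ping ocf:pacemaker:ping host_list=192.168.50.254 clone
# keep the VIP (and, via the colocation, the promoted pgsql) off any host
# that cannot ping the external system
pcs constraint location public_virtual_ip rule score=-INFINITY not_defined pingd or pingd lte 0
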
Best Regards,
Strahil Nikolov

On Mon, Nov 15, 2021 at 0:07, S Rogers wrote:
> Using on-fail=fence is what I initially tried, but it doesn't work
> unfortunately.
>
> It looks like this is because the ethmonitor monitor operation won't
> actually fail when it detects a downed interface. It'll only fail if it
> is unable to update the CIB, as per this comment:
> https://github.com/ClusterLabs/resource-agents/blob/4824a7a83765a0596b7d9856d00102f53c8ce123/heartbeat/ethmonitor#L518



Re: [ClusterLabs] Fence node when network interface goes down

2021-11-14 Thread S Rogers
Using on-fail=fence is what I initially tried, but it doesn't work 
unfortunately.


It looks like this is because the ethmonitor monitor operation won't 
actually fail when it detects a downed interface. It'll only fail if it 
is unable to update the CIB, as per this comment: 
https://github.com/ClusterLabs/resource-agents/blob/4824a7a83765a0596b7d9856d00102f53c8ce123/heartbeat/ethmonitor#L518




Re: [ClusterLabs] Fence node when network interface goes down

2021-11-14 Thread S Rogers
The mentioned error occurs when attempting to promote the PostgreSQL 
resource on the standby node, after the master PostgreSQL resource is 
stopped.


For info, here is my configuration:

Corosync Nodes:
 node1.local node2.local
Pacemaker Nodes:
 node1.local node2.local

Resources:
 Clone: public_network_monitor-clone
  Resource: public_network_monitor (class=ocf provider=heartbeat type=ethmonitor)
   Attributes: interface=eth0 link_status_only=true name=ethmonitor-public
   Operations: monitor interval=10s timeout=60s (public_network_monitor-monitor-interval-10s)
               start interval=0s timeout=60s (public_network_monitor-start-interval-0s)
               stop interval=0s timeout=20s (public_network_monitor-stop-interval-0s)
 Clone: pgsqld-clone
  Meta Attrs: notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/12/bin datadir=/var/lib/postgresql/12/main pgdata=/etc/postgresql/12/main
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: public_virtual_ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=192.168.50.3 nic=mgnet0
  Operations: monitor interval=30s (public_virtual_ip-monitor-interval-30s)
              start interval=0s timeout=20s (public_virtual_ip-start-interval-0s)
              stop interval=0s timeout=20s (public_virtual_ip-stop-interval-0s)

Stonith Devices:
 Resource: node1_fence_agent (class=stonith type=fence_ssh)
  Attributes: hostname=192.168.60.1 pcmk_delay_base=15 pcmk_host_list=node1.local user=root
  Operations: monitor interval=60s (node1_fence_agent-monitor-interval-60s)
 Resource: node2_fence_agent (class=stonith type=fence_ssh)
  Attributes: hostname=192.168.60.2 pcmk_host_list=node2.local user=root
  Operations: monitor interval=60s (node2_fence_agent-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: node1_fence_agent
    Disabled on: node1.local (score:-INFINITY) (id:location-node1_fence_agent-node1.local--INFINITY)
  Resource: node2_fence_agent
    Disabled on: node2.local (score:-INFINITY) (id:location-node2_fence_agent-node2.local--INFINITY)
  Resource: public_virtual_ip
    Constraint: location-public_virtual_ip
      Rule: score=INFINITY (id:location-public_virtual_ip-rule)
        Expression: ethmonitor-public eq 1 (id:location-public_virtual_ip-rule-expr)

Ordering Constraints:
  promote pgsqld-clone then start public_virtual_ip (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-public_virtual_ip-Mandatory)
  demote pgsqld-clone then stop public_virtual_ip (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-public_virtual_ip-Mandatory-1)

Colocation Constraints:
  public_virtual_ip with pgsqld-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-public_virtual_ip-pgsqld-clone-INFINITY)

Ticket Constraints:


This is my understanding of the sequence of events:

1. Node1 is running the PostgreSQL resource as master, Node2 is running 
the PostgreSQL resource as standby. Everything is working okay at this 
point.
2. On Node1, the public network goes down and ethmonitor changes the 
ethmonitor-public node attribute from 1 to 0.
3. The location-public_virtual_ip constraint (which requires the IP to 
run on a node with ethmonitor-public==1) kicks in, and pacemaker demotes 
the master PostgreSQL resource so that it can then promote it on Node2.
4. The primary PostgreSQL instance on Node1 attempts to shut down in
response to the demotion, but it can't connect to the standby so is
unable to stop cleanly. The PostgreSQL resource shows as demoting for 60
seconds, as below:


Clone Set: pgsqld-clone [pgsqld] (promotable)
 pgsqld (ocf::heartbeat:pgsqlms):   Demoting node1.local
 Slaves: [ node2.local ]

5. After a minute, the demotion finishes and pacemaker attempts to 
promote the PostgreSQL resource on Node2. This action fails with the 
"Switchover has been canceled from pre-promote action" error, because 
the standby didn't receive the final WAL activity from the primary.
6. Due to the failed promotion on Node2, PAF/Pacemaker promotes the
PostgreSQL resource on Node1 again. However, due to the public network
interface being down, the PostgreSQL and virtual IP resources provided
by the HA cluster are now completely unavailable.
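
For completeness, the attribute flip in step 2 can be watched directly
while the interface is down, e.g.:

# query the transient node attribute that ethmonitor maintains (1 = link up, 0 = down)
attrd_updater --query --name ethmonitor-public --node node1.local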



Re: [ClusterLabs] Fence node when network interface goes down

2021-11-12 Thread Ken Gaillot
On Fri, 2021-11-12 at 17:31, S Rogers wrote:
> Hi, I'm hoping someone will be able to point me in the right
> direction.
> 
> I am configuring a two-node active/passive cluster that utilises the
> PostgreSQL PAF resource agent. Each node has two NICs, therefore the
> cluster is configured with two corosync links - one on each network
> (one network is the public network, the other is effectively private
> and just used for cluster communication). The cluster has a virtual
> IP resource, which has a colocation constraint to keep it together
> with the primary Postgres instance.
> 
> I am trying to protect against the scenario where the public network
> interface on the active node goes down, in which case I want a
> failover to occur and the other node to take over and host the
> primary Postgres instance and the public virtual IP. My current
> approach is to use ocf:heartbeat:ethmonitor to monitor the public
> interface along with a location constraint to ensure that the virtual
> IP must be on a node where the public interface is UP.
> 
> With this configuration, if I disconnect the active node from the
> public network, Pacemaker attempts to move the primary PostgreSQL and
> virtual IP to the other node. The problem is that it attempts to stop
> the resources gracefully, which causes the pgsql resource to error
> with "Switchover has been canceled from pre-promote action" (which I
> believe is because PostgreSQL shuts down, but can't communicate with
> the standby during the shutdown - a similar situation to what is
> described here: https://github.com/ClusterLabs/PAF/issues/149)
> 
> Ideally, if the public network interface on the active node goes down
> I would want to take that node offline (either fence it or put it in
> standby mode, so that no resources can run on it), leaving just the
> other node in the cluster as the active node. Then the old primary
> can be rebuilt from the new primary in order to join the cluster
> again. However, I can't figure out a way to cause the active node to
> be fenced as a result of ocf:heartbeat:ethmonitor detecting that the
> interface has gone down.
> 
> Does anyone have any ideas/pointers how I could achieve this, or an
> alternative approach?
> 
> Hopefully that makes sense. Any help is appreciated!
> 
> Thanks.

Failure handling is configurable via the on-fail meta-attribute. You
can set on-fail=fence for the ethmonitor resource's monitor action to
fence the node if the monitor fails. There's also on-fail=standby, but
that will still try to stop any active resources gracefully, so it
doesn't help in this case.
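
For example, assuming the ethmonitor resource is named
public_network_monitor (untested):

# fence the node as soon as the ethmonitor monitor action reports a failure
pcs resource update public_network_monitor op monitor interval=10s timeout=60s on-fail=fence
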
-- 
Ken Gaillot 



Re: [ClusterLabs] Fence node when network interface goes down

2021-11-12 Thread Andrei Borzenkov
On 12.11.2021 20:31, S Rogers wrote:
> Hi, I'm hoping someone will be able to point me in the right direction.
> 
> I am configuring a two-node active/passive cluster that utilises the
> PostgreSQL PAF resource agent. Each node has two NICs, therefore the
> cluster is configured with two corosync links - one on each network (one
> network is the public network, the other is effectively private and just
> used for cluster communication). The cluster has a virtual IP resource,
> which has a colocation constraint to keep it together with the primary
> Postgres instance.
> 
> I am trying to protect against the scenario where the public network
> interface on the active node goes down, in which case I want a failover to
> occur and the other node to take over and host the primary Postgres
> instance and the public virtual IP. My current approach is to use
> ocf:heartbeat:ethmonitor to monitor the public interface along with a
> location constraint to ensure that the virtual IP must be on a node where
> the public interface is UP.
> 
> With this configuration, if I disconnect the active node from the public
> network, Pacemaker attempts to move the primary PostgreSQL and virtual IP
> to the other node. The problem is that it attempts to stop the resources
> gracefully, which causes the pgsql resource to error with "Switchover has
> been canceled from pre-promote action" (which I believe is because
> PostgreSQL shuts down, but can't communicate with the standby during the
> shutdown - a similar situation to what is described here:
> https://github.com/ClusterLabs/PAF/issues/149)
> 
> Ideally, if the public network interface on the active node goes down I
> would want to take that node offline (either fence it or put it in standby
> mode, so that no resources can run on it), leaving just the other node in
> the cluster as the active node. Then the old primary can be rebuilt from
> the new primary in order to join the cluster again. However, I can't figure
> out a way to cause the active node to be fenced as a result of
> ocf:heartbeat:ethmonitor detecting that the interface has gone down.
> 
> Does anyone have any ideas/pointers how I could achieve this, or an
> alternative approach?
> 

If stopping a resource fails, the default pacemaker reaction is to fence
the node. Assuming "causes the pgsql resource to error" means "stopping
the resource fails", it should already do what you want. Show logs from
both nodes around the time you simulate the error.
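
E.g. with crm_report, which collects logs and cluster state from all
nodes in one go (time window illustrative):

crm_report -f "2021-11-12 17:00:00" -t "2021-11-12 18:00:00" /tmp/fence-test-report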

> Hopefully that makes sense. Any help is appreciated!
> 
> Thanks.
> 
> 


[ClusterLabs] Fence node when network interface goes down

2021-11-12 Thread S Rogers
Hi, I'm hoping someone will be able to point me in the right direction.

I am configuring a two-node active/passive cluster that utilises the
PostgreSQL PAF resource agent. Each node has two NICs, therefore the
cluster is configured with two corosync links - one on each network (one
network is the public network, the other is effectively private and just
used for cluster communication). The cluster has a virtual IP resource,
which has a colocation constraint to keep it together with the primary
Postgres instance.
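
For reference, the corosync side of this is just two ring addresses per
node, roughly (addresses illustrative):

nodelist {
    node {
        name: node1.local
        nodeid: 1
        ring0_addr: 192.168.50.1   # public network
        ring1_addr: 192.168.60.1   # private cluster network
    }
    node {
        name: node2.local
        nodeid: 2
        ring0_addr: 192.168.50.2
        ring1_addr: 192.168.60.2
    }
}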

I am trying to protect against the scenario where the public network
interface on the active node goes down, in which case I want a failover to
occur and the other node to take over and host the primary Postgres
instance and the public virtual IP. My current approach is to use
ocf:heartbeat:ethmonitor to monitor the public interface along with a
location constraint to ensure that the virtual IP must be on a node where
the public interface is UP.

With this configuration, if I disconnect the active node from the public
network, Pacemaker attempts to move the primary PostgreSQL and virtual IP
to the other node. The problem is that it attempts to stop the resources
gracefully, which causes the pgsql resource to error with "Switchover has
been canceled from pre-promote action" (which I believe is because
PostgreSQL shuts down, but can't communicate with the standby during the
shutdown - a similar situation to what is described here:
https://github.com/ClusterLabs/PAF/issues/149)

Ideally, if the public network interface on the active node goes down I
would want to take that node offline (either fence it or put it in standby
mode, so that no resources can run on it), leaving just the other node in
the cluster as the active node. Then the old primary can be rebuilt from
the new primary in order to join the cluster again. However, I can't figure
out a way to cause the active node to be fenced as a result of
ocf:heartbeat:ethmonitor detecting that the interface has gone down.

Does anyone have any ideas/pointers how I could achieve this, or an
alternative approach?

Hopefully that makes sense. Any help is appreciated!

Thanks.