Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.

2024-05-06 Thread Ken Gaillot
On Mon, 2024-05-06 at 10:05 -0500, Ken Gaillot wrote:
> On Fri, 2024-05-03 at 16:18 +0300, ale...@pavlyuts.ru wrote:
> > Hi,
> > 
> > > > Thanks a lot for your suggestion; probably I need to think about
> > > > this way too. However, the project environment is not a good one
> > > > to rely on fencing, and moreover, we can't control the bottom
> > > > layer in a trusted way.
> > > 
> > > That is a problem. A VM being gone is not the only possible
> > > failure scenario. For example, a kernel or device driver issue
> > > could temporarily freeze the node, or networking could temporarily
> > > drop out, causing the node to appear lost to Corosync, but the
> > > node could be responsive again (with the app running) after the
> > > app has been started on the other node.
> > > 
> > > If there's no problem with the app running on both nodes at the
> > > same time, then that's fine, but that's rarely the case. If an IP
> > > address is needed, or shared storage is used, simultaneous access
> > > will cause problems that only fencing can avoid.
> > Pacemaker uses a very pessimistic approach if you set resources to
> > require quorum.
> > If a network outage triggers changes, it will lose quorum first and
> > after that try to rebuild it. Therefore there are two questions:
> > 1. How to keep the active app running?
> > 2. How to prevent two copies from being started?
> > As for me, quorum-dependent resource management performs well on
> > both points.
> 
> That's fine as long as the cluster is behaving properly. Fencing is
> for when it's not.
> 
> Quorum prevents multiple copies only if the nodes can communicate and
> operate normally. There are many situations when that's not true: a
> device driver or kernel bug locks up a node for more than the Corosync
> token timeout, CPU or I/O load gets high enough for a node to become
> unresponsive for long stretches of time, a failing network controller
> randomly drops large numbers of packets, etc.
> 
> In such situations, a node that appears lost to the other nodes may
> actually just be temporarily unreachable, and may come back at any
> moment (with its resources still active).
> 
> If an IP address is active in more than one location, packets will be
> randomly routed to one node or another, rendering all communication
> via that IP useless. If an application that uses shared storage is
> active in more than one location, data can be corrupted. And so forth.
> 
> Fencing ensures that the lost node is *definitely* not running
> resources before recovering them elsewhere.
> 
> > > > my goal is to keep the app from moves (e.g. restarts) as long as
> > > > possible. This means only two kinds of moves are accepted:
> > > > current host failure (move to the other host with a restart) or
> > > > an admin move (managed move at a certain time with a restart).
> > > > Any other trouble should NOT trigger app down/restart. Except
> > > > for total connectivity loss, where there is no second node and
> > > > no arbiter => stop service.
> > > 
> > > Total connectivity loss may not be permanent. Fencing ensures the
> > > connectivity will not be restored after the app is started
> > > elsewhere.
> > Nothing bad if it is restored and the node is alive, but has the
> > app down because of no quorum.
> 
> Again, that assumes it is operating normally. HA is all about the
> times when it's not.
>  
> > > Pacemaker 2.0.4 and later supports priority-fencing-delay, which
> > > allows the node currently running the app to survive. The node
> > > not running the app will wait the configured amount of time before
> > > trying to fence the other node. Of course that does add more time
> > > to the recovery if the node running the app is really gone.
> > I am not sure about how that works.
> > Imagine a connectivity loss between the nodes only, but not to the
> > other parties, and Node1 runs the app. Everything is well, node2 is
> > off. So, we start Node2 with the intention of restoring the cluster.
> > Node2 starts, tries to find its partner, fails, and fences node1
> > out, while Node1 does not even know that Node2 has started.
> > 
> > Is this correct?
> 
> Before connectivity is lost, the cluster is presumably in a normal
> state. That means the two nodes are passing the Corosync token
> between them.
> 
> When connectivity is lost, both nodes see the token lost at roughly
> the same time. With priority-fencing-delay, node2 waits the specified
> amount of time before fencing node1. node1 does not wait, so it
> fences node2, and node2 never gets the chance to fence node1.

I forgot to mention:

The fence_heuristics_ping fence agent can be used instead of, or in
addition to, priority-fencing-delay. This agent has to be combined with
a "real" fencing agent in a fencing topology. If a node can't ping an
IP (generally the gateway or something just beyond it), it will refuse
to execute the "real" fencing. This gives preference to the node with
better network reachability in a fencing race.
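With pcs, such a topology might be configured roughly like this (the
agent parameters, node names, and the "real" devices fence-node1 and
fence-node2 are illustrative assumptions, not from this thread):

```shell
# Heuristics device: pings the gateway before any real fencing runs
pcs stonith create ping-check fence_heuristics_ping \
    ping_targets="192.168.1.1" pcmk_host_list="node1 node2"

# Put the heuristics check and the real device at the same topology
# level: if ping-check refuses, the whole level fails and the real
# device is never invoked by the isolated node.
pcs stonith level add 1 node1 ping-check,fence-node1
pcs stonith level add 1 node2 ping-check,fence-node2
```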

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.

2024-05-06 Thread Ken Gaillot
On Fri, 2024-05-03 at 16:18 +0300, ale...@pavlyuts.ru wrote:
> Hi,
> 
> > > Thanks a lot for your suggestion; probably I need to think about
> > > this way too. However, the project environment is not a good one
> > > to rely on fencing, and moreover, we can't control the bottom
> > > layer in a trusted way.
> > 
> > That is a problem. A VM being gone is not the only possible failure
> > scenario. For example, a kernel or device driver issue could
> > temporarily freeze the node, or networking could temporarily drop
> > out, causing the node to appear lost to Corosync, but the node could
> > be responsive again (with the app running) after the app has been
> > started on the other node.
> > 
> > If there's no problem with the app running on both nodes at the same
> > time, then that's fine, but that's rarely the case. If an IP address
> > is needed, or shared storage is used, simultaneous access will cause
> > problems that only fencing can avoid.
> Pacemaker uses a very pessimistic approach if you set resources to
> require quorum.
> If a network outage triggers changes, it will lose quorum first and
> after that try to rebuild it. Therefore there are two questions:
> 1. How to keep the active app running?
> 2. How to prevent two copies from being started?
> As for me, quorum-dependent resource management performs well on both
> points.

That's fine as long as the cluster is behaving properly. Fencing is for
when it's not.

Quorum prevents multiple copies only if the nodes can communicate and
operate normally. There are many situations when that's not true: a
device driver or kernel bug locks up a node for more than the Corosync
token timeout, CPU or I/O load gets high enough for a node to become
unresponsive for long stretches of time, a failing network controller
randomly drops large numbers of packets, etc.
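For context, the Corosync token timeout referred to above is the
`token` setting in corosync.conf; a minimal sketch (the cluster name
and the value are illustrative, not from this thread):

```
totem {
    version: 2
    cluster_name: mycluster
    # milliseconds without the token before peers declare this node lost
    token: 3000
}
```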

In such situations, a node that appears lost to the other nodes may
actually just be temporarily unreachable, and may come back at any
moment (with its resources still active).

If an IP address is active in more than one location, packets will be
randomly routed to one node or another, rendering all communication via
that IP useless. If an application that uses shared storage is active
in more than one location, data can be corrupted. And so forth.

Fencing ensures that the lost node is *definitely* not running
resources before recovering them elsewhere.

> 
> > > my goal is to keep the app from moves (e.g. restarts) as long as
> > > possible. This means only two kinds of moves are accepted: current
> > > host failure (move to the other host with a restart) or an admin
> > > move (managed move at a certain time with a restart). Any other
> > > trouble should NOT trigger app down/restart. Except for total
> > > connectivity loss, where there is no second node and no arbiter
> > > => stop service.
> > 
> > Total connectivity loss may not be permanent. Fencing ensures the
> > connectivity will not be restored after the app is started elsewhere.
> Nothing bad if it is restored and the node is alive, but has the app
> down because of no quorum.

Again, that assumes it is operating normally. HA is all about the times
when it's not.
 
> > Pacemaker 2.0.4 and later supports priority-fencing-delay, which
> > allows the node currently running the app to survive. The node not
> > running the app will wait the configured amount of time before
> > trying to fence the other node. Of course that does add more time
> > to the recovery if the node running the app is really gone.
> I am not sure about how that works.
> Imagine a connectivity loss between the nodes only, but not to the
> other parties, and Node1 runs the app. Everything is well, node2 is
> off. So, we start Node2 with the intention of restoring the cluster.
> Node2 starts, tries to find its partner, fails, and fences node1 out,
> while Node1 does not even know that Node2 has started.
> 
> Is this correct?

Before connectivity is lost, the cluster is presumably in a normal
state. That means the two nodes are passing the Corosync token between
them.

When connectivity is lost, both nodes see the token lost at roughly the
same time. With priority-fencing-delay, node2 waits the specified
amount of time before fencing node1. node1 does not wait, so it fences
node2, and node2 never gets the chance to fence node1.
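As a configuration sketch (the resource name and the timing values are
illustrative assumptions, not from this thread): the app resource gets
a priority, and the cluster-wide priority-fencing-delay makes the node
with the lower total priority wait:

```shell
# Give the app resource a priority; the node running it counts that
# priority when fencing delays are computed.
pcs resource update my-app meta priority=10

# The node with the lower total priority (i.e. NOT running the app)
# waits this long before fencing, so the app's node wins the race.
pcs property set priority-fencing-delay=15s
```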

> 
> > > Therefore, quorum-based management seems better way for my exact
> > > case.
> > 
> > Unfortunately it's unsafe without fencing.
> You may say I am stupid, but I really can't understand why quorum-
> based resource management is unreliable without fencing.
> One host may hold quorum a bit longer after another host has gained
> quorum and started the app. Probably that can happen.
> But fencing is not immediate either, so it can't protect 100% against
> short-time parallel runs.

Certainly -- with fencing enabled, the cluster will not recover
resources elsewhere until fencing succeeds.

> 
> > That does complicate the situation. Ideally there would be some way
> > to request
> > 

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.

2024-05-06 Thread Klaus Wenninger
On Fri, May 3, 2024 at 8:59 PM  wrote:

> Hi,
>
> > > Also, I've done a wireshark capture and found a great mess in TCP;
> > > it seems like the connection between qdevice and qnetd really stops
> > > for some time and packets aren't delivered.
> >
> > Could you check UDP? I guess there are a lot of UDP packets sent by
> > corosync which probably keep TCP from getting through.
> Very improbable. UDP itself can't prevent TCP from working, and 1 Gbps
> links seem too wide for corosync to overload them.
> Also, overload usually leads to SOME packets being dropped, but this is
> an entirely different case: NO TCP packets passed. I got two captures
> from the two sides, and I see that for some time each party sends TCP
> packets, but the other party does not receive them at all.
>
> > >
> > > My guess is that it matches corosync syncing activity, and I
> > > suspect that corosync prevents any other traffic on the interface
> > > it uses for its rings.
> > >
> > > As I switched qnetd and qdevice to use a different interface, it
> > > seems to work fine.
> >
> > Actually, having a dedicated interface just for corosync/knet
> > traffic is the optimal solution. qdevice+qnetd, on the other hand,
> > should be as close to the "customer" as possible.
> >
> I am sure qnetd is not intended as a proof of network reachability; it
> is only an arbiter to provide quorum resolution. Therefore, in my view,
> it is better to keep it on the intra-cluster network with a
> high-priority transport. If we need a solution based on network
> reachability, there are other ways to provide it.
>

This is an example of how you could use network reachability to give
preference to the node with better reachability in a 2-node fencing
race. There is text in the code that should give you an idea of how it
is supposed to work:
https://github.com/ClusterLabs/fence-agents/blob/main/agents/heuristics_ping/fence_heuristics_ping.py
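The heart of that agent is a simple rule; here is a rough Python sketch
of the idea (not the actual agent code; `probe` and `allow_fencing` are
names invented for illustration):

```python
import subprocess

def probe(target, count=2, timeout=2):
    """Ping one target; True if it answered (assumes a Linux `ping`)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), target],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def allow_fencing(reachable):
    """reachable: one boolean per configured ping target.

    A node that cannot reach any target is likely the isolated one,
    so it refuses to execute the "real" fencing agent below it.
    """
    return any(reachable)

# Illustrative usage: ping the gateway, then decide.
# targets = ["192.168.1.1"]
# ok = allow_fencing([probe(t) for t in targets])
```

In the real agent, refusing translates into exiting non-zero for the
fencing action, so the fencing-topology level fails on the isolated
node and the peer is never fenced by it.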

If you think of combining it with priority-fencing ...
Of course this idea can be applied to other ways of evaluating a
running node. I implemented fence_heuristics_ping back then both for an
explicit use-case and to convey the basic idea, having in mind that
others might come up with different examples.

I guess the main idea of having qdevice+qnetd outside of each of the
2 data-centers (if we're talking about a scenario of this kind) is to
be able to cover the case where one of these data-centers becomes
disconnected for whatever reason. Please correct me if there is more to
it! In this scenario you could use e.g. SBD watchdog-fencing to be able
to safely recover resources from a disconnected data-center (or site of
any kind).
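A sketch of what that watchdog-fencing setup could look like (the
device path and timeout values are illustrative assumptions):

```shell
# /etc/sysconfig/sbd -- diskless SBD using only a hardware watchdog
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

# Tell Pacemaker that a lost node can be assumed dead (self-fenced by
# its watchdog) after this long, then recover its resources elsewhere.
pcs property set stonith-watchdog-timeout=10s
pcs property set stonith-enabled=true
```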

Klaus


> > So if you could have two interfaces (one just for corosync, a second
> > for qnetd+qdevice+publicly accessible services), it might be a solution?
> >
> Yes, this way it works, but I wish to know WHY it won't work on the shared
> interface.
>
> > > So, the question is: does corosync really temporarily block any
> > > other traffic on the interface it uses? Or is it just a
> > > coincidence? If it blocks, is
> >
> > Nope, no "blocking". But it sends quite a few UDP packets and I guess
> > it can really use all the available bandwidth so no TCP gets through.
> Use all of the available 1 Gbps? Impossible.
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/