Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
> > So, we start Node2 with intention to restore cluster. Node2 starts
> > and trying to find its partner, failure and fence node1 out. While
> > Node1 not even know about Node2 starts.
> >
> > Is it correct?
>
> Before connectivity is lost, the cluster is presumably in a normal
> state. That means the two nodes are passing the Corosync token between
> them.
>
> When connectivity is lost, both nodes see the token lost at roughly
> the same time. With priority-fencing-delay, node2 waits the specified
> amount of time before fencing node1. node1 does not wait, so it fences
> node2, and node2 never gets the chance to fence node1.

I forgot to mention: the fence_heuristics_ping fence agent can be used
instead of, or in addition to, priority-fencing-delay. This agent has
to be combined with a "real" fencing agent in a fencing topology. If a
node can't ping an IP (generally the gateway or something just beyond
it), it will refuse to execute the "real" fencing. This is intended for
services that need to be externally available.

If you do have quorum in addition to fencing, only the node with quorum
will initiate fencing, so you don't need any fencing delay or
heuristics in that case.
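A fencing topology of the kind described above could be sketched with pcs roughly as follows. This is an illustration only: the device names, the ping target, and the use of fence_xvm as the "real" agent are assumptions, not part of the original mails.

```shell
# Level 1: fence_heuristics_ping runs first; if the fencing node cannot
# ping the target, the heuristic fails and the "real" device is never
# executed for that node.
pcs stonith create ping-gate fence_heuristics_ping \
    ping_targets="192.168.100.254"    # gateway, or something just beyond it

# The "real" fencing device (fence_xvm here is an assumption):
pcs stonith create xvm-fence fence_xvm \
    pcmk_host_map="node1:vm-node1;node2:vm-node2"

# Both devices at the same topology level, heuristic first:
pcs stonith level add 1 node1 ping-gate,xvm-fence
pcs stonith level add 1 node2 ping-gate,xvm-fence
```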
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
> You may say I am stupid, but I really can't understand why quorum-
> based resource management is unreliable without fencing.
> May a host hold quorum bit longer than another host got quorum and
> run app. Probably, it may do this.
> But fencing is not immediate too. So, it can't protect for 100% from
> short-time parallel runs.

Certainly -- with fencing enabled, the cluster will not recover
resources elsewhere until fencing succeeds.

> > That does complicate the situation. Ideally there would be some way
> > to request the VM to be immediately destroyed (whether via
> > fence_xvm, a cloud provider API, or similar).
>
> What you mean by "destroyed"? Mean get down?

Correct. For fencing purposes, it should not be a clean shutdown but an
immediate halt.
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
On Fri, May 3, 2024 at 8:59 PM wrote:

> Hi,
>
> > > Also, I've done wireshark capture and found great mess in TCP, it
> > > seems like connection between qdevice and qnetd really stops for
> > > some time and packets won't deliver.
> >
> > Could you check UDP? I guess there is a lot of UDP packets sent by
> > corosync which probably makes TCP to not go thru.
>
> Very improbably. UDP itself can't prevent TCP from working, and 1Gbps
> links seem too wide for corosync to overload them.
> Also, overload usually leads to SOME packets drop, but this is an
> absolutely different case: NO TCP packet passed. I got two captures
> from the two sides and I see that for some time each party sends TCP
> packets, but the other party does not receive them at all.
>
> > > For my guess, it match corosync syncing activities, and I suspect
> > > that corosync prevent any other traffic on the interface it use
> > > for rings.
> > >
> > > As I switch qnetd and qdevice to use different interface it seems
> > > to work fine.
> >
> > Actually having dedicated interface just for corosync/knet traffic
> > is optimal solution. qdevice+qnetd on the other hand should be as
> > close to "customer" as possible.
>
> I am sure qnetd is not intended as proof of network reachability, it
> is only an arbiter to provide quorum resolution. Therefore, as for me
> it is better to keep it on the intra-cluster network with high
> priority transport. If we need to make a solution based on network
> reachability, there are other ways to provide it.

This is an example how you could use network reachability to give
preference to a node with better reachability in a 2-node-fencing-race.
There is text in the code that should give you an idea how it is
supposed to work:

https://github.com/ClusterLabs/fence-agents/blob/main/agents/heuristics_ping/fence_heuristics_ping.py

If you think of combining with priority-fencing ... Of course this idea
can be applied for other ways of evaluation of a running node.

I did implement fence_heuristics_ping both for an explicit use-case and
to convey the basic idea back then - having in mind that others might
come up with different examples.

Guess the main idea of having qdevice+qnetd outside of each of the 2
data-centers (if we're talking about a scenario of this kind) is to be
able to cover the case where one of these data-centers becomes
disconnected for whatever reason. Correct me please if there is more to
it! In this scenario you could use e.g. SBD watchdog-fencing to be able
to safely recover resources from a disconnected data-center (or site of
any kind).

Klaus

> > So if you could have two interfaces (one just for corosync, second
> > for qnetd+qdevice+publicly accessible services) it might be a
> > solution?
>
> Yes, this way it works, but I wish to know WHY it won't work on the
> shared interface.
>
> > > So, the question is: does corosync really temporary blocks any
> > > other traffic on the interface it uses? Or it is just a
> > > coincidence? If it blocks, is there a way to manage it?
> >
> > Nope, no "blocking". But it sends quite a few UDP packets and I
> > guess it can really use all available bandwidth so no TCP goes thru.
>
> Use all available 1GBps? Impossible.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
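Klaus's SBD suggestion could look roughly like this in a diskless (watchdog-only) setup. The watchdog device, timeout values, and file paths below are assumptions for illustration:

```shell
# /etc/default/sbd (Ubuntu/Debian path; RHEL uses /etc/sysconfig/sbd):
#   SBD_WATCHDOG_DEV=/dev/watchdog
#   SBD_WATCHDOG_TIMEOUT=5

systemctl enable sbd    # sbd must start together with the cluster stack

# Tell Pacemaker that a node which loses quorum self-fences via its
# hardware watchdog within the given time:
pcs property set stonith-watchdog-timeout=10s
pcs property set stonith-enabled=true
```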
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
Hi,

> > Also, I've done wireshark capture and found great mess in TCP, it
> > seems like connection between qdevice and qnetd really stops for
> > some time and packets won't deliver.
>
> Could you check UDP? I guess there is a lot of UDP packets sent by
> corosync which probably makes TCP to not go thru.

Very improbably. UDP itself can't prevent TCP from working, and 1Gbps
links seem too wide for corosync to overload them.
Also, overload usually leads to SOME packets drop, but this is an
absolutely different case: NO TCP packet passed. I got two captures
from the two sides and I see that for some time each party sends TCP
packets, but the other party does not receive them at all.

> > For my guess, it match corosync syncing activities, and I suspect
> > that corosync prevent any other traffic on the interface it use for
> > rings.
> >
> > As I switch qnetd and qdevice to use different interface it seems to
> > work fine.
>
> Actually having dedicated interface just for corosync/knet traffic is
> optimal solution. qdevice+qnetd on the other hand should be as close
> to "customer" as possible.

I am sure qnetd is not intended as proof of network reachability, it is
only an arbiter to provide quorum resolution. Therefore, as for me it
is better to keep it on the intra-cluster network with high priority
transport. If we need to make a solution based on network reachability,
there are other ways to provide it.

> So if you could have two interfaces (one just for corosync, second for
> qnetd+qdevice+publicly accessible services) it might be a solution?

Yes, this way it works, but I wish to know WHY it won't work on the
shared interface.

> > So, the question is: does corosync really temporary blocks any other
> > traffic on the interface it uses? Or it is just a coincidence? If it
> > blocks, is there a way to manage it?
>
> Nope, no "blocking". But it sends quite a few UDP packets and I guess
> it can really use all available bandwidth so no TCP goes thru.

Use all available 1GBps? Impossible.
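The "impossible" claim can be backed with rough arithmetic. The packet rate and size below are illustrative assumptions (a quiet two-node totem exchange at the 50 ms heartbeat mentioned elsewhere in the thread), not measurements:

```python
# Back-of-envelope: steady-state corosync traffic vs. a 1 Gbps link.
# All figures are assumptions for illustration only.

LINK_BPS = 1_000_000_000      # 1 Gbps link

packets_per_sec = 40          # ~2 nodes x 20 token/heartbeat pkts/s (assumed)
bytes_per_packet = 1500       # worst case: every packet a full-MTU frame

corosync_bps = packets_per_sec * bytes_per_packet * 8
share = corosync_bps / LINK_BPS

print(f"corosync: {corosync_bps} bps = {share:.4%} of the link")
# -> corosync: 480000 bps = 0.0480% of the link
# A fraction of a percent: ordinary totem traffic alone cannot starve
# TCP on a 1 Gbps link, so the observed stall must have another cause.
```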
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
Hi,

> > Thanks great for your suggestion, probably I need to think about
> > this way too, however, the project environment is not a good one to
> > rely on fencing and, moreover, we can't control the bottom layer a
> > trusted way.
>
> That is a problem. A VM being gone is not the only possible failure
> scenario. For example, a kernel or device driver issue could
> temporarily freeze the node, or networking could temporarily drop out,
> causing the node to appear lost to Corosync, but the node could be
> responsive again (with the app running) after the app has been started
> on the other node.
>
> If there's no problem with the app running on both nodes at the same
> time, then that's fine, but that's rarely the case. If an IP address
> is needed, or shared storage is used, simultaneous access will cause
> problems that only fencing can avoid.

Pacemaker uses a very pessimistic approach if you set resources to
require quorum. If a network outage triggers changes, it will ruin
quorum first and after that try to rebuild it. Therefore there are two
questions:
1. How to keep the active app running?
2. How to prevent two copies being started?
As for me, quorum-dependent resource management performs well on both
points.

> > my goal is to keep the app from moves (e.g. restarts) as long as
> > possible. This means only two kinds of moves accepted: current host
> > fail (move to other with restart) or admin move (managed move at
> > certain time with restart). Any other troubles should NOT trigger
> > app down/restart. Except of total connectivity loss where no second
> > node, no arbiter => stop service.
>
> Total connectivity loss may not be permanent. Fencing ensures the
> connectivity will not be restored after the app is started elsewhere.

Nothing bad if it is restored and the node alive, but got the app down
because of no quorum.

> Pacemaker 2.0.4 and later supports priority-fencing-delay which allows
> the node currently running the app to survive. The node not running
> the app will wait the configured amount of time before trying to fence
> the other node. Of course that does add more time to the recovery if
> the node running the app is really gone.

I feel I am not sure about how it works. Imagine just a connectivity
loss between the nodes but not to the other parts. And Node1 runs the
app. Everything well, node2 off. So, we start Node2 with intention to
restore the cluster. Node2 starts and trying to find its partner,
failure, and fence node1 out. While Node1 not even know about Node2
starts.

Is it correct?

> > Therefore, quorum-based management seems better way for my exact
> > case.
>
> Unfortunately it's unsafe without fencing.

You may say I am stupid, but I really can't understand why quorum-based
resource management is unreliable without fencing.
May a host hold quorum bit longer than another host got quorum and run
app. Probably, it may do this.
But fencing is not immediate too. So, it can't protect for 100% from
short-time parallel runs.

> That does complicate the situation. Ideally there would be some way to
> request the VM to be immediately destroyed (whether via fence_xvm, a
> cloud provider API, or similar).

What you mean by "destroyed"? Mean get down?
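The fence-race question above can be reasoned about with a toy timeline model of priority-fencing-delay. This is purely illustrative: the delay semantics are simplified, and real behavior also depends on quorum and on the fence device being reachable.

```python
# Toy model of a two-node fence race under priority-fencing-delay.
# Simplifying assumption: the node NOT running the prioritized resource
# waits `priority_delay` seconds before shooting; fencing itself takes
# `fence_exec` seconds on both nodes.

def fence_race_winner(app_on: str, priority_delay: float,
                      fence_exec: float = 1.0) -> str:
    """Both nodes see the token lost at t=0; return the survivor."""
    nodes = ("node1", "node2")
    shot_lands = {
        n: (0.0 if n == app_on else priority_delay) + fence_exec
        for n in nodes
    }
    # Whoever's shot lands first kills the other node.
    return min(nodes, key=shot_lands.get)

print(fence_race_winner(app_on="node1", priority_delay=5.0))  # -> node1
print(fence_race_winner(app_on="node2", priority_delay=5.0))  # -> node2
# With priority_delay=0 both shots land at the same time: a true race,
# which is exactly what the delay is meant to break.
```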
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
Hi,

some of your findings are really interesting.

On 02/05/2024 01:56, ale...@pavlyuts.ru wrote:
> Hi All,
>
> I am trying to build application-specific 2-node failover cluster
> using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, knet
> transport.

...

> Also, I've done wireshark capture and found great mess in TCP, it
> seems like connection between qdevice and qnetd really stops for some
> time and packets won't deliver.

Could you check UDP? I guess there is a lot of UDP packets sent by
corosync which probably makes TCP to not go thru.

> For my guess, it match corosync syncing activities, and I suspect that
> corosync prevent any other traffic on the interface it use for rings.
>
> As I switch qnetd and qdevice to use different interface it seems to
> work fine.

Actually having dedicated interface just for corosync/knet traffic is
optimal solution. qdevice+qnetd on the other hand should be as close to
"customer" as possible.

So if you could have two interfaces (one just for corosync, second for
qnetd+qdevice+publicly accessible services) it might be a solution?

> So, the question is: does corosync really temporary blocks any other
> traffic on the interface it uses? Or it is just a coincidence? If it
> blocks, is there a way to manage it?

Nope, no "blocking". But it sends quite a few UDP packets and I guess
it can really use all available bandwidth so no TCP goes thru.

Honza

> Thank you for any suggest on that!
>
> Sincerely,
>
> Alex
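Honza's two-interface suggestion would look roughly like this in corosync.conf with the knet transport. The addresses, and the choice of dedicating link 0 to cluster traffic, are assumptions for illustration:

```
# corosync.conf sketch (knet transport, two links per node)
nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 10.0.0.1       # dedicated corosync-only network
        ring1_addr: 192.168.100.1  # shared with qdevice/public services
    }
    node {
        name: node2
        nodeid: 2
        ring0_addr: 10.0.0.2
        ring1_addr: 192.168.100.2
    }
}
# knet fails over between links automatically; per-link preference can
# be tuned with knet_link_priority in the totem interface subsections.
```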
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
On Thu, 2024-05-02 at 22:56 +0300, ale...@pavlyuts.ru wrote:
> Dear Ken,
>
> First of all, there no fencing at all, it is off.
>
> Thanks great for your suggestion, probably I need to think about this
> way too, however, the project environment is not a good one to rely
> on fencing and, moreover, we can't control the bottom layer a trusted
> way.

That is a problem. A VM being gone is not the only possible failure
scenario. For example, a kernel or device driver issue could
temporarily freeze the node, or networking could temporarily drop out,
causing the node to appear lost to Corosync, but the node could be
responsive again (with the app running) after the app has been started
on the other node.

If there's no problem with the app running on both nodes at the same
time, then that's fine, but that's rarely the case. If an IP address is
needed, or shared storage is used, simultaneous access will cause
problems that only fencing can avoid.

> As I understand, fence_xvm just kills VM that not inside the quorum
> part, or, in a case of two-host just one survive who shoot first. But

Correct

> my goal is to keep the app from moves (e.g. restarts) as long as
> possible. This means only two kinds of moves accepted: current host
> fail (move to other with restart) or admin move (managed move at
> certain time with restart). Any other troubles should NOT trigger app
> down/restart. Except of total connectivity loss where no second node,
> no arbiter => stop service.

Total connectivity loss may not be permanent. Fencing ensures the
connectivity will not be restored after the app is started elsewhere.

> AFAIK, fencing in two-nodes creates undetermined fence racing, and
> even it warrants only one node survive, it has no respect to if the
> app already runs on the node or not. So, the situation: one node
> already run app, while other lost its connection to the first, but
> not to the fence device. And win the race => kill current active =>
> app restarts. That's exactly what I am trying to avoid.

Pacemaker 2.0.4 and later supports priority-fencing-delay which allows
the node currently running the app to survive. The node not running the
app will wait the configured amount of time before trying to fence the
other node. Of course that does add more time to the recovery if the
node running the app is really gone.

> Therefore, quorum-based management seems better way for my exact
> case.

Unfortunately it's unsafe without fencing.

> Also, VM fencing rely on the idea that all VMs are inside a well-
> managed first layer cluster with its own quorum/fencing in place or
> separate nodes and VMs never moved between without careful fencing
> reconfig. In my case, I can't be sure about both points, I do not
> manage bottom layer. The max I can do is to request that every my VM
> (node, arbiter) located on different phy node and this may protect
> app from node failure and bring more freedom to get nodes off for
> service. Also, I have to limit overall VM count while there need for
> multiple app instances (VM pairs) running at once and one extra VM as
> arbiter for all them (2*N+1), but not 3-node for each instance (3*N)
> which could be more reasonable for my opinion, but not for one who
> allocate resources.

That does complicate the situation. Ideally there would be some way to
request the VM to be immediately destroyed (whether via fence_xvm, a
cloud provider API, or similar).
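Ken's recommendation in this subthread (two_node plus priority-fencing-delay) could be sketched as follows. Values and the resource name "app" are illustrative assumptions; note that priority-fencing-delay only has an effect if resources carry a priority:

```shell
# corosync.conf quorum section for a 2-node cluster without qdevice:
#   quorum {
#       provider: corosync_votequorum
#       two_node: 1        # implies wait_for_all: 1
#   }

# Give the app resource a priority so the node running it wins the race:
pcs resource update app meta priority=10

# The node without the prioritized resource waits before fencing its peer:
pcs property set priority-fencing-delay=15s
```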
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
Dear Ken,

First of all, there no fencing at all, it is off.

Thanks great for your suggestion, probably I need to think about this
way too, however, the project environment is not a good one to rely on
fencing and, moreover, we can't control the bottom layer a trusted way.

As I understand, fence_xvm just kills VM that not inside the quorum
part, or, in a case of two-host just one survive who shoot first. But
my goal is to keep the app from moves (e.g. restarts) as long as
possible. This means only two kinds of moves accepted: current host
fail (move to other with restart) or admin move (managed move at
certain time with restart). Any other troubles should NOT trigger app
down/restart. Except of total connectivity loss where no second node,
no arbiter => stop service.

AFAIK, fencing in two-nodes creates undetermined fence racing, and even
it warrants only one node survive, it has no respect to if the app
already runs on the node or not. So, the situation: one node already
run app, while other lost its connection to the first, but not to the
fence device. And win the race => kill current active => app restarts.
That's exactly what I am trying to avoid.

Therefore, quorum-based management seems better way for my exact case.

Also, VM fencing rely on the idea that all VMs are inside a
well-managed first layer cluster with its own quorum/fencing in place
or separate nodes and VMs never moved between without careful fencing
reconfig. In my case, I can't be sure about both points, I do not
manage bottom layer. The max I can do is to request that every my VM
(node, arbiter) located on different phy node and this may protect app
from node failure and bring more freedom to get nodes off for service.

Also, I have to limit overall VM count while there need for multiple
app instances (VM pairs) running at once and one extra VM as arbiter
for all them (2*N+1), but not 3-node for each instance (3*N) which
could be more reasonable for my opinion, but not for one who allocate
resources.

Please, mind all the above is from my common sense and quite poor
fundamental knowledge in clustering. And please be so kind to correct
me if I am wrong at any point.

Sincerely,

Alex

-----Original Message-----
From: Users On Behalf Of Ken Gaillot
Sent: Thursday, May 2, 2024 5:55 PM
To: Cluster Labs - All topics related to open-source clustering
welcomed
Subject: Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice
connection disrupted.

I don't see fencing times in here -- fencing is absolutely essential.

With the setup you describe, I would drop qdevice. With fencing, quorum
is not strictly required in a two-node cluster (two_node should be set
in corosync.conf). You can set priority-fencing-delay to reduce the
chance of simultaneous fencing. For VMs, you can use fence_xvm, which
is extremely quick.
Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.
I don't see fencing times in here -- fencing is absolutely essential. With the setup you describe, I would drop qdevice. With fencing, quorum is not strictly required in a two-node cluster (two_node should be set in corosync.conf). You can set priority-fencing-delay to reduce the chance of simultaneous fencing. For VMs, you can use fence_xvm, which is extremely quick. On Thu, 2024-05-02 at 02:56 +0300, ale...@pavlyuts.ru wrote: > Hi All, > > I am trying to build application-specific 2-node failover cluster > using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, > knet transport. > > For some reason I can’t use 3-node then I have to use qnetd+qdevice > 3.0.1. > > The main goal Is to protect custom app which is not cluster-aware by > itself. It is quite stateful, can’t store the state outside memory > and take some time to get converged with other parts of the system, > then the best scenario is “failover is a restart with same config”, > but each unnecessary restart is painful. So, if failover done, app > must retain on the backup node until it fail or admin push it back, > this work well with stickiness param. > > So, the goal is to detect serving node fail ASAP and restart it ASAP > on other node, using DRBD-synced config/data. ASAP means within 5-7 > sec, not 30 or more. > > I was tried different combinations of timing, and finally got > acceptable result within 5 sec for the best case. But! The case is > very unstable. > > My setup is a simple: two nodes on VM, and one more VM as arbiter > (qnetd), VMs under Proxmox and connected by net via external ethernet > switch to get closer to reality where “nodes VM” should locate as VM > on different PHY hosts in one rack. > > Then, it was adjusted for faster detect and failover. > In Corosync, left the token default 1000ms, but add > “heartbeat_failures_allowed: 3”, this made corosync catch node > failure for about 200ms (4x50ms heartbeat). 
> Both qnet and qdevice was run with net_heartbeat_interval_min=200 to > allow play with faster hearbeats and detects > Also, quorum.device.net has timeout: 500, sync_timeout: 3000, algo: > LMS. > > The testing is to issue “ate +%M:%S.%N && qm stop 201”, and then > check the logs on timestamp when the app started on the “backup” > host. And, when backup host boot again, the test is to check the logs > for the app was not restarted. > > Sometimes switchover work like a charm but sometimes it may delay for > dozens of seconds. > Sometimes when the primary host boot up again, secondary hold quorum > well and keep app running, sometimes quorum is lost first (and > pacemaker downs the app) and then found and pacemaker get app up > again, so unwanted restart happen. > > My investigation shows that the difference between “good” and “bad” > cases: > > Good case - all the logs clear and reasonable. > > Bad case: qnetd losing connection to second node just after the > connection to “failure” node detected and then it may take dozens of > seconds to restore it. All this time qdevice trying to connect qnetd > and fails: > > Example, host 192.168.100.1 send to failure, 100.2 is failover to: > > From qnetd: > May 01 23:30:39 arbiter corosync-qnetd[6338]: Client > :::192.168.100.1:60686 doesn't sent any message during 600ms. > Disconnecting > May 01 23:30:39 arbiter corosync-qnetd[6338]: Client > :::192.168.100.1:60686 (init_received 1, cluster bsc-test- > cluster, node_id 1) disconnect > May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: Client > 0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect > May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: server > going down 0 > >>> This is unexpected down, at normal scenario connection persist > May 01 23:30:40 arbiter corosync-qnetd[6338]: Client > :::192.168.100.2:32790 doesn't sent any message during 600ms. 
> Disconnecting
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
> :::192.168.100.2:32790 (init_received 1, cluster bsc-test-cluster,
> node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: Client
> 0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: server going
> down 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected
> May 01 23:30:56 arbiter corosync-qnetd[6338]: cluster name = bsc-test-cluster
> May 01 23:30:56 arbiter corosync-qnetd[6338]: tls started = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: tls peer certificate verified = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]: pointer = 0x55a6fc6363d0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: addr_str = :::192.168.100.2:57736
> May 01 23:30:56 arbiter corosync-qnetd[6338]: ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: cluster dump:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: client = :::192.168.100.2:57736, node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
> :::192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
> initial node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]: msg seq num = 99
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 1,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]: 1 node_id = 2,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-
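For reference, the two-node-without-qdevice layout suggested in the reply above would look roughly like this in corosync.conf. This is a sketch, not the poster's actual file: the cluster name is taken from the qnetd logs, the token value is the default mentioned in the thread, and the rest is illustrative.

```
totem {
    version: 2
    cluster_name: bsc-test-cluster
    transport: knet
    token: 1000              # default token timeout (ms), as in the thread
}

quorum {
    provider: corosync_votequorum
    two_node: 1              # each node keeps quorum across a clean split;
                             # fencing then decides who survives
    # note: no device {} section -- qdevice dropped as suggested above
}
```

With two_node: 1, votequorum implicitly enables wait_for_all, so both nodes must be seen at least once at startup before the cluster becomes quorate. On the Pacemaker side, the fencing delay mentioned above would be set with something like "pcs property set priority-fencing-delay=15s" (the value is illustrative).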
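The failure-injection step the poster describes can be scripted as below. This is a sketch of the procedure from the thread, not a verbatim file: "201" is the Proxmox VMID of the active node VM in this lab, and qm is Proxmox's VM management CLI, so the second command only works on the Proxmox host (shown commented out here).

```shell
#!/bin/sh
# Print a reference timestamp (minutes:seconds.nanoseconds) to correlate
# with cluster logs on the surviving node, then hard-stop the active VM.
date +%M:%S.%N
# qm stop 201   # Proxmox host only: VMID 201 is the active node in this lab
```

Grepping the surviving node's pacemaker log for entries at or after the printed timestamp then gives the actual detection-to-restart latency.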