[ClusterLabs] [IMPORTANT] Upcoming changes to all ClusterLabs, corosync and knet repositories
All,

Following current internet trends, all our GitHub repositories will switch from the "master" branch to "main". This email is not meant to spark any discussion around the merit of those trends or to offend anyone; it's simply technical and I would like to keep it that way.

This is a multi-day transition that will start on the 27th of Dec and should end by the 30th of Dec. The repository change will follow the guideline published by GitHub (https://github.com/github/renaming) and will happen on the 27th itself, fairly fast and hopefully painlessly.

Upstream CI is the service most affected by this change. During those days CI will be unavailable or bumpy. Jenkins itself needs an internal transition to change the "master" node name, plus updates to all the CI job configurations and all the helper scripts. There will certainly be disruptions in CI after the fact (for example, all the CI-generated repo files will be renamed). Details of those changes will be published after the transition is completed. If you have any external services or scripts relying on CI builds, be ready to adjust; I am not planning to add any backward-compatibility layers.

Cheers
Fabio

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
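For clone maintainers, the GitHub guideline linked above boils down to a couple of git commands. Here is a hedged sketch, demonstrated against a throwaway scratch repository rather than any real ClusterLabs repo:

```shell
# Sketch of the master -> main rename per https://github.com/github/renaming,
# run against a disposable scratch repository.
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git checkout -q -b master              # start on the old default branch
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git branch -m master main              # rename the branch locally
# In a real repository you would then publish it with:
#   git push -u origin main
# and existing clones would update with:
#   git branch -m master main
#   git fetch origin
#   git branch -u origin/main main
#   git remote set-head origin -a
git branch --show-current              # now reports "main"
```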
[ClusterLabs] sbd v1.5.1
Hi sbd developers & users!

Thanks to everybody for contributing to tests and further development.

Changes since 1.5.0:

- improve/fix cmdline handling
  - tell the actual watchdog device specified with -w
  - tolerate and strip any leading spaces of command-line option values
  - sanitize numeric arguments
- if start-delay is enabled, not explicitly given, and msgwait can't be read from disk (diskless), use 2 * watchdog-timeout
- avoid using deprecated valloc for disk-io buffers
- avoid frequent alloc/free of aligned buffers to prevent fragmentation
- fix memory leak in one-time allocations of sector buffers
- fix AIO-API usage: properly destroy io-context
- improve/fix build environment
  - validate configure options for paths
  - remove unneeded complexity of the configure.ac hierarchy
  - correctly derive package version from git (regression since 1.5.0)
  - make runstatedir configurable and derive it from the distribution

Regards,
Klaus
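As an illustration of the diskless start-delay fallback above: with a configuration like the hedged sketch below (values and the watchdog device path are illustrative), there is no shared disk to read msgwait from, so the effective start delay falls back to 2 * SBD_WATCHDOG_TIMEOUT = 10s:

```shell
# /etc/sysconfig/sbd (Debian/Ubuntu: /etc/default/sbd) - diskless setup sketch
SBD_WATCHDOG_DEV=/dev/watchdog      # the actual device used is now reported
SBD_WATCHDOG_TIMEOUT=5
SBD_DELAY_START=yes                 # no SBD_DEVICE -> delay = 2 * 5s = 10s
```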
Re: [ClusterLabs] Fence node when network interface goes down
On Mon, Nov 15, 2021 at 3:32 PM S Rogers wrote:
>> The only solution here - as long as fencing node on external
>> connectivity loss is acceptable - is modifying ethmonitor RA to fail
>> monitor operation in this case.
>
> I was hoping to find a way to achieve the desired outcome without resorting
> to a custom RA, but it does appear to be the only solution.

Well, looking at it from a different angle - you could use the knet nozzle interface for replication, which means your postgres connectivity is guaranteed to be the same as pacemaker/corosync connectivity.
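For context, a nozzle device is declared in corosync.conf. A minimal hedged sketch follows; the device name and addressing are illustrative, so check corosync.conf(5) for the authoritative syntax (the per-node address is derived from the nodeid):

```
nozzle {
    name: nozzle0
    ipaddr: 192.168.100.0
    ipprefix: 24
}
```

Traffic routed over the nozzle interface then shares fate with corosync/knet membership, which is exactly the property being suggested here.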
[ClusterLabs] Corosync 3.1.6 is available at corosync.org!
I am pleased to announce the latest maintenance release of Corosync, 3.1.6, available immediately from the GitHub release section at https://github.com/corosync/corosync/releases or our website at http://build.clusterlabs.org/corosync/releases/.

This release contains a MAJOR bugfix of the totem protocol which caused loss or corruption of messages delivered during the recovery phase. It is also important to pair this release with Kronosnet v1.23 (announcement: https://lists.clusterlabs.org/pipermail/users/2021-November/029810.html) and Libqb 2.0.4 (announcement: https://lists.clusterlabs.org/pipermail/users/2021-November/029811.html).

All our development team would like to thank the Proxmox VE maintainer, Fabian Gruenbichler, for the extremely detailed bug reports, reproducers, collecting all the data from the affected Proxmox VE users, and his dedication over the past month to debug, test and work with us.

Another important feature is the addition of the cancel_hold_on_retransmit option, which allows corosync to work in environments where some packets are delayed more than others (caused by various Antivirus / IPS / IDS software).

Complete changelog for 3.1.6:

Christine Caulfield (1):
  cpghum: Allow to continue if corosync is restarted

Jan Friesse (4):
  totem: Add cancel_hold_on_retransmit config option
  logsys: Unlock config mutex on error
  totemsrp: Switch totempg buffers at the right time
  build: Add explicit dependency for used libraries

miharahiro (1):
  man: Fix consensus timeout

This upgrade is required. Thanks and congratulations to all the people who contributed to this great milestone.
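The new option belongs in the totem section of corosync.conf. A hedged sketch follows; the cluster name is illustrative and the accepted value syntax should be confirmed against corosync.conf(5):

```
totem {
    version: 2
    cluster_name: example
    transport: knet
    cancel_hold_on_retransmit: yes
}
```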
[ClusterLabs] [Announce] libqb 2.0.4 released
We are pleased to announce the release of libqb 2.0.4.

Source code is available at: https://github.com/ClusterLabs/libqb/releases/

Please use the signed .tar.gz or .tar.xz files with the version number in the name rather than the github-generated "Source Code" ones.

The most important fix in this release is that we no longer log errors inside the signal handler in loop_poll.c - this could cause an application hang in some circumstances. There is also a new implementation of the timerlist that should improve performance when a large number of timers are active.

shortlog:

Chrissie Caulfield (3):
  doxygen2man: print structure descriptions (#443)
  Fix pthread returns (#444)
  poll: Don't log in a signal handler (#447)

Jan Friesse (1):
  Implement heap based timer list (#439)

orbea (1):
  build: Fix undefined pthread reference. (#440)
[ClusterLabs] kronosnet v1.23 released
All,

We are pleased to announce the general availability of kronosnet v1.23. This version contains MAJOR bug fixes and everybody is strongly encouraged to upgrade as soon as possible.

The defrag buffer fixes introduced in v1.22 revealed a long-standing bug in corosync, 2 serious bugs in knet that had been lurking for approximately 11 years, and one bug in libqb (spotted during testing of the fixes in this release). Please make sure to upgrade all of them as soon as possible. Corosync and libqb releases are happening (or have already happened) as this announcement is being sent out.

All our development team would like to thank the Proxmox VE maintainer, Fabian Gruenbichler, for the extremely detailed bug reports, reproducers, collecting all the data from the affected Proxmox VE users, and his dedication over the past month to debug, test and work with us.

kronosnet (or knet for short) is the new underlying network protocol for Linux HA components (corosync). It features the ability to use multiple links between nodes, active/active and active/passive link failover policies, automatic link recovery, FIPS-compliant encryption (nss and/or openssl), automatic PMTUd and, in general, better performance compared to the old network protocol.

Highlights in this release:

* [URGENT] Fix packet sequence number initialization race
* [URGENT] Fix UDP link down detection when other nodes restart too fast
* [minor] Fix nss buffer boundaries
* [minor] Improved error logs to make it easier to debug improper setups
* [minor] Improved logging to not drop log messages on socket overload
* Fix build with musl/glibc on Arch Linux
* Enhance security build using annocheck / annobin
* Minor bug fixes and enhancements in the test suite

Known issues in this release:

* The long-standing SCTP problem with dynamic links (spotted while preparing v1.21) has not been addressed yet. The problem does NOT affect the corosync / High Availability use case.
The source tarballs can be downloaded here: https://www.kronosnet.org/releases/

Upstream resources and contacts:

https://kronosnet.org/
https://github.com/kronosnet/kronosnet/
https://ci.kronosnet.org/
https://trello.com/kronosnet (TODO list and activities tracking)
https://goo.gl/9ZvkLS (google shared drive with presentations and diagrams)
IRC: #kronosnet on Libera
https://lists.kronosnet.org/mailman3/postorius/lists/users.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/devel.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/commits.lists.kronosnet.org/

Cheers,
The knet developer team
[ClusterLabs] Antw: [EXT] Re: Fence node when network interface goes down
>>> S Rogers wrote on 15.11.2021 at 13:32 in message
>>> <8c836815-8fac-115e-4eb0-c1e73933c...@gmail.com>:
...
> Unfortunately, in some situations this cluster will be deployed in a
> completely isolated network so there may not even be a router that we
> can use as a ping target, and we can't guarantee the presence of any
> other system on the network that we could reliably use as a ping target.
...

You could configure the actual router's address as a secondary IP on the interface while deploying, so the cluster will be happy.
Re: [ClusterLabs] Fence node when network interface goes down
On 15/11/2021 12:03, Klaus Wenninger wrote:
> On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov wrote:
>> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
>>> On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
>>>> I had thought about doing that, but the cluster is then dependent on the
>>>> external system, and if that external system was to go down or become
>>>> unreachable for any reason then it would falsely cause the cluster to
>>>> failover or worse it could even take the cluster down completely, if the
>>>> external system goes down and both nodes cannot ping it.
>>>
>>> You wouldn't necessarily have to ban resources from nodes that can't
>>> reach the external network. It would be enough to make them prefer
>>> the location that has connection. So if both lose connection one side
>>> would still stay up.
>>> Not to depend on something really external you might use the
>>> router to your external network as ping target.
>>> In case of fencing - triggered by whatever - and a potential fence-race
>>
>> The problem here is that nothing really triggers fencing. What happens, is
>
> Got that! Which is why I gave the hint how to prevent shutting down
> services with ping first. Taking care of what happens when nodes are
> fenced still makes sense. Imagine a fence-race where the node running
> services loses just to afterwards get the services moved back when it
> comes up again.
>
> Klaus

Thanks, I wasn't aware of priority-fencing-delay. While it doesn't solve this problem, I can still use it to improve the fencing behaviour of the cluster in general.

Unfortunately, in some situations this cluster will be deployed in a completely isolated network so there may not even be a router that we can use as a ping target, and we can't guarantee the presence of any other system on the network that we could reliably use as a ping target.
> - two postgres lose connection over external network, but cluster nodes
>   retain connectivity over another network
> - postgres RA compares "latest timestamp" when selecting the best node
>   to fail over to
> - primary postgres has better timestamp, so RA simply does not consider
>   secondary as suitable for (automatic) failover
>
> The only solution here - as long as fencing node on external
> connectivity loss is acceptable - is modifying ethmonitor RA to fail
> monitor operation in this case.

I was hoping to find a way to achieve the desired outcome without resorting to a custom RA, but it does appear to be the only solution.

This may not be the right audience, but does anyone know if it is a viable change to add an additional parameter to the ethmonitor RA that allows users to override the desired behaviour when the monitor operation fails? (i.e. a 'monitor_force_fail' parameter that, when set to true, will cause the monitor operation to fail if it determines the interface is down.) Being relatively new to pacemaker, I don't know whether this goes against RA conventions/practices.

> you might use the rather new feature priority-fencing-delay (give the node
> that is running valuable resources a benefit in the race) or go for
> fence_heuristics_ping (pseudo fence-resource that together with a
> fencing-topology prevents the node without access to a certain IP
> from fencing the other node).
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>
> Klaus
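The RA change being discussed could be sketched as a monitor action that returns a hard error when the interface is down. This is only an illustrative approximation: the 'monitor_force_fail' idea, the sysfs operstate check and the function below are assumptions, not ethmonitor's actual logic.

```shell
# Hypothetical sketch of a 'monitor_force_fail'-style check: fail the monitor
# action outright when the watched interface is not up, so Pacemaker's
# on-fail handling can react. NOT the real ethmonitor implementation.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

monitor() {
    # OCF_RESKEY_interface would carry the RA's 'interface' parameter
    state=$(cat "/sys/class/net/${OCF_RESKEY_interface}/operstate" 2>/dev/null)
    if [ "$state" = "up" ]; then
        return "$OCF_SUCCESS"
    else
        return "$OCF_ERR_GENERIC"   # monitor failure; with on-fail=fence this fences the node
    fi
}
```

With a monitor that fails like this, configuring the operation with on-fail=fence would translate loss of the interface into fencing of the node, which is the behaviour the thread is after.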
Re: [ClusterLabs] Fence node when network interface goes down
On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov wrote:
> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
>> On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
>>> I had thought about doing that, but the cluster is then dependent on the
>>> external system, and if that external system was to go down or become
>>> unreachable for any reason then it would falsely cause the cluster to
>>> failover or worse it could even take the cluster down completely, if the
>>> external system goes down and both nodes cannot ping it.
>>
>> You wouldn't necessarily have to ban resources from nodes that can't
>> reach the external network. It would be enough to make them prefer
>> the location that has connection. So if both lose connection one side
>> would still stay up.
>> Not to depend on something really external you might use the
>> router to your external network as ping target.
>> In case of fencing - triggered by whatever - and a potential fence-race
>
> The problem here is that nothing really triggers fencing. What happens, is

Got that! Which is why I gave the hint how to prevent shutting down services with ping first. Taking care of what happens when nodes are fenced still makes sense. Imagine a fence-race where the node running services loses just to afterwards get the services moved back when it comes up again.

Klaus

> - two postgres lose connection over external network, but cluster
>   nodes retain connectivity over another network
> - postgres RA compares "latest timestamp" when selecting the best node
>   to fail over to
> - primary postgres has better timestamp, so RA simply does not
>   consider secondary as suitable for (automatic) failover
>
> The only solution here - as long as fencing node on external
> connectivity loss is acceptable - is modifying ethmonitor RA to fail
> monitor operation in this case.
>> you might use the rather new feature priority-fencing-delay (give the node
>> that is running valuable resources a benefit in the race) or go for
>> fence_heuristics_ping (pseudo fence-resource that together with a
>> fencing-topology prevents the node without access to a certain IP
>> from fencing the other node).
>> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>>
>> Klaus
Re: [ClusterLabs] Fence node when network interface goes down
On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
> On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
>> I had thought about doing that, but the cluster is then dependent on the
>> external system, and if that external system was to go down or become
>> unreachable for any reason then it would falsely cause the cluster to
>> failover or worse it could even take the cluster down completely, if the
>> external system goes down and both nodes cannot ping it.
>
> You wouldn't necessarily have to ban resources from nodes that can't
> reach the external network. It would be enough to make them prefer
> the location that has connection. So if both lose connection one side
> would still stay up.
> Not to depend on something really external you might use the
> router to your external network as ping target.
> In case of fencing - triggered by whatever - and a potential fence-race

The problem here is that nothing really triggers fencing. What happens, is

- two postgres lose connection over external network, but cluster nodes retain connectivity over another network
- postgres RA compares "latest timestamp" when selecting the best node to fail over to
- primary postgres has better timestamp, so RA simply does not consider secondary as suitable for (automatic) failover

The only solution here - as long as fencing node on external connectivity loss is acceptable - is modifying ethmonitor RA to fail monitor operation in this case.

> you might use the rather new feature priority-fencing-delay (give the node
> that is running valuable resources a benefit in the race) or go for
> fence_heuristics_ping (pseudo fence-resource that together with a
> fencing-topology prevents the node without access to a certain IP
> from fencing the other node).
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>
> Klaus
Re: [ClusterLabs] Fence node when network interface goes down
On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
> I had thought about doing that, but the cluster is then dependent on the
> external system, and if that external system was to go down or become
> unreachable for any reason then it would falsely cause the cluster to
> failover or worse it could even take the cluster down completely, if the
> external system goes down and both nodes cannot ping it.

You wouldn't necessarily have to ban resources from nodes that can't reach the external network. It would be enough to make them prefer the location that has a connection. So if both lose connection, one side would still stay up. Not to depend on something really external, you might use the router to your external network as the ping target.

In case of fencing - triggered by whatever - and a potential fence-race, you might use the rather new feature priority-fencing-delay (give the node that is running valuable resources a benefit in the race) or go for fence_heuristics_ping (a pseudo fence-resource that, together with a fencing-topology, prevents the node without access to a certain IP from fencing the other node).

https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py

Klaus
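In pcs terms the two suggestions might look roughly like this. This is a hedged sketch only: the resource name my-postgres, the node names, the delay value and the fencing device real-fence are all made up for illustration.

```shell
# Fence-race mitigation: give the node currently running high-priority
# resources a head start (priority-fencing-delay, Pacemaker 2.0.4+).
pcs resource update my-postgres meta priority=10
pcs property set priority-fencing-delay=15s

# Ping-heuristics pseudo fence device, placed before the real fencing
# device in the fencing topology, so a node that cannot reach the given
# IP refrains from fencing its peer.
pcs stonith create ping-check fence_heuristics_ping ping_targets=192.168.1.1
pcs stonith level add 1 node1 ping-check real-fence
pcs stonith level add 1 node2 ping-check real-fence
```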
Re: [ClusterLabs] Fence node when network interface goes down
I had thought about doing that, but the cluster is then dependent on the external system, and if that external system were to go down or become unreachable for any reason, it would falsely cause the cluster to fail over or, worse, it could even take the cluster down completely, if the external system goes down and both nodes cannot ping it.
Re: [ClusterLabs] Fence node when network interface goes down
Have you tried ping and a location constraint to avoid hosts that cannot ping an external system?

Best Regards,
Strahil Nikolov

On Mon, Nov 15, 2021 at 0:07, S Rogers wrote:
> Using on-fail=fence is what I initially tried, but it doesn't work
> unfortunately. It looks like this is because the ethmonitor monitor
> operation won't actually fail when it detects a downed interface. It'll
> only fail if it is unable to update the CIB, as per this comment:
> https://github.com/ClusterLabs/resource-agents/blob/4824a7a83765a0596b7d9856d00102f53c8ce123/heartbeat/ethmonitor#L518
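For reference, the ping + location-constraint approach might be sketched like this in pcs syntax. It is a sketch under assumptions: the resource name my-postgres, the clone name and the ping target are illustrative, and the node attribute name pingd is ocf:pacemaker:ping's default.

```shell
# Cloned ping resource: each node maintains a 'pingd' node attribute
# reflecting how many of the listed hosts it can reach.
pcs resource create ping-gw ocf:pacemaker:ping host_list=192.168.1.1 \
    op monitor interval=10s clone

# Keep the service off any node whose pingd attribute is missing or zero,
# i.e. any node that cannot reach the external system.
pcs constraint location my-postgres rule score=-INFINITY \
    not_defined pingd or pingd lte 0
```

As noted earlier in the thread, both nodes losing the ping target would leave the constraint satisfied nowhere, which is exactly the failure mode S Rogers is worried about.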