[ClusterLabs] [IMPORTANT] Upcoming changes to all ClusterLabs, corosync and knet repositories

2021-11-15 Thread Fabio M. Di Nitto

All,

Following current internet trends, all our GitHub repositories will
switch from the "master" branch to "main".


This email is not meant to spark any discussion around the merit of
those trends or to offend anyone; it's simple and technical, and I
would like to keep it that way.


This is a multi-day transition that will start on the 27th of Dec and
hopefully end by the 30th of Dec.


The repository changes will follow the published guidelines from GitHub:
https://github.com/github/renaming

and this will happen on the 27th itself, pretty fast and hopefully painless.
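
For local clones, GitHub's guideline boils down to something like the
following once a repository has been renamed (standard git commands,
run once per clone):

    # rename the local branch and re-point it at the renamed upstream
    git branch -m master main
    git fetch origin
    git branch -u origin/main main
    git remote set-head origin -a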

Upstream CI is the service most affected by this change. During those
days CI will be unavailable or bumpy.


Jenkins itself needs an internal transition to rename the "master"
node, plus updates to all the CI job configurations and all the helper
scripts.


There will certainly be disruptions in CI after the fact (for example,
all the CI-generated repo files will be renamed).


Details of those changes will be published after the transition is
completed. If you have any external services / scripts relying on CI
builds, be ready to adjust. I am not planning to add any
backward-compatibility layers.


Cheers
Fabio
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] sbd v1.5.1

2021-11-15 Thread Klaus Wenninger
Hi sbd developers & users!

Thanks to everybody for contributing to tests and
further development.


Changes since 1.5.0

- improve/fix cmdline handling
  - report the actual watchdog device specified with -w
  - tolerate and strip any leading spaces of commandline option values
  - sanitize numeric arguments
- if start-delay is enabled but not explicitly given, and msgwait can't
  be read from disk (diskless), use 2 * watchdog-timeout
- avoid using deprecated valloc for disk-io-buffers
- avoid frequent alloc/free of aligned buffers to prevent fragmentation
- fix memory-leak in one-time-allocations of sector-buffers
- fix AIO-API usage: properly destroy io-context
- improve/fix build environment
  - validate configure options for paths
  - remove unneeded complexity of configure.ac hierarchy
  - correctly derive package version from git (regression since 1.5.0)
  - make runstatedir configurable and derive from distribution


Regards,
Klaus
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Andrei Borzenkov
On Mon, Nov 15, 2021 at 3:32 PM S Rogers  wrote:
>>
>> The only solution here - as long as fencing node on external
>> connectivity loss is acceptable - is modifying ethmonitor RA to fail
>> monitor operation in this case.
>
> I was hoping to find a way to achieve the desired outcome without resorting 
> to a custom RA, but it does appear to be the only solution.
>

Well, looking at it from a different angle - you could use the knet
nozzle interface for replication, which means your postgres
connectivity is guaranteed to be the same as your pacemaker/corosync
connectivity.
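
A rough sketch of what that could look like in corosync.conf - note
that nozzle support must be compiled in, the exact directive and option
names should be checked against corosync.conf(5) on your build, and the
addresses below are placeholders:

    # hypothetical nozzle section: corosync brings up a tap interface
    # whose reachability follows knet, so replication traffic over this
    # address shares fate with cluster communication
    nozzle {
        name: nozzle0
        ipaddr: 192.168.100.1
        ipprefix: 24
    }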
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Corosync 3.1.6 is available at corosync.org!

2021-11-15 Thread Jan Friesse

I am pleased to announce the latest maintenance release of Corosync
3.1.6, available immediately from the GitHub releases section at
https://github.com/corosync/corosync/releases or from our website at
http://build.clusterlabs.org/corosync/releases/.

This release contains a MAJOR bugfix for the totem protocol that caused
loss or corruption of messages delivered during the recovery phase. It
is also important to pair this release with Kronosnet v1.23
(announcement:
https://lists.clusterlabs.org/pipermail/users/2021-November/029810.html)
and libqb 2.0.4 (announcement:
https://lists.clusterlabs.org/pipermail/users/2021-November/029811.html).


Our whole development team would like to thank the Proxmox VE
maintainer, Fabian Gruenbichler, for his extremely detailed bug reports
and reproducers, for collecting all the data from the affected Proxmox
VE users, and for his dedication over the past month to debugging,
testing, and working with us.


Another important feature is the addition of the
cancel_hold_on_retransmit option, which allows corosync to work in
environments where some packets are delayed more than others (caused
by various antivirus / IPS / IDS software).
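
For anyone hit by this, enabling it should be a one-line change in the
totem section of corosync.conf - a sketch using the option name as
given in this announcement (see corosync.conf(5) for the authoritative
spelling and default):

    totem {
        version: 2
        # allow the token hold to be cancelled while retransmits are
        # pending, for networks where some packets are delayed more
        # than others
        cancel_hold_on_retransmit: yes
    }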


Complete changelog for 3.1.6:

Christine Caulfield (1):
  cpghum: Allow to continue if corosync is restarted

Jan Friesse (4):
  totem: Add cancel_hold_on_retransmit config option
  logsys: Unlock config mutex on error
  totemsrp: Switch totempg buffers at the right time
  build: Add explicit dependency for used libraries

miharahiro (1):
  man: Fix consensus timeout

This upgrade is required.

Thanks and congratulations to everyone who contributed to this great
milestone.


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Announce] libqb 2.0.4 released

2021-11-15 Thread Christine Caulfield

We are pleased to announce the release of libqb 2.0.4

Source code is available at:
https://github.com/ClusterLabs/libqb/releases/

Please use the signed .tar.gz or .tar.xz files with the version number
in the name, rather than the GitHub-generated "Source code" ones.
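
If you want to verify a download, something like the following should
work, assuming a detached .asc signature is published next to the
tarball (the exact asset names are placeholders - adjust them to the
actual release files):

    # fetch the tarball plus its detached signature, then verify
    wget https://github.com/ClusterLabs/libqb/releases/download/v2.0.4/libqb-2.0.4.tar.xz
    wget https://github.com/ClusterLabs/libqb/releases/download/v2.0.4/libqb-2.0.4.tar.xz.asc
    gpg --verify libqb-2.0.4.tar.xz.asc libqb-2.0.4.tar.xz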

The most important fix in this release is that we no longer log errors 
inside the signal handler in loop_poll.c - this could cause an 
application hang in some circumstances.


There is also a new implementation of the timer list that should
improve performance when a large number of timers are active.


shortlog:

Chrissie Caulfield (3):
  doxygen2man: print structure descriptions (#443)
  Fix pthread returns (#444)
  poll: Don't log in a signal handler (#447)

Jan Friesse (1):
  Implement heap based timer list (#439)

orbea (1):
  build: Fix undefined pthread reference. (#440)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] kronosnet v1.23 released

2021-11-15 Thread Fabio M. Di Nitto

All,

We are pleased to announce the general availability of kronosnet v1.23.

This version contains MAJOR bug fixes and everybody is strongly 
encouraged to upgrade as soon as possible.


The defrag buffer fixes introduced in v1.22 revealed a long-standing
bug in corosync, 2 serious bugs in knet that had been lurking around
for approx. 11 years, and one bug in libqb (spotted during testing of
the fixes in this release). Please make sure to upgrade all of them as
soon as possible. Corosync and libqb releases are happening (or have
already happened) as this announcement is being sent out.


Our whole development team would like to thank the Proxmox VE
maintainer, Fabian Gruenbichler, for his extremely detailed bug reports
and reproducers, for collecting all the data from the affected Proxmox
VE users, and for his dedication over the past month to debugging,
testing, and working with us.


kronosnet (or knet for short) is the new underlying network protocol
for Linux HA components (corosync). It features the ability to use
multiple links between nodes, active/active and active/passive link
failover policies, automatic link recovery, FIPS-compliant encryption
(nss and/or openssl), automatic PMTUd, and in general better
performance compared to the old network protocol.


Highlights in this release:

* [URGENT] Fix packet sequence number initialization race
* [URGENT] Fix UDP link down detection when other nodes restart too fast
* [minor] Fix nss buffer boundaries
* [minor] Improve error logs to make it easier to debug improper setups
* [minor] Improve logging to not drop log messages on socket overload
* Fix build with musl/glibc on archlinux
* Enhance security build using annocheck / annobin
* Minor bug fixes and enhancements in the test suite

Known issues in this release:

* The long-standing SCTP problem with dynamic links (spotted while
  preparing v1.21) has not been addressed yet. The problem does NOT
  affect the corosync / High Availability use case.

The source tarballs can be downloaded here:

https://www.kronosnet.org/releases/

Upstream resources and contacts:

https://kronosnet.org/
https://github.com/kronosnet/kronosnet/
https://ci.kronosnet.org/
https://trello.com/kronosnet (TODO list and activities tracking)
https://goo.gl/9ZvkLS (google shared drive with presentations and diagrams)
IRC: #kronosnet on Libera
https://lists.kronosnet.org/mailman3/postorius/lists/users.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/devel.lists.kronosnet.org/
https://lists.kronosnet.org/mailman3/postorius/lists/commits.lists.kronosnet.org/

Cheers,
The knet developer team
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Fence node when network interface goes down

2021-11-15 Thread Ulrich Windl
>>> S Rogers  wrote on 15.11.2021 at 13:32 in message
<8c836815-8fac-115e-4eb0-c1e73933c...@gmail.com>:


...
> Unfortunately, in some situations this cluster will be deployed in a 
> completely isolated network so there may not even be a router that we 
> can use as a ping target, and we can't guarantee the presence of any 
> other system on the network that we could reliably use as a ping target.
...

You could configure the actual router's address as a secondary IP on
the interface while deploying, so the cluster will be happy.
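
For example (a minimal sketch - address, prefix and interface are
placeholder values for whatever the router would use at a connected
site):

    # assign the would-be router address as a secondary IP so the ping
    # target always answers (hypothetical values)
    ip addr add 192.168.1.1/24 dev eth0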




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread S Rogers


On 15/11/2021 12:03, Klaus Wenninger wrote:
> On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov wrote:
>> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger wrote:
>>> On Mon, Nov 15, 2021 at 10:37 AM S Rogers wrote:
>>>> I had thought about doing that, but the cluster is then dependent
>>>> on the external system, and if that external system was to go down
>>>> or become unreachable for any reason then it would falsely cause
>>>> the cluster to failover or worse it could even take the cluster
>>>> down completely, if the external system goes down and both nodes
>>>> cannot ping it.
>>>
>>> You wouldn't necessarily have to ban resources from nodes that can't
>>> reach the external network. It would be enough to make them prefer
>>> the location that has connection. So if both lose connection one side
>>> would still stay up.
>>> Not to depend on something really external you might use the
>>> router to your external network as ping target.
>>> In case of fencing - triggered by whatever - and a potential fence-race
>>
>> The problem here is that nothing really triggers fencing. What happens
>> is:
>
> Got that! Which is why I gave the hint how to prevent shutting down
> services with ping first.
> Taking care of what happens when nodes are fenced still makes sense.
> Imagine a fence-race where the node running the services loses, only to
> get the services moved back when it comes up again.
>
> Klaus

Thanks, I wasn't aware of priority-fencing-delay. While it doesn't solve 
this problem, I can still use it to improve the fencing behaviour of the 
cluster in general.


Unfortunately, in some situations this cluster will be deployed in a 
completely isolated network so there may not even be a router that we 
can use as a ping target, and we can't guarantee the presence of any 
other system on the network that we could reliably use as a ping target.




> - two postgres lose connection over external network, but cluster
> nodes retain connectivity over another network
> - postgres RA compares "latest timestamp" when selecting the best node
> to fail over to
> - primary postgres has better timestamp, so RA simply does not
> consider secondary as suitable for (automatic) failover
>
> The only solution here - as long as fencing node on external
> connectivity loss is acceptable - is modifying ethmonitor RA to fail
> monitor operation in this case.

I was hoping to find a way to achieve the desired outcome without 
resorting to a custom RA, but it does appear to be the only solution.


This may not be the right audience, but does anyone know if it would be
a viable change to add an additional parameter to the ethmonitor RA
that allows users to override the behaviour when the monitor operation
detects a downed interface? (i.e. a 'monitor_force_fail' parameter
that, when set to true, will cause the monitor operation to fail if it
determines the interface is down)
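
For illustration only, a rough sketch of what such a change might look
like in ethmonitor's monitor path - the parameter name
'monitor_force_fail' is the one proposed above, and the surrounding
function and variable names are simplified stand-ins for the real RA
internals:

    # hypothetical excerpt from a modified ethmonitor monitor action
    monitor_interface() {
        if [ "$link_status" -eq 0 ]; then      # interface detected down
            if ocf_is_true "$OCF_RESKEY_monitor_force_fail"; then
                # new behaviour: fail the monitor so on-fail=fence
                # (or failover) can kick in
                return $OCF_ERR_GENERIC
            fi
        fi
        # current behaviour: only record the state in the CIB, never fail
        return $OCF_SUCCESS
    }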


Being relatively new to pacemaker, I don't know whether this goes 
against RA conventions/practices.




> you might use the rather new feature priority-fencing-delay (give the
> node that is running valuable resources a benefit in the race) or go for
> fence_heuristics_ping (pseudo fence-resource that together with a
> fencing-topology prevents the node without access to a certain IP
> from fencing the other node).
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>
> Klaus
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/




Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Klaus Wenninger
On Mon, Nov 15, 2021 at 12:19 PM Andrei Borzenkov 
wrote:

> On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger 
> wrote:
> >
> >
> >
> > On Mon, Nov 15, 2021 at 10:37 AM S Rogers 
> wrote:
> >>
> >> I had thought about doing that, but the cluster is then dependent on the
> >> external system, and if that external system was to go down or become
> >> unreachable for any reason then it would falsely cause the cluster to
> >> failover or worse it could even take the cluster down completely, if the
> >> external system goes down and both nodes cannot ping it.
> >
> > You wouldn't necessarily have to ban resources from nodes that can't
> > reach the external network. It would be enough to make them prefer
> > the location that has connection. So if both lose connection  one side
> > would still stay up.
> > Not to depend on something really external you might use the
> > router to your external network as ping target.
> > In case of fencing - triggered by whatever - and a potential fence-race
>
> The problem here is that nothing really triggers fencing. What happens is:
>

Got that! Which is why I gave the hint how to prevent shutting down
services with ping first.
Taking care of what happens when nodes are fenced still makes sense.
Imagine a fence-race where the node running the services loses, only to
get the services moved back when it comes up again.

Klaus


>
> - two postgres lose connection over external network, but cluster
> nodes retain connectivity over another network
> - postgres RA compares "latest timestamp" when selecting the best node
> to fail over to
> - primary postgres has better timestamp, so RA simply does not
> consider secondary as suitable for (automatic) failover
>
> The only solution here - as long as fencing node on external
> connectivity loss is acceptable - is modifying ethmonitor RA to fail
> monitor operation in this case.
>
> > you might use the rather new feature priority-fencing-delay (give the
> > node that is running valuable resources a benefit in the race) or go for
> > fence_heuristics_ping (pseudo fence-resource that together with a
> > fencing-topology prevents the node without access to a certain IP
> > from fencing the other node).
> > https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> > https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> >
> > Klaus
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Andrei Borzenkov
On Mon, Nov 15, 2021 at 1:18 PM Klaus Wenninger  wrote:
>
>
>
> On Mon, Nov 15, 2021 at 10:37 AM S Rogers  wrote:
>>
>> I had thought about doing that, but the cluster is then dependent on the
>> external system, and if that external system was to go down or become
>> unreachable for any reason then it would falsely cause the cluster to
>> failover or worse it could even take the cluster down completely, if the
>> external system goes down and both nodes cannot ping it.
>
> You wouldn't necessarily have to ban resources from nodes that can't
> reach the external network. It would be enough to make them prefer
> the location that has connection. So if both lose connection  one side
> would still stay up.
> Not to depend on something really external you might use the
> router to your external network as ping target.
> In case of fencing - triggered by whatever - and a potential fence-race

The problem here is that nothing really triggers fencing. What happens is:

- two postgres lose connection over external network, but cluster
nodes retain connectivity over another network
- postgres RA compares "latest timestamp" when selecting the best node
to fail over to
- primary postgres has better timestamp, so RA simply does not
consider secondary as suitable for (automatic) failover

The only solution here - as long as fencing node on external
connectivity loss is acceptable - is modifying ethmonitor RA to fail
monitor operation in this case.

> you might use the rather new feature priority-fencing-delay (give the node
> that is running valuable resources a benefit in the race) or go for
> fence_heuristics_ping (pseudo fence-resource that together with a
> fencing-topology prevents the node without access to a certain IP
> from fencing the other node).
> https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
>
> Klaus
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Klaus Wenninger
On Mon, Nov 15, 2021 at 10:37 AM S Rogers  wrote:

> I had thought about doing that, but the cluster is then dependent on the
> external system, and if that external system was to go down or become
> unreachable for any reason then it would falsely cause the cluster to
> failover or worse it could even take the cluster down completely, if the
> external system goes down and both nodes cannot ping it.
>
You wouldn't necessarily have to ban resources from nodes that can't
reach the external network. It would be enough to make them prefer
the location that has connection. So if both lose connection  one side
would still stay up.
Not to depend on something really external you might use the
router to your external network as ping target.
In case of fencing - triggered by whatever - and a potential fence-race
you might use the rather new feature priority-fencing-delay (give the node
that is running valuable resources a benefit in the race) or go for
fence_heuristics_ping (pseudo fence-resource that together with a
fencing-topology prevents the node without access to a certain IP
from fencing the other node).
https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
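
For anyone wanting to try the above, a minimal sketch using pcs (the
resource names, scores and the 192.168.1.254 router address are
placeholders, and syntax can differ slightly between pcs versions):

    # ping clone: each node records reachability of the router in the
    # "pingd" node attribute
    pcs resource create ping-router ocf:pacemaker:ping \
        host_list=192.168.1.254 dampen=5s op monitor interval=10s clone

    # soft preference (not a ban) for nodes that can reach the target,
    # so one side stays up even if both lose the connection
    pcs constraint location my-services rule score=100 pingd gt 0

    # give the node running important resources a head start in a
    # fence-race
    pcs resource meta my-services priority=10
    pcs property set priority-fencing-delay=15s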

Klaus
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread S Rogers
I had thought about doing that, but the cluster is then dependent on the 
external system, and if that external system was to go down or become 
unreachable for any reason then it would falsely cause the cluster to 
failover or worse it could even take the cluster down completely, if the 
external system goes down and both nodes cannot ping it.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence node when network interface goes down

2021-11-15 Thread Strahil Nikolov via Users
Have you tried with ping and a location constraint for avoiding hosts
that cannot ping an external system?

Best Regards,
Strahil Nikolov

On Mon, Nov 15, 2021 at 0:07, S Rogers wrote:
Using on-fail=fence is what I initially tried, but it doesn't work 
unfortunately.

It looks like this is because the ethmonitor monitor operation won't 
actually fail when it detects a downed interface. It'll only fail if it 
is unable to update the CIB, as per this comment: 
https://github.com/ClusterLabs/resource-agents/blob/4824a7a83765a0596b7d9856d00102f53c8ce123/heartbeat/ethmonitor#L518

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/