[ClusterLabs] clutering rabbitmq

2024-06-20 Thread Damiano Giuliani
Hi, hope you guys can help me. We have built a RabbitMQ cluster using the Pacemaker resource called rabbitmq-cluster. Everything worked as expected until, for maintenance reasons, we shut down the entire cluster gracefully. At startup we noticed all the users and permissions were dropped and

[ClusterLabs] Detecting pacemaker version incompatibility during node rebuild

2024-06-13 Thread Madison Kelly
Hi all,   I'm working on a tool to rebuild a node that was lost. Given this scenario, upgrading the surviving node is not viable (at least, not until after the rebuild is completed and the services can be migrated).   I ran into a problem where 'pcs cluster start' exits with RC 0, and it

Re: [ClusterLabs] lost corosync/pacemaker pair

2024-06-13 Thread Ken Gaillot
On Thu, 2024-06-13 at 03:22 +, eli estrella wrote: > Hello. > I recently lost one of my LB servers running in a corosync/pacemaker > pair, would it be possible to clone the live one to create the lost > pair, changing the IP, hostname etc? > Thanks for any help you can provide. > Yes, that

[ClusterLabs] lost corosync/pacemaker pair

2024-06-13 Thread eli estrella
Hello. I recently lost one of my LB servers running in a corosync/pacemaker pair, would it be possible to clone the live one to create the lost pair, changing the IP, hostname etc? Thanks for any help you can provide.

Re: [ClusterLabs] PCS security vulnerability

2024-06-12 Thread Ondrej Mular
Hello Sathish, The CVEs you mentioned (CVE-2024-25126, CVE-2024-26141, CVE-2024-26146) were filed against the rack rubygem and not PCS itself. Therefore, the PCS upstream project is not directly impacted by these CVEs and doesn't require a change. However, PCS does rely on and uses the rack

[ClusterLabs] Pacemaker 2.1.8-rc2 released

2024-06-11 Thread Ken Gaillot
Hi all, The second release candidate for Pacemaker 2.1.8 is available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.8-rc2 This mainly fixes issues introduced in 2.1.8-rc1, with a couple of other memory fixes and one new feature: the PCMK_panic_action environment

[ClusterLabs] PCS security vulnerability

2024-06-11 Thread S Sathish S via Users
Hi Tomas/Team, In our application we are using pcs-0.10.16, and that module has vulnerabilities (CVE-2024-25126, CVE-2024-26141, CVE-2024-26146) reported and fixed in the RHSA errata below. Can you check and provide a fix in the latest PCS 0.10.x version upstream as well?

Re: [ClusterLabs] Need advice: deep pacemaker integration, best approach?

2024-06-10 Thread Klaus Wenninger
On Mon, Jun 10, 2024 at 6:12 PM Ken Gaillot wrote: > On Sun, 2024-06-09 at 23:13 +0300, ale...@pavlyuts.ru wrote: > > Hi All, > > > > We intend to integrate Pacemaker as failover engine into a very > > specific product. The handmade prototype works pretty well. It > > includes a couple of dozens

Re: [ClusterLabs] Need advice: deep pacemaker integration, best approach?

2024-06-10 Thread Ken Gaillot
On Sun, 2024-06-09 at 23:13 +0300, ale...@pavlyuts.ru wrote: > Hi All, > > We intend to integrate Pacemaker as failover engine into a very > specific product. The handmade prototype works pretty well. It > includes a couple of dozens coordinated resources to implement one > target application

[ClusterLabs] Need advice: deep pacemaker integration, best approach?

2024-06-09 Thread alexey
Hi All, We intend to integrate Pacemaker as a failover engine into a very specific product. The handmade prototype works pretty well. It includes a couple of dozen coordinated resources to implement one target application instance with its full network configuration. The prototype was made with

[ClusterLabs] Booth 1.2 is available at GitHub!

2024-06-07 Thread Jan Friesse
I am pleased to announce the latest maintenance release of Booth 1.2 is available immediately from GitHub at https://github.com/ClusterLabs/booth/releases as booth-1.2. Booth 1.2 implements support for changes in Pacemaker 3. It is known that older versions of Booth will not work properly

Re: [ClusterLabs] Is XSD definition of CIB available?

2024-06-06 Thread Ken Gaillot
On Thu, 2024-06-06 at 16:07 +0300, ale...@pavlyuts.ru wrote: > Hi all, > > Is there XSD scheme for Pacemaker CIB available as a document to see > the full XML syntax and definitions? We use RNG. The starting point is xml/pacemaker.rng in the repository, typically installed in

[ClusterLabs] Is XSD definition of CIB available?

2024-06-06 Thread alexey
Hi all, Is there an XSD schema for the Pacemaker CIB available as a document showing the full XML syntax and definitions? I tried searching the sources, but had no success. Thank you in advance! Alex

[ClusterLabs] kronosnet v1.29 released

2024-06-06 Thread Fabio M. Di Nitto
All, We are pleased to announce the general availability of kronosnet v1.29. kronosnet (or knet for short) is the new underlying network protocol for Linux HA components (corosync), that features the ability to use multiple links between nodes, active/active and active/passive link failover

Re: [ClusterLabs] Thoughts on priority-fence-delay

2024-06-03 Thread Ken Gaillot
On Mon, 2024-06-03 at 14:32 +0800, Mr.R via Users wrote: > Hi, all > > The priority-fence-delay attribute adds a delay to the node > running with a higher priority > in the primary and secondary resources, making the node fence take > effect later. Is it possible > to apply a similar

[ClusterLabs] Thoughts on priority-fence-delay

2024-06-03 Thread Mr.R via Users
Hi, all The priority-fence-delay attribute adds a delay to the node running with a higher priority in the primary and secondary resources, making the node fence take effect later. Is it possible to apply a similar mechanism to common types of resources, so that nodes running more

Re: [ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Александр Руденко
Klaus, yes, these constraints were defined by pcs after manual move (pcs resource move) and help about this action is clear: Usage: pcs resource move... move [destination node] [--master] [lifetime=] [--wait[=n]] Move the resource off the node it is currently running on by

Re: [ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Klaus Wenninger
On Tue, May 28, 2024 at 12:34 PM Александр Руденко wrote: > Andrei, thank you! > > I tried to find node's scores and have found location constraints for > these 3 resources: > > pcs constraint > Location Constraints: > Resource: fsmt-28085F00 > Enabled on: > Node: vdc16

Re: [ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Klaus Wenninger
On Tue, May 28, 2024 at 10:40 AM Александр Руденко wrote: > Hi! > > I can't understand this strange behavior, help me please. > > I have 3 nodes in my cluster, 4 vCPU/8GB RAM each. And about 70 groups, 2 > resources in each group. First one resource is our custom resource which > configures

Re: [ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Александр Руденко
Andrei, thank you! I tried to find node's scores and have found location constraints for these 3 resources: pcs constraint Location Constraints: Resource: fsmt-28085F00 Enabled on: Node: vdc16 (score:INFINITY) (role:Started) Resource: fsmt-41CC55C0 Enabled on: Node: vdc16

Re: [ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Andrei Borzenkov
On Tue, May 28, 2024 at 11:39 AM Александр Руденко wrote: > > Hi! > > I can't understand this strange behavior, help me please. > > I have 3 nodes in my cluster, 4 vCPU/8GB RAM each. And about 70 groups, 2 > resources in each group. First one resource is our custom resource which > configures

[ClusterLabs] Strange behavior of Resource stickiness

2024-05-28 Thread Александр Руденко
Hi! I can't understand this strange behavior, please help me. I have 3 nodes in my cluster, 4 vCPU/8GB RAM each, and about 70 groups with 2 resources in each group. The first resource is our custom resource, which configures a Linux VRF, and the second is a systemd unit. Everything works fine. We have

Re: [ClusterLabs] crm services not getting started after upgrading to snmp40

2024-05-23 Thread Klaus Wenninger
On Wed, May 22, 2024 at 5:16 PM ., Anoop wrote: > What is the "certain filesystem"? If cluster services require it, that > would explain why they can't start. > - Here we have btrfs and xfs filesystems. Yes cluster services require > these filesystem to be mounted. > What do the systemd

Re: [ClusterLabs] crm services not getting started after upgrading to snmp40

2024-05-22 Thread ., Anoop
What is the "certain filesystem"? If cluster services require it, that would explain why they can't start. - Here we have btrfs and xfs filesystems. Yes cluster services require these filesystem to be mounted. What do the systemd journal logs say about the filesystem and cluster services? Did

Re: [ClusterLabs] crm services not getting started after upgrading to snmp40

2024-05-22 Thread Ken Gaillot
On Wed, 2024-05-22 at 07:33 +, ., Anoop wrote: > Hello, > > We have HA setup with 2 node cluster using CRM. OS is Suse 15sp3. > After upgrading to snmp40, cluster services are not getting started > like pacemaker , corosync etc. After booting we have to manually > mount certain filesystem

Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-22 Thread Miroslav Lisik
Hi, see comments inline. On 5/17/24 17:46, Александр Руденко wrote: Miroslav, thank you! It helps me understand that it's not a configuration issue. BTW, is it okay to create new resources in parallel? Same as with parallel 'remove' operations it is not safe to do parallel 'create'

[ClusterLabs] crm services not getting started after upgrading to snmp40

2024-05-22 Thread ., Anoop
Hello, We have an HA setup with a 2-node cluster using CRM. The OS is SUSE 15 SP3. After upgrading to snmp40, cluster services such as pacemaker and corosync are not getting started. After booting we have to manually mount certain filesystem and start the crm services such as pacemaker. We have a

Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-20 Thread Александр Руденко
Alexey, thank you! Now it's clear to me. On Sat, May 18, 2024 at 02:11, wrote: > Hi Alexander, > > > > AFAIK, Pacemaker itself only have deal with XML-based configuration > database, shared across all cluster. Each time you call pcs or any other > tool it takes XML (or part of it) from pacemaker,

Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-17 Thread alexey
Hi Alexander, AFAIK, Pacemaker itself only deals with an XML-based configuration database, shared across the whole cluster. Each time you call pcs or any other tool, it takes the XML (or part of it) from Pacemaker, tweaks it, and then pushes it back to Pacemaker. Each time XML is pushed, Pacemaker
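The read-modify-write cycle described in this message is why unsynchronized parallel tool runs can lose changes. Below is a toy sketch in Python that assumes nothing about Pacemaker's real internals: the `shared_config` string, the `read_config`, `push_config` and `remove_group` helpers, and the element names are all hypothetical, and the model ignores whatever versioning or conflict handling the real CIB applies.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a shared XML configuration; not a real CIB.
shared_config = "<resources><group id='groupA'/><group id='groupB'/></resources>"


def read_config() -> ET.Element:
    """Each tool run starts from its own private copy of the shared XML."""
    return ET.fromstring(shared_config)


def push_config(tree: ET.Element) -> None:
    """Pushing replaces the shared XML wholesale (toy model, no conflict check)."""
    global shared_config
    shared_config = ET.tostring(tree, encoding="unicode")


def remove_group(tree: ET.Element, group_id: str) -> None:
    """Drop one group from a private copy."""
    for group in tree.findall("group"):
        if group.get("id") == group_id:
            tree.remove(group)


# Two "parallel" runs both read before either one pushes.
copy_1 = read_config()
copy_2 = read_config()
remove_group(copy_1, "groupA")
remove_group(copy_2, "groupB")
push_config(copy_1)  # shared config now lacks groupA
push_config(copy_2)  # overwrites it: groupA is back, only groupB is gone

remaining = [g.get("id") for g in ET.fromstring(shared_config).findall("group")]
print(remaining)
```

In this model the second push silently undoes the first removal, which is the kind of surprise that running the tool calls one after another avoids.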

Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-17 Thread Александр Руденко
Miroslav, thank you! It helps me understand that it's not a configuration issue. BTW, is it okay to create new resources in parallel? On timeline it looks like: pcs resource create resA1 --group groupA pcs resource create resB1 --group groupB resA1 Started pcs resource create resA2

Re: [ClusterLabs] Disabled resources after parallel removing of group

2024-05-17 Thread Miroslav Lisik
Hi Aleksandr! It is not safe to use `pcs resource remove` command in parallel because you run into the same issues as you already described. Processes run by remove command are not synchronized. Unfortunately, remove command does not support more than one resource yet. If you really need to

[ClusterLabs] Disabled resources after parallel removing of group

2024-05-17 Thread Александр Руденко
Hi! I am new to the Pacemaker world, and I, unfortunately, have problems with simple actions like group removal. Please help me understand where I'm wrong. For simplicity I will use standard resources like IPaddr2 (but we have this problem on any type of our custom resources). I have 5 groups

[ClusterLabs] Pacemaker 2.1.8-rc1 released

2024-05-15 Thread Ken Gaillot
Hi all, The first release candidate for Pacemaker 2.1.8 is available at: https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-2.1.8-rc1 This release has a number of new features and bugfixes, along with deprecations of some obscure (mostly undocumented and/or broken) features. Two

Re: [ClusterLabs] Mixing globally-unique with non-globally-unique resources

2024-05-14 Thread Ken Gaillot
On Tue, 2024-05-14 at 13:56 +0200, Jochen wrote: > I have the following use case: There are several cluster IP addresses > in the cluster. Each address is different, and multiple addresses can > be scheduled on the same node. This makes the address clone a > globally-unique clone as far as I

[ClusterLabs] Mixing globally-unique with non-globally-unique resources

2024-05-14 Thread Jochen
I have the following use case: There are several cluster IP addresses in the cluster. Each address is different, and multiple addresses can be scheduled on the same node. This makes the address clone a globally-unique clone as far as I understood. Then I have one service per node which manages

[ClusterLabs] FYI: clusterlabs.org planned outages

2024-05-07 Thread Ken Gaillot
Hi all, We are in the process of changing the OS on the servers used to run the clusterlabs.org sites. There is an expected outage of all services from 4AM to 9AM UTC this Thursday. If problems arise, there may be more outages later Thursday and Friday. -- Ken Gaillot

Re: [ClusterLabs] [EXT] Re: Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-07 Thread Windl, Ulrich
Hi! On " First of all, there no fencing at all, it is off." Maybe the default configuration should involve a fencing agent that sends an SMS like this to all admins: "Hey, get out of the bed and drive to work: nodeX has to be reset to continue working. You get this message, because you didn't

Re: [ClusterLabs] [EXT] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-07 Thread Windl, Ulrich
Hi! Just some personal comment: If an application isn't cluster-aware (has no provisions to run in a HA environment), you may improve its uptime using a cluster, but you cannot really make it "HA". Just consider the app needs manual intervention after it crashed... Kind regards, Ulrich From:

Re: [ClusterLabs] [EXT] Re: "pacemakerd: recover properly from Corosync crash" fix

2024-05-07 Thread Windl, Ulrich
Hi! I wonder: Shouldn’t node fencing step in? What do other nodes say about the situation? Regards, Ulrich From: Users On Behalf Of Klaus Wenninger Sent: Monday, April 22, 2024 11:06 AM To: NOLIBOS Christophe Cc: Cluster Labs - All topics related to open-source clustering welcomed Subject:

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-06 Thread Ken Gaillot
On Mon, 2024-05-06 at 10:05 -0500, Ken Gaillot wrote: > On Fri, 2024-05-03 at 16:18 +0300, ale...@pavlyuts.ru wrote: > > Hi, > > > > > > Thanks great for your suggestion, probably I need to think > > > > about > > > > this > > > > way too, however, the project environment is not a good one to > >

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-06 Thread Ken Gaillot
On Fri, 2024-05-03 at 16:18 +0300, ale...@pavlyuts.ru wrote: > Hi, > > > > Thanks great for your suggestion, probably I need to think about > > > this > > > way too, however, the project environment is not a good one to > > > rely on > > > fencing and, moreover, we can't control the bottom layer

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-06 Thread Klaus Wenninger
On Fri, May 3, 2024 at 8:59 PM wrote: > Hi, > > > > Also, I've done wireshark capture and found great mess in TCP, it > > > seems like connection between qdevice and qnetd really stops for some > > > time and packets won't deliver. > > > > Could you check UDP? I guess there is a lot of UDP

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-03 Thread alexey
Hi, > > Also, I've done wireshark capture and found great mess in TCP, it > > seems like connection between qdevice and qnetd really stops for some > > time and packets won't deliver. > > Could you check UDP? I guess there is a lot of UDP packets sent by corosync > which probably makes TCP to

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-03 Thread alexey
Hi, > > Thanks great for your suggestion, probably I need to think about this > > way too, however, the project environment is not a good one to rely on > > fencing and, moreover, we can't control the bottom layer a trusted > > way. > > That is a problem. A VM being gone is not the only possible

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-03 Thread Jan Friesse
Hi, some of your findings are really interesting. On 02/05/2024 01:56, ale...@pavlyuts.ru wrote: Hi All, I am trying to build application-specific 2-node failover cluster using ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, knet transport. ... Also, I've done

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-02 Thread Ken Gaillot
On Thu, 2024-05-02 at 22:56 +0300, ale...@pavlyuts.ru wrote: > Dear Ken, > > First of all, there no fencing at all, it is off. > > Thanks great for your suggestion, probably I need to think about this > way too, however, the project environment is not a good one to rely > on fencing and,

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-02 Thread alexey
Dear Ken, First of all, there is no fencing at all; it is off. Thanks a lot for your suggestion. I probably need to think about this approach too; however, the project environment is not a good one to rely on fencing and, moreover, we can't control the bottom layer in a trusted way. As I understand,

Re: [ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-02 Thread Ken Gaillot
I don't see fencing times in here -- fencing is absolutely essential. With the setup you describe, I would drop qdevice. With fencing, quorum is not strictly required in a two-node cluster (two_node should be set in corosync.conf). You can set priority-fencing-delay to reduce the chance of

[ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connenction disrupted.

2024-05-01 Thread alexey
Hi All, I am trying to build an application-specific 2-node failover cluster using Ubuntu 22, Pacemaker 2.1.2 + Corosync 3.1.6 and DRBD 9.2.9, with knet transport. For some reason I can't use a 3-node setup, so I have to use qnetd+qdevice 3.0.1. The main goal is to protect a custom app which is not

Re: [ClusterLabs] corosync service stopping

2024-04-30 Thread Alexander Eastwood via Users
Hi Honza I would say there is still a certain ambiguity in "shutdown by cfg request”, but I would argue that by not using the term “sysadmin” it at least doesn’t suggest that the shutdown was triggered by a human. So yes, I think that this phrasing is less misleading. Cheers, Alex > On

Re: [ClusterLabs] corosync service stopping

2024-04-29 Thread Jan Friesse
Hi, I will reply just to "sysadmin" question: On 26/04/2024 14:43, Alexander Eastwood via Users wrote: Dear Reid, ... Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was triggered by pacemaker? Isn’t this misleading? This basically means shutdown was triggered by

Re: [ClusterLabs] corosync service stopping

2024-04-26 Thread Alexander Eastwood via Users
Dear Reid, Thanks for the reply. Yes, lots of pacemaker logs - I have included just over a minute of them below and 5m of them as an attached .log file. The same behaviour occurs for a period of roughly 6 minutes before the corosync shutdown happens and can be summarised like so: Both cluster nodes

Re: [ClusterLabs] corosync service stopping

2024-04-25 Thread Reid Wahl
Any logs from Pacemaker? On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users wrote: > > Hi all, > > I’m trying to get a better understanding of why our cluster - or specifically > corosync.service - entered a failed state. Here are all of the relevant > corosync logs from this event,

[ClusterLabs] corosync service stopping

2024-04-25 Thread Alexander Eastwood via Users
Hi all, I’m trying to get a better understanding of why our cluster - or specifically corosync.service - entered a failed state. Here are all of the relevant corosync logs from this event, with the last line showing when I manually started the service again: Apr 23 11:06:10 [1295854]

Re: [ClusterLabs] Remote nodes in an opt-in cluster

2024-04-24 Thread Reid Wahl
Jochen, Just making sure you saw this, since you've replied to Andrei since then On Tue, Apr 23, 2024 at 12:39 AM Reid Wahl wrote: > > On Tue, Apr 23, 2024 at 12:03 AM Jochen wrote: > > > > When trying to add a remote node to an opt-in cluster, the cluster does not > > start the remote

[ClusterLabs] resource-agents v4.14.0

2024-04-24 Thread Oyvind Albrigtsen
ClusterLabs is happy to announce resource-agents v4.14.0. Source code is available at: https://github.com/ClusterLabs/resource-agents/releases/tag/v4.14.0 The most significant enhancements in this release are: - bugfixes and enhancements: - doc: writing-python-agents: add description of

Re: [ClusterLabs] Remote nodes in an opt-in cluster

2024-04-23 Thread Andrei Borzenkov
On 23.04.2024 19:40, Jochen wrote: On 23. Apr 2024, at 17:41, Andrei Borzenkov wrote: On 23.04.2024 10:02, Jochen wrote: When trying to add a remote node to an opt-in cluster, the cluster does not start the remote resource. When I change the cluster to opt-out the remote resource is

Re: [ClusterLabs] Remote nodes in an opt-in cluster

2024-04-23 Thread Jochen
> On 23. Apr 2024, at 17:41, Andrei Borzenkov wrote: > > On 23.04.2024 10:02, Jochen wrote: >> When trying to add a remote node to an opt-in cluster, the cluster does not >> start the remote resource. When I change the cluster to opt-out the remote >> resource is started. > > It's not

Re: [ClusterLabs] Remote nodes in an opt-in cluster

2024-04-23 Thread Andrei Borzenkov
On 23.04.2024 10:02, Jochen wrote: When trying to add a remote node to an opt-in cluster, the cluster does not start the remote resource. When I change the cluster to opt-out the remote resource is started. It's not clear what do you mean. Is "remote resource" the resource used to

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread Klaus Wenninger
On Tue, Apr 23, 2024 at 10:34 AM Klaus Wenninger wrote: > > > On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe < > christophe.noli...@thalesgroup.com> wrote: > >> Classified as: {OPEN} >> >> >> >> Other strange thing. >> >> On RHEL 7, corosync is restarted while the “Restart=on-failure » line

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread Klaus Wenninger
On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe < christophe.noli...@thalesgroup.com> wrote: > Classified as: {OPEN} > > > > Other strange thing. > > On RHEL 7, corosync is restarted while the “Restart=on-failure » line is > commented. > > I think also that something changed in the pacemaker

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-23 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} Another strange thing: on RHEL 7, corosync is restarted even though the “Restart=on-failure” line is commented out. I also think that something changed in the Pacemaker behavior, or somewhere else. From: Klaus Wenninger Sent: Monday, April 22, 2024 12:41 To: NOLIBOS

Re: [ClusterLabs] Remote nodes in an opt-in cluster

2024-04-23 Thread Reid Wahl
On Tue, Apr 23, 2024 at 12:03 AM Jochen wrote: > > When trying to add a remote node to an opt-in cluster, the cluster does not > start the remote resource. When I change the cluster to opt-out the remote > resource is started. > > I guess I have to add a location constraint to allow the cluster

[ClusterLabs] Remote nodes in an opt-in cluster

2024-04-23 Thread Jochen
When trying to add a remote node to an opt-in cluster, the cluster does not start the remote resource. When I change the cluster to opt-out the remote resource is started. I guess I have to add a location constraint to allow the cluster to schedule the resource. Is that correct? And if yes,

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread Klaus Wenninger
On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe < christophe.noli...@thalesgroup.com> wrote: > Classified as: {OPEN} > > > > You are right : the “Restart=on-failure” line is commented and so, > disabled per default. > > Uncommenting it resolves my issue. > Maybe pacemaker changed behavior

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} You are right: the “Restart=on-failure” line is commented out and so disabled by default. Uncommenting it resolves my issue. Thanks a lot. Christophe. From: Klaus Wenninger Sent: Monday, April 22, 2024 11:06 To: NOLIBOS Christophe Cc: Cluster Labs - All
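A quick way to tell whether a unit file has the setting active or merely commented out can be sketched in Python. The function name `restart_setting` and the sample unit text are made up for illustration; the samples are not the actual corosync.service shipped by any distribution, and the check ignores sections and drop-in overrides.

```python
from typing import Optional


def restart_setting(unit_text: str) -> Optional[str]:
    """Return the active Restart= value from unit-file text, or None.

    Lines starting with '#' or ';' are comments in systemd unit files,
    so a commented-out Restart= line does not count.
    """
    value = None
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, rest = line.partition("=")
        if sep and key.strip() == "Restart":
            value = rest.strip()  # a later assignment overrides an earlier one
    return value


# Hypothetical unit-file excerpts, for illustration only.
commented = "[Service]\nType=notify\n#Restart=on-failure\n"
enabled = "[Service]\nType=notify\nRestart=on-failure\n"

print(restart_setting(commented))
print(restart_setting(enabled))
```

On a live system, `systemctl show corosync -p Restart` reports the effective value, including any drop-ins, so it is the more reliable check there.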

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread Klaus Wenninger
On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe < christophe.noli...@thalesgroup.com> wrote: > Classified as: {OPEN} > > > > ‘kill -9’ command. > > Is it gracefully exit? > Looking as if corosync-unit-file has Restart=on-failure disabled per default. I'm not aware of another mechanism that

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-22 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} ‘kill -9’ command. Is that a graceful exit? From: Klaus Wenninger Sent: Thursday, April 18, 2024 20:17 To: NOLIBOS Christophe Cc: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] "pacemakerd: recover properly from

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
NOLIBOS Christophe wrote on Thu, Apr 18, 2024, 19:01: > Classified as: {OPEN} > > > > Hummm… my RHEL 8.8 OS has been hardened. > > I am wondering if the problem does not come from that. > > > > On another side, I get the same issue (i.e. corosync not restarted by > system) with Pacemaker

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} Hummm… my RHEL 8.8 OS has been hardened. I am wondering if the problem does not come from that. On another side, I get the same issue (i.e. corosync not restarted by system) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened). I’m checking. {OPEN}

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} So, the issue is on systemd? If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker 1.1.13-10, corosync is correctly restarted by systemd. [RHEL7 ~]# journalctl -f -- Logs begin at Wed 2024-01-03 13:15:41 UTC. -- Apr 18 16:26:55 - systemd[1]:

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger wrote: > > > On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe < > christophe.noli...@thalesgroup.com> wrote: > >> Classified as: {OPEN} >> >> >> >> Well… why do you say that « Well if corosync isn't there that this is >> to be expected and

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} Well… why do you say that “Well if corosync isn't there that this is to be expected and pacemaker won't recover corosync.”? In my mind, Corosync is managed by Pacemaker as any other cluster resource and the "pacemakerd: recover properly from Corosync crash" fix

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} [~]$ systemctl status corosync ● corosync.service - Corosync Cluster Engine Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled) Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago Docs:

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Klaus Wenninger
On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users < users@clusterlabs.org> wrote: > Classified as: {OPEN} > > I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64). > When I kill Corosync, no new corosync process is created and pacemaker is > in failure. > The only solution is to restart

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64). When I kill Corosync, no new corosync process is created and pacemaker is in failure. The only solution is to restart the pacemaker service. [~]$ pcs status Error: unable to get cib [~]$ [~]$systemctl status pacemaker ●

Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread Ken Gaillot
What OS are you using? Does it use systemd? What does happen when you kill Corosync? On Thu, 2024-04-18 at 13:13 +, NOLIBOS Christophe via Users wrote: > Classified as: {OPEN} > > Dear All, > > I have a question about the "pacemakerd: recover properly from > Corosync crash" fix

[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

2024-04-18 Thread NOLIBOS Christophe via Users
Classified as: {OPEN} Dear All, I have a question about the "pacemakerd: recover properly from Corosync crash" fix implemented in version 2.1.2. I have observed the issue when testing pacemaker version 2.0.5, just by killing the ‘corosync’ process: Corosync was not recovered. I am using

Re: [ClusterLabs] Is Pacemaker 2.1.7 compatible with crmsh-4.6.0 ?

2024-04-18 Thread Alejandro Imass
On Thu, Apr 18, 2024 at 3:02 AM Nicholas Yang via Users < users@clusterlabs.org> wrote: > > NVM, I found them in /usr/local/libexec/pacemaker/ > > > > I'm guessing a lot of my troubles have been related to the fact this is > not added to the path by the port pkg! > > This is not a problem. crmsh

Re: [ClusterLabs] Is Pacemaker 2.1.7 compatible with crmsh-4.6.0 ?

2024-04-17 Thread Nicholas Yang via Users
> NVM, I found them in /usr/local/libexec/pacemaker/ > > I'm guessing a lot of my troubles have been related to the fact this is not > added to the path by the port pkg! This is not a problem. crmsh adds libexec directories to PATH. See
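The idea of locating daemons through an extended search path can be sketched in Python. This is not crmsh's code: `find_daemon` and its `extra_dirs` parameter are hypothetical, and only the /usr/local/libexec/pacemaker directory and the pacemaker-controld name come from the thread.

```python
import os
import shutil
from typing import Iterable, Optional


def find_daemon(name: str, extra_dirs: Iterable[str]) -> Optional[str]:
    """Look for an executable in PATH plus some extra directories."""
    # Start from the current PATH entries, skipping empty ones.
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    return shutil.which(name, path=os.pathsep.join(dirs))


# Directory and daemon name as mentioned in the thread.
print(find_daemon("pacemaker-controld", ["/usr/local/libexec/pacemaker"]))
```

The call prints the full path when the daemon is found and `None` otherwise, so a missing binary is easy to tell apart from a directory that is simply not on PATH.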

[ClusterLabs] Is Pacemaker 2.1.7 compatible with crmsh-4.6.0 ?

2024-04-17 Thread Alejandro Imass
In python3.9/site-packages/crmsh/utils.py line 157: @memoize def pacemaker_controld(): return pacemaker_20_daemon("pacemaker-controld", "crmd") Neither of these programs are installed in my system: pkg info -l pacemaker2 | grep bin /usr/local/sbin/attrd_updater

[ClusterLabs] Not even ChatGPT can translate your example constraints from pcs to crmsh!

2024-04-17 Thread Alejandro Imass
I am trying to translate the last three rules of the Wiki example for Pg to crmsh, and I think I've tried everything in the doc and even tried with ChatGPT and it gave up. This is the Wiki example: https://projects.clusterlabs.org/w/cluster_administration/pgsql_replicated_cluster/ And these are

[ClusterLabs] Likely deprecation: ocf:pacemaker:o2cb resource agent

2024-04-17 Thread Ken Gaillot
Hi all, I just discovered today that the OCFS2 file system hasn't needed ocf_controld.pcmk in nearly a decade. I can't recall ever running across anyone using the ocf:pacemaker:o2cb agent that manages that daemon in a cluster. Unless anyone has a good reason to the contrary, we'll deprecate the

[ClusterLabs] resource-agents v4.14.0 rc1

2024-04-17 Thread Oyvind Albrigtsen
ClusterLabs is happy to announce resource-agents v4.14.0 rc1. Source code is available at: https://github.com/ClusterLabs/resource-agents/releases/tag/v4.14.0rc1 The most significant enhancements in this release are: - bugfixes and enhancements: - all agents: remove -S state/status that are

[ClusterLabs] FreeBSD Jails: WARNING: could not get..version, WARNING: list index

2024-04-16 Thread Alejandro Imass
Hi there, I have Corosync and Pacemaker working inside Bastille-driven jails by using VNET Jails, with a 3 node playground cluster and tested with IPaddr and everything seems to be working as expected. Environment: FreeBSD 14.0-RELEASE Bastille: 0.10.20231125 Pacemaker 2.1.6 Corosync 3.1.7

[ClusterLabs] FreeBSD Jails: WARNING: could not get..version, WARNING: list index

2024-04-16 Thread Alejandro Imass
Hi there, I have Corosync and Pacemaker working inside Bastille-driven jails by using VNET Jails, with a 3 node playground cluster and tested with IPaddr + pgsql everything seems mostly to be working as expected. Environment: FreeBSD 14.0-RELEASE Bastille: 0.10.20231125 Pacemaker 2.1.6 Corosync

Re: [ClusterLabs] Pcsd port change after cluster setup

2024-04-15 Thread Strahil Nikolov via Users
The interesting part is that after repeating the process (update the file, stop & start pcsd and pcs host auth) everything is working fine including the web UI. Best Regards, Strahil Nikolov On Mon, Apr 15, 2024 at 17:20, Strahil Nikolov via Users wrote: Hi All, I need your help to

[ClusterLabs] Pcsd port change after cluster setup

2024-04-15 Thread Strahil Nikolov via Users
Hi All, I need your help to change the pcsd port. I set the port in /etc/sysconfig/pcsd on all nodes: PCSD_PORT=3500 Yet, the daemon is not listening on it. Best Regards, Strahil Nikolov

[ClusterLabs] announcement: schedule for resource-agents release 4.14.0

2024-04-08 Thread Oyvind Albrigtsen
Hi, This is a tentative schedule for resource-agents v4.14.0: 4.14.0-rc1: Apr 17. 4.14.0: Apr 24. Full list of changes: https://github.com/ClusterLabs/resource-agents/compare/v4.13.0...main I've modified the corresponding milestones at: https://github.com/ClusterLabs/resource-agents/milestones

[ClusterLabs] fence-agents v4.14.0

2024-04-08 Thread Oyvind Albrigtsen
ClusterLabs is happy to announce fence-agents v4.14.0. The source code is available at: https://github.com/ClusterLabs/fence-agents/releases/tag/v4.14.0 The most significant enhancements in this release are: - new fence agents: - fence_ovm (Oracle VM) - bugfixes and enhancements: - all

[ClusterLabs] Potential deprecation: Node-attribute-based rules in operation meta-attributes

2024-04-02 Thread Ken Gaillot
Hi all, I have recently been cleaning up Pacemaker's rule code, and came across an inconsistency. Currently, meta-attributes may have rules with date/time-based expressions (the <date_expression> element). Node attribute expressions (the <expression> element) are not allowed, with the exception of operation meta-attributes

[ClusterLabs] Potential deprecation: Disabling schema validation for the CIB

2024-04-02 Thread Ken Gaillot
Hi all, Pacemaker uses an XML schema to prevent invalid syntax from being added to the CIB. The CIB's "validate-with" option is typically set to a version of this schema (like "pacemaker-3.9"). It is possible to explicitly disable schema validation by setting validate-with to "none". This is

Re: [ClusterLabs] PostgreSQL server timelines offset after promote

2024-04-02 Thread FLORAC Thierry
Hi Jehan-Guillaume, Thanks for your links, but I already had a look at them when using Pacemaker for the first time (except for the last one). Actually, I forgot to mention that PostgreSQL and pacemaker are run on a Debian GNU/Linux system (latest "Bookworm" release); on reboot, Pacemaker is

Re: [ClusterLabs] Fencing doesn't work with google-cloud-cli

2024-03-27 Thread Strahil Nikolov via Users
Hi All, I'm sorry for the previous post. Most probably it's not google-cloud-cli as even after downgrading, fencing still doesn't work all the time. Best Regards, Strahil Nikolov On Wednesday, 27 March 2024 at 15:39:06 GMT+2, Strahil Nikolov via Users wrote: Hi All, I'm

[ClusterLabs] Fencing doesn't work with google-cloud-cli

2024-03-27 Thread Strahil Nikolov via Users
Hi All, I'm starting this thread in order to warn you that if you updated recently and the 'google-cloud-cli' rpm was deployed (obsoletes 'google-cloud-sdk'), fencing won't work for you even though fence_gce and 'pcs stonith fence' report success. The VM stays in an odd status (right now I don't

Re: [ClusterLabs] PostgreSQL server timelines offset after promote

2024-03-27 Thread Jehan-Guillaume de Rorthais via Users
Bonjour Thierry, On Mon, 25 Mar 2024 10:55:06 + FLORAC Thierry wrote: > I'm trying to create a PostgreSQL master/slave cluster using streaming > replication and pgsqlms agent. Cluster is OK but my problem is this : the > master node is sometimes restarted for system operations, and the

[ClusterLabs] PostgreSQL server timelines offset after promote

2024-03-25 Thread FLORAC Thierry
Hi, I'm trying to create a PostgreSQL master/slave cluster using streaming replication and the pgsqlms agent. The cluster is OK, but my problem is this: the master node is sometimes restarted for system operations, and the slave is then promoted without any problem; after reboot, the old master is

Re: [ClusterLabs] resources cluster stoped with one node

2024-03-21 Thread Tomas Jelinek
On 20. 03. 24 at 23:56, Ken Gaillot wrote: On Wed, 2024-03-20 at 23:29 +0100, mierdatutis mi wrote: HI, I've configured a cluster of two nodes. When I start one node only I see that the resources won't start. Hi, In a two-node cluster, it is not safe to start resources until the nodes

Re: [ClusterLabs] resources cluster stoped with one node

2024-03-20 Thread Ken Gaillot
On Wed, 2024-03-20 at 23:29 +0100, mierdatutis mi wrote: > HI, > I've configured a cluster of two nodes. > When I start one node only I see that the resources won't start. Hi, In a two-node cluster, it is not safe to start resources until the nodes have seen each other once. Otherwise, there's

[ClusterLabs] resources cluster stoped with one node

2024-03-20 Thread mierdatutis mi
HI, I've configured a cluster of two nodes. When I start one node only I see that the resources won't start. [root@nodo1 ~]# pcs status --full Cluster name: mycluster Stack: corosync Current DC: nodo1 (1) (version 1.1.23-1.el7-9acf116022) - partition WITHOUT
