Re: [ClusterLabs] ocf:heartbeat:IPsrcaddr generated failed probe "[findif] failed" on inactive nodes
On 2/7/24 09:49, Oyvind Albrigtsen wrote:
> On 07/02/24 09:35 +0100, Adam Cecile wrote:
>> Hello,
>>
>> crm_mon shows these errors on my cluster, while everything is working as expected:
>>
>> Failed Resource Actions:
>>   * Default-Public-IPv4-Is-Default-Src probe on gw-3.domain returned 'error' ([findif] failed) at Wed Feb 7 08:00:22 2024 after 49ms
>>   * Default-Public-IPv4-Is-Default-Src probe on gw-1.domain returned 'error' ([findif] failed) at Wed Feb 7 08:00:22 2024 after 48ms
>>   * Default-Public-IPv4-Is-Default-Src probe on gw-2.domain returned 'error' ([findif] failed) at Wed Feb 7 08:02:31 2024 after 64ms
>>
>> I think pacemaker is unable to check the default source address on nodes which do not currently own the IP addresses, which is expected. However, Default-Public-IPv4-Is-Default-Src is +INF colocated with the public IP addresses, so I do not understand why such errors are generated on inactive nodes.
>>
>> Here are some config extracts:
>>
>> primitive Default-Public-IPv4 IPaddr2 \
>>     params cidr_netmask=24 ip=1.1.1.1 nic=eth1 \
>>     op monitor interval=30 \
>>     op start interval=0s timeout=20s \
>>     op stop interval=0s timeout=20s
>> primitive IPSEC-Public-IPv4 IPaddr2 \
>>     params cidr_netmask=24 ip=1.1.1.2 nic=eth1 \
>>     op monitor interval=30 \
>>     op start interval=0s timeout=20s \
>>     op stop interval=0s timeout=20s \
>>     meta target-role=Started
>> primitive Public-IPv4-Gateway Route \
>>     params destination="0.0.0.0/0" device=eth1 gateway=1.1.1.254 \
>>     op monitor interval=30 \
>>     op reload interval=0s timeout=20s \
>>     op start interval=0s timeout=20s \
>>     op stop interval=0s timeout=20s
>> primitive Default-Public-IPv4-Is-Default-Src IPsrcaddr \
>>     params cidr_netmask=24 ipaddress=1.1.1.1 \
>>     op monitor interval=30 \
>>     op start interval=0s timeout=20s \
>>     op stop interval=0s timeout=20s \
>>     meta target-role=Started
>> colocation colocation-Default-Public-IPv4-Is-Default-Src-Default-Public-IPv4-INFINITY +inf: Default-Public-IPv4-Is-Default-Src Default-Public-IPv4
>> colocation colocation-Default-Public-IPv4-Public-IPv4-Gateway-INFINITY +inf: Default-Public-IPv4 Public-IPv4-Gateway
>> colocation colocation-IPSEC-Public-IPv4-Public-IPv4-Gateway-INFINITY +inf: IPSEC-Public-IPv4 Public-IPv4-Gateway
>> order order-Default-Public-IPv4-Default-Public-IPv4-Is-Default-Src-mandatory Default-Public-IPv4:start Default-Public-IPv4-Is-Default-Src:start
>> order order-Default-Public-IPv4-IPSEC-Public-IPv4-mandatory Default-Public-IPv4:start IPSEC-Public-IPv4:start
>> order order-Default-Public-IPv4-Public-IPv4-Gateway-mandatory Default-Public-IPv4:start Public-IPv4-Gateway:start
>>
>> Any hint would be greatly appreciated!
>>
>> Best regards, Adam.
>
> This is the probe-action, which will check whether the resource has the expected status (e.g. stopped on nodes where it's not running). You can either set up another IP on the same network on the interface to avoid these errors, or setting cidr_netmask and interface might help. IPsrcaddr doesn't advertise the interface parameter, so you probably have to do e.g. "pcs resource update --force Default-Public-IPv4-Is-Default-Src nic=" to set it anyway, so findif will be able to use it.
>
> Oyvind Albrigtsen

Thanks! You got it, it was indeed related to that. I tried setting "nic" but it told me the parameter did not exist, so I guessed it was not possible. Is it normal to have to set a "private" attribute with --force?
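[Editor's sketch] A minimal illustration of the fix that emerged from this exchange. The eth1 value is borrowed from the IPaddr2 primitives above and is an assumption; adjust to the real interface:

    # IPsrcaddr does not advertise the nic parameter in its metadata,
    # so pcs only accepts it when forced:
    pcs resource update Default-Public-IPv4-Is-Default-Src nic=eth1 --force

With nic set, findif no longer has to guess the interface on nodes that do not currently hold the address, which is what made the probes fail.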
[ClusterLabs] ocf:heartbeat:IPsrcaddr generated failed probe "[findif] failed" on inactive nodes
Hello,

crm_mon shows these errors on my cluster, while everything is working as expected:

Failed Resource Actions:
  * Default-Public-IPv4-Is-Default-Src probe on gw-3.domain returned 'error' ([findif] failed) at Wed Feb 7 08:00:22 2024 after 49ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-1.domain returned 'error' ([findif] failed) at Wed Feb 7 08:00:22 2024 after 48ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-2.domain returned 'error' ([findif] failed) at Wed Feb 7 08:02:31 2024 after 64ms

I think pacemaker is unable to check the default source address on nodes which do not currently own the IP addresses, which is expected. However, Default-Public-IPv4-Is-Default-Src is +INF colocated with the public IP addresses, so I do not understand why such errors are generated on inactive nodes.

Here are some config extracts:

primitive Default-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.1 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s
primitive IPSEC-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.2 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started
primitive Public-IPv4-Gateway Route \
    params destination="0.0.0.0/0" device=eth1 gateway=1.1.1.254 \
    op monitor interval=30 \
    op reload interval=0s timeout=20s \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s
primitive Default-Public-IPv4-Is-Default-Src IPsrcaddr \
    params cidr_netmask=24 ipaddress=1.1.1.1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started
colocation colocation-Default-Public-IPv4-Is-Default-Src-Default-Public-IPv4-INFINITY +inf: Default-Public-IPv4-Is-Default-Src Default-Public-IPv4
colocation colocation-Default-Public-IPv4-Public-IPv4-Gateway-INFINITY +inf: Default-Public-IPv4 Public-IPv4-Gateway
colocation colocation-IPSEC-Public-IPv4-Public-IPv4-Gateway-INFINITY +inf: IPSEC-Public-IPv4 Public-IPv4-Gateway
order order-Default-Public-IPv4-Default-Public-IPv4-Is-Default-Src-mandatory Default-Public-IPv4:start Default-Public-IPv4-Is-Default-Src:start
order order-Default-Public-IPv4-IPSEC-Public-IPv4-mandatory Default-Public-IPv4:start IPSEC-Public-IPv4:start
order order-Default-Public-IPv4-Public-IPv4-Gateway-mandatory Default-Public-IPv4:start Public-IPv4-Gateway:start

Any hint would be greatly appreciated!

Best regards, Adam.
Re: [ClusterLabs] Beginner lost with promotable "group" design
On 1/17/24 16:33, Ken Gaillot wrote:
> On Wed, 2024-01-17 at 14:23 +0100, Adam Cécile wrote:
>> Hello,
>>
>> I'm trying to achieve the following setup with 3 hosts:
>> * One master gets a shared IP, then removes the default gw, adds another gw, and starts a service
>> * Two slaves should have none of these, but add a different default gw
>>
>> I managed quite easily to get the master workflow running with ordering constraints, but I don't understand how I should move forward with the slave configuration. I think I must create a promotable resource first, then assign my other resources a started/stopped setting depending on the promote status of the node. Is that correct? How do I create a promotable "placeholder" where I can later attach my existing resources?
>
> A promotable resource would be appropriate if the service should run on all nodes, but one node runs with a special setting. That doesn't sound like what you have.
>
> If you just need the service to run on one node, the shared IP, service, and both gateways can be regular resources. You just need colocation constraints between them:
> - colocate the service and the external default route with the shared IP
> - clone the internal default route and anti-colocate it with the shared IP
>
> If you want the service to be able to run even if the IP can't, make its colocation score finite (or colocate the IP and external route with the service). Ordering is separate. You can order the shared IP, service, and external route however needed. Alternatively, you can put the three of them in a group (which does both colocation and ordering, in sequence), and anti-colocate the cloned internal route with the group.
>
>> Sorry for the stupid question, but I really don't understand what type of elements I should create...
>>
>> Thanks in advance,
>> Regards, Adam.
>>
>> PS: Bonus question: should I use "pcs" or "crm"? Both commands seem to be equivalent, and documentation sometimes uses one or the other.
>
> They are equivalent -- it's a matter of personal preference (and often what choices your distro gives you).

Hello,

Thanks a lot for your suggestion. It seems I have something that works correctly now; the final configuration is:

pcs property set stonith-enabled=false
pcs resource create Internal-IPv4 ocf:heartbeat:IPaddr2 ip=10.0.0.254 nic=eth0 cidr_netmask=24 op monitor interval=30
pcs resource create Public-IPv4 ocf:heartbeat:IPaddr2 ip=1.2.3.4 nic=eth1 cidr_netmask=28 op monitor interval=30
pcs resource create Public-IPv4-Gateway ocf:heartbeat:Route destination=0.0.0.0/0 device=eth1 gateway=1.2.3.14 op monitor interval=30
pcs constraint colocation add Internal-IPv4 with Public-IPv4 score=+INFINITY
pcs constraint colocation add Public-IPv4 with Public-IPv4-Gateway score=+INFINITY
pcs constraint order Internal-IPv4 then Public-IPv4
pcs constraint order Public-IPv4 then Public-IPv4-Gateway
pcs resource create Internal-IPv4-Gateway ocf:heartbeat:Route destination=0.0.0.0/0 device=eth0 gateway=10.0.0.254 op monitor interval=30 --force
pcs resource clone Internal-IPv4-Gateway
pcs constraint colocation add Internal-IPv4-Gateway-clone with Internal-IPv4 score=-INFINITY
pcs stonith create vmfence fence_vmware_rest pcmk_host_map="gw-1:gw-1;gw-2:gw-2;gw-3:gw-3" ip=10.1.2.3 ssl=1 username=corosync@vsphere.local password=p4ssw0rd ssl_insecure=1
pcs property set stonith-enabled=true

Any comments regarding this configuration?

I have a quick question regarding fencing. I disconnected eth0 from gw-3 and the VM was restarted automatically, so I guess it's the fencing agent that kicked in.
However, I left the VM in that state (so it's seen as offline by the other nodes) and I thought it would end up being powered off for good. Instead, the fencing agent seems to be keeping it powered on. Is that expected?

Best regards, Adam.
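[Editor's sketch] A hedged illustration of Ken's finite-score remark above, using the resource names from this configuration; the score of 1000 is an arbitrary finite value, not a recommendation:

    # A finite score means "prefer together" rather than "must be together",
    # so Internal-IPv4 could keep running even where Public-IPv4 cannot.
    # The existing +INFINITY constraint would have to be removed first
    # (its id is shown by `pcs constraint --full`):
    pcs constraint colocation add Internal-IPv4 with Public-IPv4 score=1000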
Re: [ClusterLabs] Mutually exclusive resources ?
On 9/27/23 16:02, Ken Gaillot wrote:
> On Wed, 2023-09-27 at 15:42 +0300, Andrei Borzenkov wrote:
>> On Wed, Sep 27, 2023 at 3:21 PM Adam Cecile wrote:
>>> Hello,
>>>
>>> I'm struggling to understand if it's possible to create some kind of constraint to avoid two different resources running on the same host. Basically, I'd like to have floating IP "1" and floating IP "2" always assigned to DIFFERENT nodes. Is that possible?
>>
>> Sure, a negative colocation constraint.
>>
>>> Can you give me a hint?
>>
>> Using crmsh:
>>
>> colocation IP1-no-with-IP2 -inf: IP1 IP2
>>
>>> Thanks in advance, Adam.
>
> To elaborate, use -INFINITY if you want the IPs to *never* run on the same node, even if there are no other nodes available (meaning one of them has to stop). If you *prefer* that they run on different nodes, but want to allow them to run on the same node in a degraded cluster, use a finite negative score.

That's exactly what I tried to do:

crm configure primitive Freeradius systemd:freeradius.service op start interval=0 timeout=120 op stop interval=0 timeout=120 op monitor interval=60 timeout=100
crm configure clone Clone-Freeradius Freeradius
crm configure primitive Shared-IPv4-Cisco-ISE-1 IPaddr2 params ip=10.1.1.1 nic=eth0 cidr_netmask=24 meta migration-threshold=2 op monitor interval=60 timeout=30 resource-stickiness=50
crm configure primitive Shared-IPv4-Cisco-ISE-2 IPaddr2 params ip=10.1.1.2 nic=eth0 cidr_netmask=24 meta migration-threshold=2 op monitor interval=60 timeout=30 resource-stickiness=50
crm configure location Shared-IPv4-Cisco-ISE-1-Prefer-BRT Shared-IPv4-Cisco-ISE-1 50: infra-brt
crm configure location Shared-IPv4-Cisco-ISE-2-Prefer-BTZ Shared-IPv4-Cisco-ISE-2 50: infra-btz
crm configure colocation Shared-IPv4-Cisco-ISE-Different-Nodes -100: Shared-IPv4-Cisco-ISE-1 Shared-IPv4-Cisco-ISE-2

My hope is that IP1 stays on infra-brt and IP2 goes to infra-btz. I want them to keep running on the nodes they are on, so I also added stickiness. However, I really do not want them both running on the same node, so I added a colocation constraint with a higher negative score. Does it look good to you?
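[Editor's sketch] A hedged way to verify that the stickiness, location, and colocation scores interact as hoped: crm_simulate can dump the scheduler's allocation scores from the live cluster state (output format varies by Pacemaker version):

    # -L: read the live CIB, -s: show allocation scores
    crm_simulate -L -s | grep Shared-IPv4-Cisco-ISE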
[ClusterLabs] Mutually exclusive resources ?
Hello,

I'm struggling to understand if it's possible to create some kind of constraint to avoid two different resources running on the same host. Basically, I'd like to have floating IP "1" and floating IP "2" always assigned to DIFFERENT nodes. Is that possible? Can you give me a hint?

Thanks in advance, Adam.
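[Editor's sketch] For pcs users, a hedged equivalent of the crmsh one-liner given in the reply above (resource names IP1 and IP2 assumed):

    # -INFINITY keeps the two IPs on different nodes, even if that
    # means one of them has to stop:
    pcs constraint colocation add IP1 with IP2 score=-INFINITY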
Re: [ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?
On 2/16/23 20:54, Ken Gaillot wrote:
> On Thu, 2023-02-16 at 11:13 +0100, Adam Cecile wrote:
>> On 2/16/23 07:57, Ulrich Windl wrote:
>>> Adam Cecile wrote on 15.02.2023 at 10:49 in message:
>>>> Hello,
>>>>
>>>> Just had some issue with unexpected server behavior after reboot. This node was powered off, so the cluster was running fine with this tomcat9 resource running on a different machine. After powering on this node again, it briefly started tomcat before joining the cluster, then decided to stop it again. I'm not sure why. Here is the systemctl status tomcat9 on this host:
>>>>
>>>> tomcat9.service - Apache Tomcat 9 Web Application Server
>>>>      Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor preset: enabled)
>>>>     Drop-In: /etc/systemd/system/tomcat9.service.d
>>>>              └─override.conf
>>>>      Active: inactive (dead)
>>>>        Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
>>>>
>>>> Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
>>>> Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache Tomcat/9.0.43 (Debian)]
>>>> Feb 15 09:43:27 server tomcat9[1398]: [...]
>>>> Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application Server...
>>>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
>>>> Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application Server.
>>>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.
>>>>
>>>> You can see it is disabled and should NOT be started with the system; start/stop is under Corosync control. The systemd resource is defined like this:
>>>>
>>>> primitive tomcat9 systemd:tomcat9.service \
>>>>     op start interval=0 timeout=120 \
>>>>     op stop interval=0 timeout=120 \
>>>>     op monitor interval=60 timeout=100
>>>>
>>>> Any idea why this happened?
>>>
>>> Your journal (syslog) should tell you!
>>
>> Indeed, I overlooked it yesterday... But it says it's pacemaker that decided to start it:
>>
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync members[3]: 1 2 3
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync joined[2]: 1 2
>> Feb 15 09:43:26 server3 corosync[568]: [TOTEM ] A new membership (1.42d) was formed. Members joined: 1 2
>> Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] This node is within the primary component and will provide service.
>> Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Members[3]: 1 2 3
>> Feb 15 09:43:26 server3 corosync[568]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Quorum acquired
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Transition 0 aborted: Peer Halt
>> Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server1 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Another DC detected: server2 (op=noop)
>> Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_ELECTION -> S_RELEASE_DC
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Cancelling timer for action 12 (src=67)
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: No need to invoke the TE (A_TE_HALT) in state S_RELEASE_DC
>> Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server2 state is now member
>> Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_PENDING -> S_NOT_DC
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server1]: (unset) -> 2
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Detected another attribute writer (server2), starting new election
>> Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server2]: (unset) -> 2
>> Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
>> Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0 10.13.68.12:123
>> Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up resolver
>> => Feb 15 09:43:28 server3 pacemaker-controld[862]: notice: Result of start operation for tomcat9 on server3: ok
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 485 to 1397
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 485 to 1397
>> Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: Global data MTU changed to: 1397
>> => Feb 15 09:43:29 server3 pacemaker-controld[862]: notice: Requesting local execution of stop operation for tomcat9 on server3
>>
>> Any idea?

What do the logs on
Re: [ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?
On 2/16/23 07:57, Ulrich Windl wrote:
> Adam Cecile wrote on 15.02.2023 at 10:49 in message:
>> Hello,
>>
>> Just had some issue with unexpected server behavior after reboot. This node was powered off, so the cluster was running fine with this tomcat9 resource running on a different machine. After powering on this node again, it briefly started tomcat before joining the cluster, then decided to stop it again. I'm not sure why. Here is the systemctl status tomcat9 on this host:
>>
>> tomcat9.service - Apache Tomcat 9 Web Application Server
>>      Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor preset: enabled)
>>     Drop-In: /etc/systemd/system/tomcat9.service.d
>>              └─override.conf
>>      Active: inactive (dead)
>>        Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
>>
>> Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
>> Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache Tomcat/9.0.43 (Debian)]
>> Feb 15 09:43:27 server tomcat9[1398]: [...]
>> Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application Server...
>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
>> Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application Server.
>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.
>>
>> You can see it is disabled and should NOT be started with the system; start/stop is under Corosync control. The systemd resource is defined like this:
>>
>> primitive tomcat9 systemd:tomcat9.service \
>>     op start interval=0 timeout=120 \
>>     op stop interval=0 timeout=120 \
>>     op monitor interval=60 timeout=100
>>
>> Any idea why this happened?
>
> Your journal (syslog) should tell you!

Indeed, I overlooked it yesterday... But it says it's pacemaker that decided to start it:

Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync members[3]: 1 2 3
Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Sync joined[2]: 1 2
Feb 15 09:43:26 server3 corosync[568]: [TOTEM ] A new membership (1.42d) was formed. Members joined: 1 2
Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server1 state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server1 state is now member
Feb 15 09:43:26 server3 corosync[568]: [QUORUM] This node is within the primary component and will provide service.
Feb 15 09:43:26 server3 corosync[568]: [QUORUM] Members[3]: 1 2 3
Feb 15 09:43:26 server3 corosync[568]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Quorum acquired
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server1 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Node server2 state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]: notice: Node server2 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: Transition 0 aborted: Peer Halt
Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server1 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Another DC detected: server2 (op=noop)
Feb 15 09:43:26 server3 pacemaker-fenced[858]: notice: Node server2 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_ELECTION -> S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-controld[862]: warning: Cancelling timer for action 12 (src=67)
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: No need to invoke the TE (A_TE_HALT) in state S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-attrd[860]: notice: Node server2 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]: notice: State transition S_PENDING -> S_NOT_DC
Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server1]: (unset) -> 2
Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Detected another attribute writer (server2), starting new election
Feb 15 09:43:27 server3 pacemaker-attrd[860]: notice: Setting #attrd-protocol[server2]: (unset) -> 2
Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0 10.13.68.12:123
Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up resolver
=> Feb 15 09:43:28 server3 pacemaker-controld[862]: notice: Result of start operation for tomcat9 on server3: ok
Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]: [KNET ] pmtud: Global data MTU changed to: 1397
=> Feb 15 09:43:29 server3 pacemaker-controld[862]: notice: Requesting local execution of stop operation for tomcat9 on server3

Any idea?
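[Editor's sketch] A hedged aside, not from the thread: when Pacemaker issues an unexpected start like the one marked above, the scheduler input saved for that transition can be replayed offline to see its reasoning. The file name below is hypothetical; the matching transition number appears in the scheduler log lines:

    # Replay a saved scheduler input and show the computed actions and scores;
    # pe-input-42.bz2 is a placeholder for a real file under
    # /var/lib/pacemaker/pengine/ on the DC at the time.
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-42.bz2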
Re: [ClusterLabs] Systemd resource started on node after reboot before cluster is stable ?
On 2/15/23 12:16, Andrei Borzenkov wrote:
> On Wed, Feb 15, 2023 at 12:49 PM Adam Cecile wrote:
>> Hello,
>>
>> Just had some issue with unexpected server behavior after reboot. This node was powered off, so the cluster was running fine with this tomcat9 resource running on a different machine. After powering on this node again, it briefly started tomcat before joining the cluster, then decided to stop it again. I'm not sure why. Here is the systemctl status tomcat9 on this host:
>>
>> tomcat9.service - Apache Tomcat 9 Web Application Server
>>      Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor preset: enabled)
>>     Drop-In: /etc/systemd/system/tomcat9.service.d
>>              └─override.conf
>>      Active: inactive (dead)
>>        Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html
>>
>> Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
>> Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache Tomcat/9.0.43 (Debian)]
>> Feb 15 09:43:27 server tomcat9[1398]: [...]
>> Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application Server...
>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
>> Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application Server.
>> Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.
>>
>> You can see it is disabled and should NOT be started
>
> "Disabled" in systemd just means that the links in the [Install] section are not present. The unit may still be started by explicit request, or by an explicit dependency like Wants or Requires in another unit. Check "systemctl show -p WantedBy -p RequiredBy tomcat9.service".

Sadly, it is configured as it should be:

RequiredBy=
WantedBy=
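[Editor's sketch] A hedged extension of Andrei's check, not from the thread: besides Wants/Requires links, socket, timer, and path units can also start a service, and systemd exposes those through the TriggeredBy property:

    # Empty output for all three properties means no other unit
    # should be starting tomcat9 behind Pacemaker's back:
    systemctl show -p WantedBy -p RequiredBy -p TriggeredBy tomcat9.service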
[ClusterLabs] Systemd resource started on node after reboot before cluster is stable ?
Hello,

Just had some issue with unexpected server behavior after reboot. This node was powered off, so the cluster was running fine with this tomcat9 resource running on a different machine. After powering on this node again, it briefly started tomcat before joining the cluster, then decided to stop it again. I'm not sure why. Here is the systemctl status tomcat9 on this host:

tomcat9.service - Apache Tomcat 9 Web Application Server
     Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/tomcat9.service.d
             └─override.conf
     Active: inactive (dead)
       Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html

Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache Tomcat/9.0.43 (Debian)]
Feb 15 09:43:27 server tomcat9[1398]: [...]
Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application Server...
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application Server.
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.

You can see it is disabled and should NOT be started with the system; start/stop is under Corosync control. The systemd resource is defined like this:

primitive tomcat9 systemd:tomcat9.service \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=120 \
    op monitor interval=60 timeout=100

Any idea why this happened?

Best regards, Adam.
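[Editor's sketch] A hedged aside: as the replies above suggest, the journal shows who requested the start. Interleaving the unit's messages with Pacemaker's from the same boot usually makes the initiator obvious:

    # Interleave tomcat9 and pacemaker messages from the current boot;
    # a pacemaker-controld "Result of start operation" line right before
    # the tomcat start points at the cluster, not at systemd enablement:
    journalctl -b -u tomcat9.service -u pacemaker.service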
[ClusterLabs] Reload DNSMasq after IPAddr2 change ?
Hello,

I might be stupid, but I'm completely stuck with this requirement. We just figured out that the DNSMasq proxy does not work correctly after the shared IP address is moved from one host to another, because it does not listen on the new address. I need to issue a reload to DNSMasq to make it work again, but I failed to find anyone describing how to implement this, so I guess I'm completely wrong. Could someone explain how I'm supposed to handle such a situation?

Best regards, Adam.
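[Editor's sketch] A hedged alternative, not from the thread: rather than reloading dnsmasq on every failover, dnsmasq's own bind-dynamic option may remove the need entirely, since it binds individual addresses and picks up addresses that appear after startup (check that your dnsmasq version supports it):

    # /etc/dnsmasq.conf -- sketch under the assumption above:
    # bind-dynamic makes dnsmasq bind individual addresses and watch for
    # new ones via netlink, so a floating IP arriving later is served
    # without any reload or restart.
    bind-dynamic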
[ClusterLabs] Automatic recover from split brain ?
Hello,

I'm experiencing an issue with corosync/pacemaker running on Debian Buster. The cluster has three nodes running in VMware virtual machines, and it fails when VEEAM backs up a virtual machine (I know it does bad things, like freezing the VM completely for a few minutes to take a disk snapshot). My biggest issue is that once the backup has completed, the cluster stays in a split-brain state, and I'd like it to heal itself. Here is the current status.

One node is isolated:

Stack: corosync
Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition WITHOUT quorum
Last updated: Sat Aug 8 11:59:46 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on host1.domain.com

3 nodes configured
6 resources configured

Online: [ host2.domain.com ]
OFFLINE: [ host3.domain.com host1.domain.com ]

The two others see each other:

Stack: corosync
Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Aug 8 12:07:56 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on host1.domain.com

3 nodes configured
6 resources configured

Online: [ host3.domain.com host1.domain.com ]
OFFLINE: [ host2.domain.com ]

The problem is that one of the resources is a floating IP address which is currently assigned to two different hosts... Can you help me configure the cluster correctly so this cannot occur?

Thanks in advance, Adam.
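[Editor's sketch] A hedged aside, since the archive shows no resolution here: the standard guards against a dual-active IP are working fencing plus a quorum policy that stops resources in the non-quorate partition. The properties below are standard Pacemaker settings, shown with pcs purely as an illustration; a fence agent suited to the environment (e.g. fence_vmware_rest, as used elsewhere in this archive) would still have to be configured:

    # Resources in a partition without quorum are stopped, and nodes that
    # drop out are fenced instead of being left running dual-active:
    pcs property set no-quorum-policy=stop
    pcs property set stonith-enabled=true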