Re: [ClusterLabs] ocf:heartbeat:IPsrcaddr generated failed probe "[findif] failed" on inactive nodes

2024-02-07 Thread Adam Cecile

On 2/7/24 09:49, Oyvind Albrigtsen wrote:

On 07/02/24 09:35 +0100, Adam Cecile wrote:

Hello,


crm_mon shows these errors on my cluster, while everything is working 
as expected:


Failed Resource Actions:
  * Default-Public-IPv4-Is-Default-Src probe on gw-3.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:00:22 2024 after 49ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-1.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:00:22 2024 after 48ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-2.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:02:31 2024 after 64ms


I think Pacemaker is unable to check the default source address on 
nodes which do not currently own the IP addresses, which is expected. 
However, Default-Public-IPv4-Is-Default-Src is +INF colocated with the 
public IP addresses, so I do not understand why such errors are 
generated on inactive nodes.

This is the probe action, which checks whether the resource has the
expected status (e.g. stopped on nodes where it's not running).

You can either set up another IP on the same network on the interface
to avoid these errors, or setting cidr_netmask and interface might help.

IPsrcaddr doesn't advertise the interface parameter, so you probably
have to do e.g. "pcs resource update -f
Default-Public-IPv4-Is-Default-Src nic=" to set it anyway, so that
findif will be able to use it.

Thanks! You got it, it was indeed related to that. I had tried setting 
"nic" but it told me the parameter did not exist, so I assumed it was 
not possible.
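
For the record, the resulting resource definition might look like this 
in crm syntax (only a sketch, assuming eth1 is the right public 
interface, as on the IPaddr2 resources below):

primitive Default-Public-IPv4-Is-Default-Src IPsrcaddr \
    params cidr_netmask=24 ipaddress=1.1.1.1 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started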


Is it normal to use a "private" attribute with --force?



Oyvind Albrigtsen


Here are some config extracts:

primitive Default-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.1 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s

primitive IPSEC-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.2 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started

primitive Public-IPv4-Gateway Route \
    params destination="0.0.0.0/0" device=eth1 gateway=1.1.1.254 \
    op monitor interval=30 \
    op reload interval=0s timeout=20s \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s

primitive Default-Public-IPv4-Is-Default-Src IPsrcaddr \
    params cidr_netmask=24 ipaddress=1.1.1.1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started

colocation 
colocation-Default-Public-IPv4-Is-Default-Src-Default-Public-IPv4-INFINITY 
+inf: Default-Public-IPv4-Is-Default-Src Default-Public-IPv4
colocation 
colocation-Default-Public-IPv4-Public-IPv4-Gateway-INFINITY +inf: 
Default-Public-IPv4 Public-IPv4-Gateway
colocation colocation-IPSEC-Public-IPv4-Public-IPv4-Gateway-INFINITY 
+inf: IPSEC-Public-IPv4 Public-IPv4-Gateway


order 
order-Default-Public-IPv4-Default-Public-IPv4-Is-Default-Src-mandatory 
Default-Public-IPv4:start Default-Public-IPv4-Is-Default-Src:start
order order-Default-Public-IPv4-IPSEC-Public-IPv4-mandatory 
Default-Public-IPv4:start IPSEC-Public-IPv4:start
order order-Default-Public-IPv4-Public-IPv4-Gateway-mandatory 
Default-Public-IPv4:start Public-IPv4-Gateway:start



Any hint would be greatly appreciated!

Best regards, Adam.





[ClusterLabs] ocf:heartbeat:IPsrcaddr generated failed probe "[findif] failed" on inactive nodes

2024-02-07 Thread Adam Cecile

Hello,


crm_mon shows these errors on my cluster, while everything is working as 
expected:


Failed Resource Actions:
  * Default-Public-IPv4-Is-Default-Src probe on gw-3.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:00:22 2024 after 49ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-1.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:00:22 2024 after 48ms
  * Default-Public-IPv4-Is-Default-Src probe on gw-2.domain returned 
'error' ([findif] failed) at Wed Feb  7 08:02:31 2024 after 64ms


I think Pacemaker is unable to check the default source address on nodes 
which do not currently own the IP addresses, which is expected. However, 
Default-Public-IPv4-Is-Default-Src is +INF colocated with the public IP 
addresses, so I do not understand why such errors are generated on 
inactive nodes.


Here are some config extracts:

primitive Default-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.1 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s

primitive IPSEC-Public-IPv4 IPaddr2 \
    params cidr_netmask=24 ip=1.1.1.2 nic=eth1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started

primitive Public-IPv4-Gateway Route \
    params destination="0.0.0.0/0" device=eth1 gateway=1.1.1.254 \
    op monitor interval=30 \
    op reload interval=0s timeout=20s \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s

primitive Default-Public-IPv4-Is-Default-Src IPsrcaddr \
    params cidr_netmask=24 ipaddress=1.1.1.1 \
    op monitor interval=30 \
    op start interval=0s timeout=20s \
    op stop interval=0s timeout=20s \
    meta target-role=Started

colocation 
colocation-Default-Public-IPv4-Is-Default-Src-Default-Public-IPv4-INFINITY 
+inf: Default-Public-IPv4-Is-Default-Src Default-Public-IPv4
colocation colocation-Default-Public-IPv4-Public-IPv4-Gateway-INFINITY 
+inf: Default-Public-IPv4 Public-IPv4-Gateway
colocation colocation-IPSEC-Public-IPv4-Public-IPv4-Gateway-INFINITY 
+inf: IPSEC-Public-IPv4 Public-IPv4-Gateway


order 
order-Default-Public-IPv4-Default-Public-IPv4-Is-Default-Src-mandatory 
Default-Public-IPv4:start Default-Public-IPv4-Is-Default-Src:start
order order-Default-Public-IPv4-IPSEC-Public-IPv4-mandatory 
Default-Public-IPv4:start IPSEC-Public-IPv4:start
order order-Default-Public-IPv4-Public-IPv4-Gateway-mandatory 
Default-Public-IPv4:start Public-IPv4-Gateway:start



Any hint would be greatly appreciated!

Best regards, Adam.


Re: [ClusterLabs] Beginner lost with promotable "group" design

2024-01-31 Thread Adam Cecile

On 1/17/24 16:33, Ken Gaillot wrote:

On Wed, 2024-01-17 at 14:23 +0100, Adam Cécile wrote:

Hello,


I'm trying to achieve the following setup with 3 hosts:

* One master gets a shared IP, then removes the default gw, adds
another gw, and starts a service

* Two slaves should have none of those, but add a different default gw

I managed quite easily to get the master workflow running with
ordering constraints, but I don't understand how I should move forward
with the slave configuration.

I think I must create a promotable resource first, then assign my
other resources a started/stopped setting depending on the promotion
status of the node. Is that correct? How do I create a promotable
"placeholder" where I can later attach my existing resources?

A promotable resource would be appropriate if the service should run on
all nodes, but one node runs with a special setting. That doesn't sound
like what you have.

If you just need the service to run on one node, the shared IP,
service, and both gateways can be regular resources. You just need
colocation constraints between them:

- colocate service and external default route with shared IP
- clone the internal default route and anti-colocate it with shared IP

If you want the service to be able to run even if the IP can't, make
its colocation score finite (or colocate the IP and external route with
the service).

Ordering is separate. You can order the shared IP, service, and
external route however needed. Alternatively, you can put the three of
them in a group (which does both colocation and ordering, in sequence),
and anti-colocate the cloned internal route with the group.
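
For illustration, the group variant described above might look roughly
like this with pcs (only a sketch; the group name Public-Stack and the
resource names Shared-IP, Service, External-Route and
Internal-Route-clone are placeholders, not taken from an actual
configuration):

pcs resource group add Public-Stack Shared-IP Service External-Route
pcs constraint colocation add Internal-Route-clone with Public-Stack score=-INFINITY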


Sorry for the stupid question, but I really don't understand what type
of elements I should create...


Thanks in advance,

Regards, Adam.


PS: Bonus question: should I use "pcs" or "crm"? Both commands seem to
be equivalent, and the documentation sometimes uses one or the other.


They are equivalent -- it's a matter of personal preference (and often
what choices your distro gives you).


Hello,

Thanks a lot for your suggestion; it seems I have something that works 
correctly now. The final configuration is:



pcs property set stonith-enabled=false

pcs resource create Internal-IPv4 ocf:heartbeat:IPaddr2 ip=10.0.0.254 
nic=eth0 cidr_netmask=24 op monitor interval=30
pcs resource create Public-IPv4 ocf:heartbeat:IPaddr2 ip=1.2.3.4 
nic=eth1 cidr_netmask=28 op monitor interval=30
pcs resource create Public-IPv4-Gateway ocf:heartbeat:Route 
destination=0.0.0.0/0 device=eth1 gateway=1.2.3.14 op monitor interval=30


pcs constraint colocation add Internal-IPv4 with Public-IPv4 score=+INFINITY
pcs constraint colocation add Public-IPv4 with Public-IPv4-Gateway 
score=+INFINITY


pcs constraint order Internal-IPv4 then Public-IPv4
pcs constraint order Public-IPv4 then Public-IPv4-Gateway

pcs resource create Internal-IPv4-Gateway ocf:heartbeat:Route 
destination=0.0.0.0/0 device=eth0 gateway=10.0.0.254 op monitor 
interval=30 --force

pcs resource clone Internal-IPv4-Gateway

pcs constraint colocation add Internal-IPv4-Gateway-clone with 
Internal-IPv4 score=-INFINITY


pcs stonith create vmfence fence_vmware_rest 
pcmk_host_map="gw-1:gw-1;gw-2:gw-2;gw-3:gw-3" ip=10.1.2.3 ssl=1 
username=corosync@vsphere.local password=p4ssw0rd ssl_insecure=1


pcs property set stonith-enabled=true


Any comments regarding this configuration?


I have a quick one regarding fencing. I disconnected eth0 from gw-3 and 
the VM was restarted automatically, so I guess the fencing agent kicked 
in. However, I left the VM in that state (so it is seen as offline by 
the other nodes) and I thought it would end up being powered off for 
good. Instead, the fencing agent seems to keep it powered on. Is that 
expected?



Best regards, Adam.


Re: [ClusterLabs] Mutually exclusive resources ?

2023-09-27 Thread Adam Cecile

On 9/27/23 16:02, Ken Gaillot wrote:

On Wed, 2023-09-27 at 15:42 +0300, Andrei Borzenkov wrote:

On Wed, Sep 27, 2023 at 3:21 PM Adam Cecile wrote:

Hello,


I'm struggling to understand if it's possible to create some kind
of constraint to prevent two different resources from running on the
same host.

Basically, I'd like to have floating IP "1" and floating IP "2"
always assigned to DIFFERENT nodes.

Is that possible?

Sure, negative colocation constraint.


Can you give me a hint?


Using crmsh:

colocation IP1-no-with-IP2 -inf: IP1 IP2


Thanks in advance, Adam.

To elaborate, use -INFINITY if you want the IPs to *never* run on the
same node, even if there are no other nodes available (meaning one of
them has to stop). If you *prefer* that they run on different nodes,
but want to allow them to run on the same node in a degraded cluster,
use a finite negative score.
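
Concretely, the two variants might look like this in crmsh (IP1 and IP2
being the placeholder resource names from the example above):

# never on the same node, even if one of them then has to stay stopped:
colocation IP1-not-with-IP2 -inf: IP1 IP2
# prefer different nodes, but tolerate the same node in a degraded cluster:
colocation IP1-rather-not-with-IP2 -100: IP1 IP2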


That's exactly what I tried to do:

crm configure primitive Freeradius systemd:freeradius.service op start 
interval=0 timeout=120 op stop interval=0 timeout=120 op monitor 
interval=60 timeout=100

crm configure clone Clone-Freeradius Freeradius

crm configure primitive Shared-IPv4-Cisco-ISE-1 IPaddr2 params 
ip=10.1.1.1 nic=eth0 cidr_netmask=24 meta migration-threshold=2 
resource-stickiness=50 op monitor interval=60 timeout=30
crm configure primitive Shared-IPv4-Cisco-ISE-2 IPaddr2 params 
ip=10.1.1.2 nic=eth0 cidr_netmask=24 meta migration-threshold=2 
resource-stickiness=50 op monitor interval=60 timeout=30


crm configure location Shared-IPv4-Cisco-ISE-1-Prefer-BRT 
Shared-IPv4-Cisco-ISE-1 50: infra-brt
crm configure location Shared-IPv4-Cisco-ISE-2-Prefer-BTZ 
Shared-IPv4-Cisco-ISE-2 50: infra-btz
crm configure colocation Shared-IPv4-Cisco-ISE-Different-Nodes -100: 
Shared-IPv4-Cisco-ISE-1 Shared-IPv4-Cisco-ISE-2


My hope is that IP1 stays on infra-brt and IP2 goes to infra-btz. I want 
to allow them to keep running on different hosts, so I also added 
stickiness. However, I really do not want them both running on the same 
node, so I added a colocation constraint with a larger negative score.


Does it look good to you?



[ClusterLabs] Mutually exclusive resources ?

2023-09-27 Thread Adam Cecile

Hello,


I'm struggling to understand if it's possible to create some kind of 
constraint to prevent two different resources from running on the same host.


Basically, I'd like to have floating IP "1" and floating IP "2" always 
assigned to DIFFERENT nodes.


Is that possible? Can you give me a hint?


Thanks in advance, Adam.


Re: [ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?

2023-02-20 Thread Adam Cecile


On 2/16/23 20:54, Ken Gaillot wrote:

On Thu, 2023-02-16 at 11:13 +0100, Adam Cecile wrote:

On 2/16/23 07:57, Ulrich Windl wrote:

Adam Cecile wrote on 15.02.2023 at 10:49 in message:

Hello,

Just had some issue with unexpected server behavior after reboot.
This
node was powered off, so cluster was running fine with this
tomcat9
resource running on a different machine.

After powering on this node again, it briefly started tomcat
before
joining the cluster and decided to stop it again. I'm not sure
why.


Here is the systemctl status tomcat9 on this host:

tomcat9.service - Apache Tomcat 9 Web Application Server
   Loaded: loaded (/lib/systemd/system/tomcat9.service;
disabled;
vendor preset: enabled)
  Drop-In: /etc/systemd/system/tomcat9.service.d
   └─override.conf
   Active: inactive (dead)
 Docs:https://tomcat.apache.org/tomcat-9.0-doc/index.html
  


Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine:
[Apache
Tomcat/9.0.43 (Debian)]
Feb 15 09:43:27 server tomcat9[1398]: [...]
Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web
Application Server...
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web
Application Server.
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed
8.017s CPU
time.

You can see it is disabled and should NOT be started by systemd
itself; start/stop is under Corosync control


The systemd resource is defined like this:

primitive tomcat9 systemd:tomcat9.service \
  op start interval=0 timeout=120 \
  op stop interval=0 timeout=120 \
  op monitor interval=60 timeout=100


Any idea why this happened?

Your journal (syslog) should tell you!

Indeed, I overlooked that yesterday... But it says it's Pacemaker that
decided to start it:

Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync members[3]: 1
2 3
Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync joined[2]: 1 2
Feb 15 09:43:26 server3 corosync[568]:   [TOTEM ] A new membership
(1.42d) was formed. Members joined: 1 2
Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server1
state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server1
state is now member
Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] This node is within
the primary component and will provide service.
Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Members[3]: 1 2 3
Feb 15 09:43:26 server3 corosync[568]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Quorum
acquired
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node
server1 state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node
server2 state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server2
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Transition
0 aborted: Peer Halt
Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server1
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Another DC
detected: server2 (op=noop)
Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server2
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State
transition S_ELECTION -> S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Cancelling
timer for action 12 (src=67)
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: No need to
invoke the TE (A_TE_HALT) in state S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server2
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State
transition S_PENDING -> S_NOT_DC
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting
#attrd-protocol[server1]: (unset) -> 2
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Detected
another attribute writer (server2), starting new election
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting
#attrd-protocol[server2]: (unset) -> 2
Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0
10.13.68.12:123
Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up
resolver
=> Feb 15 09:43:28 server3 pacemaker-controld[862]:  notice: Result
of start operation for tomcat9 on server3: ok
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link
change for host: 2 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link
change for host: 1 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: Global data
MTU changed to: 1397
=> Feb 15 09:43:29 server3 pacemaker-controld[862]:  notice:
Requesting local execution of stop operation for tomcat9 on server3

Any idea?

What do the logs on 

Re: [ClusterLabs] Antw: [EXT] Systemd resource started on node after reboot before cluster is stable ?

2023-02-16 Thread Adam Cecile


On 2/16/23 07:57, Ulrich Windl wrote:

Adam Cecile wrote on 15.02.2023 at 10:49 in message:

Hello,

Just had some issue with unexpected server behavior after reboot. This
node was powered off, so cluster was running fine with this tomcat9
resource running on a different machine.

After powering on this node again, it briefly started tomcat before
joining the cluster and decided to stop it again. I'm not sure why.


Here is the systemctl status tomcat9 on this host:

tomcat9.service - Apache Tomcat 9 Web Application Server
   Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled;
vendor preset: enabled)
  Drop-In: /etc/systemd/system/tomcat9.service.d
   └─override.conf
   Active: inactive (dead)
 Docs:https://tomcat.apache.org/tomcat-9.0-doc/index.html  


Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache
Tomcat/9.0.43 (Debian)]
Feb 15 09:43:27 server tomcat9[1398]: [...]
Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web
Application Server...
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web
Application Server.
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU
time.

You can see it is disabled and should NOT be started by systemd itself;
start/stop is under Corosync control


The systemd resource is defined like this:

primitive tomcat9 systemd:tomcat9.service \
  op start interval=0 timeout=120 \
  op stop interval=0 timeout=120 \
  op monitor interval=60 timeout=100


Any idea why this happened?

Your journal (syslog) should tell you!
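
For example, something along these lines should show the relevant
entries from the boot in question (assuming the interesting messages
are in the tomcat9 and pacemaker units):

journalctl -b -u tomcat9.service -u pacemaker.service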


Indeed, I overlooked that yesterday... But it says it's Pacemaker that 
decided to start it:



Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync members[3]: 1 2 3
Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Sync joined[2]: 1 2
Feb 15 09:43:26 server3 corosync[568]:   [TOTEM ] A new membership 
(1.42d) was formed. Members joined: 1 2
Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server1 
state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server1 
state is now member
Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] This node is within 
the primary component and will provide service.

Feb 15 09:43:26 server3 corosync[568]:   [QUORUM] Members[3]: 1 2 3
Feb 15 09:43:26 server3 corosync[568]:   [MAIN  ] Completed service 
synchronization, ready to provide service.

Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Quorum acquired
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node server1 
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Node server2 
state is now member
Feb 15 09:43:26 server3 pacemaker-based[857]:  notice: Node server2 
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: Transition 0 
aborted: Peer Halt
Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server1 
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Another DC 
detected: server2 (op=noop)
Feb 15 09:43:26 server3 pacemaker-fenced[858]:  notice: Node server2 
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State 
transition S_ELECTION -> S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-controld[862]:  warning: Cancelling 
timer for action 12 (src=67)
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: No need to 
invoke the TE (A_TE_HALT) in state S_RELEASE_DC
Feb 15 09:43:26 server3 pacemaker-attrd[860]:  notice: Node server2 
state is now member
Feb 15 09:43:26 server3 pacemaker-controld[862]:  notice: State 
transition S_PENDING -> S_NOT_DC
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting 
#attrd-protocol[server1]: (unset) -> 2
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Detected another 
attribute writer (server2), starting new election
Feb 15 09:43:27 server3 pacemaker-attrd[860]:  notice: Setting 
#attrd-protocol[server2]: (unset) -> 2

Feb 15 09:43:27 server3 IPaddr2(Shared-IPv4)[1258]: INFO:
Feb 15 09:43:27 server3 ntpd[602]: Listen normally on 8 eth0 10.13.68.12:123
Feb 15 09:43:27 server3 ntpd[602]: new interface(s) found: waking up 
resolver
=> Feb 15 09:43:28 server3 pacemaker-controld[862]:  notice: Result of 
start operation for tomcat9 on server3: ok
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link 
change for host: 2 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: PMTUD link 
change for host: 1 link: 0 from 485 to 1397
Feb 15 09:43:29 server3 corosync[568]:   [KNET  ] pmtud: Global data MTU 
changed to: 1397
=> Feb 15 09:43:29 server3 pacemaker-controld[862]:  notice: Requesting 
local execution of stop operation for tomcat9 on server3



Any idea?


Re: [ClusterLabs] Systemd resource started on node after reboot before cluster is stable ?

2023-02-16 Thread Adam Cecile

On 2/15/23 12:16, Andrei Borzenkov wrote:

On Wed, Feb 15, 2023 at 12:49 PM Adam Cecile  wrote:

Hello,

Just had some issue with unexpected server behavior after reboot. This node was 
powered off, so cluster was running fine with this tomcat9 resource running on 
a different machine.

After powering on this node again, it briefly started tomcat before joining the 
cluster and decided to stop it again. I'm not sure why.


Here is the systemctl status tomcat9 on this host:

tomcat9.service - Apache Tomcat 9 Web Application Server
  Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; vendor 
preset: enabled)
 Drop-In: /etc/systemd/system/tomcat9.service.d
  └─override.conf
  Active: inactive (dead)
Docs:https://tomcat.apache.org/tomcat-9.0-doc/index.html

Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache 
Tomcat/9.0.43 (Debian)]
Feb 15 09:43:27 server tomcat9[1398]: [...]
Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web Application 
Server...
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web Application 
Server.
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU time.

You can see it is disabled and should NOT be started

"Disabled" in systemd just means that links in [Install] section are
not present. This unit may be started by explicit request, or by
explicit dependency like Wants or Requires in another unit. Check
"systemctl show -p WantedBy -p RequiredBy tomcat9.service".


Sadly, it is configured as it should be:

RequiredBy=
WantedBy=


[ClusterLabs] Systemd resource started on node after reboot before cluster is stable ?

2023-02-15 Thread Adam Cecile

Hello,

Just had some issue with unexpected server behavior after reboot. This 
node was powered off, so cluster was running fine with this tomcat9 
resource running on a different machine.


After powering on this node again, it briefly started tomcat before 
joining the cluster and decided to stop it again. I'm not sure why.



Here is the systemctl status tomcat9 on this host:

tomcat9.service - Apache Tomcat 9 Web Application Server
 Loaded: loaded (/lib/systemd/system/tomcat9.service; disabled; 
vendor preset: enabled)

    Drop-In: /etc/systemd/system/tomcat9.service.d
 └─override.conf
 Active: inactive (dead)
   Docs: https://tomcat.apache.org/tomcat-9.0-doc/index.html

Feb 15 09:43:27 server tomcat9[1398]: Starting service [Catalina]
Feb 15 09:43:27 server tomcat9[1398]: Starting Servlet engine: [Apache 
Tomcat/9.0.43 (Debian)]

Feb 15 09:43:27 server tomcat9[1398]: [...]
Feb 15 09:43:29 server systemd[1]: Stopping Apache Tomcat 9 Web 
Application Server...

Feb 15 09:43:29 server systemd[1]: tomcat9.service: Succeeded.
Feb 15 09:43:29 server systemd[1]: Stopped Apache Tomcat 9 Web 
Application Server.
Feb 15 09:43:29 server systemd[1]: tomcat9.service: Consumed 8.017s CPU 
time.


You can see it is disabled and should NOT be started by systemd itself; 
start/stop is under Corosync control



The systemd resource is defined like this:

primitive tomcat9 systemd:tomcat9.service \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=120 \
    op monitor interval=60 timeout=100


Any idea why this happened?

Best regards, Adam.


[ClusterLabs] Reload DNSMasq after IPAddr2 change ?

2023-02-09 Thread Adam Cecile

Hello,


I might be stupid, but I'm completely stuck with this requirement. We 
just figured out that the DNSMasq proxy is not working correctly after 
the shared IP address is moved from one host to another, because it 
does not listen on the new address.


My need is to issue a reload to DNSMasq to make it work again, but I 
failed to find anyone describing how to implement this, so I guess I'm 
going about it completely wrong.



Could someone explain how I'm supposed to handle such a situation?


Best regards, Adam.


[ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Adam Cecile

Hello,


I'm experiencing an issue with Corosync/Pacemaker running on Debian 
Buster. The cluster has three nodes running in VMware virtual machines, 
and the cluster fails when VEEAM backs up a virtual machine (I know it 
does bad things, like completely freezing the VM for a few minutes to 
take a disk snapshot).


My biggest issue is that once the backup has completed, the cluster 
stays in a split-brain state, and I'd like it to heal itself. Here is 
the current status:



One node is isolated:

Stack: corosync
Current DC: host2.domain.com (version 2.0.1-9e909a5bdd) - partition 
WITHOUT quorum

Last updated: Sat Aug  8 11:59:46 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on 
host1.domain.com


3 nodes configured
6 resources configured

Online: [ host2.domain.com ]
OFFLINE: [ host3.domain.com host1.domain.com ]


The two other nodes see each other:

Stack: corosync
Current DC: host3.domain.com (version 2.0.1-9e909a5bdd) - partition with 
quorum

Last updated: Sat Aug  8 12:07:56 2020
Last change: Fri Jul 24 07:18:12 2020 by root via cibadmin on 
host1.domain.com


3 nodes configured
6 resources configured

Online: [ host3.domain.com host1.domain.com ]
OFFLINE: [ host2.domain.com ]


The problem is that one of the resources is a floating IP address which 
is currently assigned to two different hosts...



Can you help me configure the cluster correctly so this cannot occur?


Thanks in advance,

Adam.


