More progress on this issue:

I have noticed that a corosync start initiates PTR queries for all of the local IP addresses. My production cluster node has many:

 1. area0: 172.30.1.1/27
 2. ctd: 10.1.5.16/31
 3. dep: 10.1.4.2/24
 4. docker0: 172.17.0.1/16
 5. fast: 10.1.5.1/28
 6. gst: 192.168.5.1/24
 7. ha: 10.1.5.25/29
 8. inet: 100.64.64.10/29
 9. iscsi1: 10.1.8.195/28
10. iscsi2: 10.1.8.211/28
11. iscsi3: 10.1.8.227/28
12. knet: 10.1.5.33/28
13. lo0: 10.1.255.1/32
14. lo: 127.0.0.1/8
15. mgmt: 10.1.3.4/24
16. nfpeeringout: 10.1.102.64/31

I created entries in /etc/hosts for all of the above. The corosync freeze NEVER happened again after that. I have two production clusters, 5 nodes in total; I did the same on the remaining nodes. Not a single freeze.
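For illustration, the entries are of this form (the hostnames below are placeholders; I used each node's real hostname plus an interface suffix, one line per local address listed above):

10.1.255.1    node1          # lo0
10.1.5.25     node1-ha       # ha
10.1.5.33     node1-knet     # knet
10.1.8.195    node1-iscsi1   # iscsi1
...

With every local address resolvable from /etc/hosts, the reverse lookups are answered locally and never have to wait on the unreachable name server.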

BTW, I deleted the (bogus) DNS=1.2.3.4 entry from /etc/systemd/resolved.conf, so there is no workaround left anywhere in the cluster configuration.

Based on the above, my guess is that corosync somehow hangs after an accumulated period of PTR query timeouts. Please note that there is NO name server reachable at the time of cluster launch, so these queries go unanswered.

If you think this is a bug, please advise me on how to proceed with filing a report.

Thanks,

On 9/12/24 00:27, Murat Inal wrote:
Hello Ken,

I think I have resolved the problem on my own.

Yes, right after boot, corosync fails to come up. The problem appears to be related to name resolution. I ran corosync in the foreground under strace: corosync froze, and the strace output was suspicious, with many name-resolution-like calls.
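For reference, I ran it roughly like this (the exact strace options are just one reasonable choice):

sudo strace -f -tt -e trace=network corosync -f 2>&1 | tee /tmp/corosync-strace.log

The -f on corosync keeps it in the foreground, and filtering on network calls makes the repeated DNS traffic easy to spot.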

In my failing cluster, I am running a containerized BIND9 for regular name resolution services. Both nodes run systemd-resolved for the nodes' own name resolution. Below are the relevant directives of resolved.conf:

DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=

10.1.5.30/29 is the virtual IP address on the nodes at which BIND9 can be queried. This VIP and the BIND9 container are managed by pacemaker, so after a reboot the node does NOT have the VIP and there is NO container running.

When I changed the directives to:

#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=

corosync runs perfectly and a successful cluster launch follows. 1.2.3.4 is a bogus address, and the node does NOT have a default route before cluster launch, so obviously it does NOT receive any replies to its name queries while corosync is coming up. However, both nodes do have a valid address after a reboot, 10.1.5.25/29 and 10.1.5.26/29 respectively; the 10.1.5.24/29 subnet is locally attached on both nodes.

The last discovery to mention is that I monitored local name resolution while corosync starts ("sudo resolvectl monitor"). The monitor immediately displayed PTR queries for ALL LOCAL IP addresses of the node.
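For reference, a PTR query for a local address such as 10.1.5.25 is simply a lookup of the reverse name (standard reverse-DNS form, not literal resolvectl output):

25.5.1.10.in-addr.arpa  IN  PTR

and with the configured DNS server (the VIP) absent, every one of those lookups can only time out.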

Based on the above, my conclusion is that something goes wrong with name resolution against the non-existent VIP address. In my first message, I mentioned that I was only able to recover corosync by REINSTALLING it from the repo. In order to reinstall, I had to manually set a default route and a name server address (8.8.8.8) so that "apt reinstall corosync" would work; in doing so, I was unintentionally giving systemd-resolved a usable DNS server. So it was NOT about reinstalling corosync, but about letting systemd-resolved use some non-local name server address.
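Concretely, the manual recovery steps were along these lines (the gateway and interface names here are placeholders):

sudo ip route add default via 192.0.2.1    # placeholder uplink gateway
sudo resolvectl dns eth0 8.8.8.8           # or set DNS=8.8.8.8 in resolved.conf
sudo apt reinstall corosync
sudo pcs cluster start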

I have been using corosync/pacemaker in production for a couple of years, probably since Ubuntu Server 21.10, and never encountered such a problem until now. I wrote an ansible playbook to toggle systemd-resolved's DNS directive; however, I think this glitch SHOULD NOT exist.

I will be glad to receive comments on the above.

Regards,


On 8/20/24 21:55, Ken Gaillot wrote:
On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
> [Resending the below due to message format problem]
>
> Dear List,
>
> I have been running two different 3-node clusters for some time. I am
> having a fatal problem with corosync: After a node failure, rebooted
> node does NOT start corosync.
>
> Clusters;
>
>   * All nodes are running Ubuntu Server 24.04
>   * corosync is 3.1.7
>   * corosync-qdevice is 3.0.3
>   * pacemaker is 2.1.6
>   * The third node at both clusters is a quorum device. Cluster is on
>     ffsplit algorithm.
>   * All nodes are baremetal & attached to a dedicated kronosnet network.
>   * STONITH is enabled in one of the clusters and disabled for the other.
>
> corosync & pacemaker service starts (systemd) are disabled. I am
> starting any cluster with the command pcs cluster start.
>
> corosync NEVER starts AFTER a node failure (node is rebooted). There
Do you mean that the first time you run "pcs cluster start" after a
node reboot, corosync does not come up completely?

Try adding "debug: on" to the logging section of
/etc/corosync/corosync.conf
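A minimal logging section with that directive would look something like:

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: on
}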

> is nothing in /var/log/corosync/corosync.log, service freezes as:
>
> Aug 01 12:54:56 [3193] charon corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
> Aug 01 12:54:56 [3193] charon corosync info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
>
> corosync never starts kronosnet. I checked kronosnet interfaces, all OK,
> there is IP connectivity in between. If I do corosync -t, it is the same
> freeze.
>
> I could ONLY manage to start corosync by reinstalling it: apt reinstall
> corosync ; pcs cluster start.
>
> The above issue repeated itself at least 5-6 times. I do NOT see
> anything in syslog either. I will be glad if you lead me on how to
> solve this.
>
> Thanks,
>
> Murat

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/