More progress on this issue:

I have noticed that a corosync start initiates PTR queries for all of the local IP addresses. My production cluster node has many:

 1. area0: 172.30.1.1/27
 2. ctd: 10.1.5.16/31
 3. dep: 10.1.4.2/24
 4. docker0: 172.17.0.1/16
 5. fast: 10.1.5.1/28
 6. gst: 192.168.5.1/24
 7. ha: 10.1.5.25/29
 8. inet: 100.64.64.10/29
 9. iscsi1: 10.1.8.195/28
10. iscsi2: 10.1.8.211/28
11. iscsi3: 10.1.8.227/28
12. knet: 10.1.5.33/28
13. lo0: 10.1.255.1/32
14. lo: 127.0.0.1/8
15. mgmt: 10.1.3.4/24
16. nfpeeringout: 10.1.102.64/31

I created entries in /etc/hosts for all of the above. The corosync freeze NEVER happened again after that. I have two production clusters, 5 nodes in total; I did the same on the remaining nodes. Not a single freeze.
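For illustration, the entries are of this form (the hostnames below are placeholders; I used each node's real hostname plus an interface suffix, one line per local address listed above):

10.1.255.1    node1          # lo0
10.1.5.25     node1-ha       # ha
10.1.5.33     node1-knet     # knet
10.1.8.195    node1-iscsi1   # iscsi1
...

With every local address resolvable from /etc/hosts, the reverse lookups are answered locally and never have to wait on the unreachable name server.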

BTW, I deleted the (bogus) DNS=1.2.3.4 entry from /etc/systemd/resolved.conf, so there is no workaround left anywhere in the cluster configuration.

Based on the above, my guess is that corosync somehow hangs after an accumulated period of PTR query timeouts. Please note that there is NO name server reachable at the time of cluster launch, so these queries go unanswered.

If you think this is a bug, please advise me on how to proceed with filing a report.

Thanks,

On 9/12/24 00:27, Murat Inal wrote:
Hello Ken,

I think I have resolved the problem on my own.

Yes, right after boot, corosync fails to come up. The problem appears to be related to name resolution. I ran corosync in the foreground under strace: corosync froze, and the strace output was suspicious, with many name-resolution-like calls.
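For reference, I ran it roughly like this (the exact strace options are just one reasonable choice):

sudo strace -f -tt -e trace=network corosync -f 2>&1 | tee /tmp/corosync-strace.log

The -f on corosync keeps it in the foreground, and filtering on network calls makes the repeated DNS traffic easy to spot.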

In my failing cluster, I am running a containerized BIND9 for regular name resolution services. Both nodes run systemd-resolved for the nodes' own name resolution. Below are the relevant directives of resolved.conf:

DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=

10.1.5.30/29 is the virtual IP address on the nodes at which BIND9 can be queried. This VIP and the BIND9 container are managed by pacemaker, so after a reboot the node does NOT have the VIP and there is NO container running.

When I changed the directives to:

#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=

corosync runs perfectly and a successful cluster launch follows. 1.2.3.4 is a bogus address, and the node does NOT have a default route before cluster launch, so obviously it does NOT receive any replies to its name queries while corosync is coming up. However, both nodes do have a valid address after a reboot, 10.1.5.25/29 and 10.1.5.26/29 respectively; the 10.1.5.24/29 subnet is locally attached on both nodes.

The last discovery to mention is that I monitored local name resolution while corosync starts ("sudo resolvectl monitor"). The monitor immediately displayed PTR queries for ALL LOCAL IP addresses of the node.
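For reference, a PTR query for a local address such as 10.1.5.25 is simply a lookup of the reverse name (standard reverse-DNS form, not literal resolvectl output):

25.5.1.10.in-addr.arpa  IN  PTR

and with the configured DNS server (the VIP) absent, every one of those lookups can only time out.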

Based on the above, my conclusion is that something goes wrong with name resolution against the non-existent VIP address. In my first message, I mentioned that I was only able to recover corosync by REINSTALLING it from the repo. In order to reinstall, I had to manually set a default route and a name server address (8.8.8.8) so that "apt reinstall corosync" would work; in doing so, I was unintentionally giving systemd-resolved a usable DNS server. So it was NOT about reinstalling corosync, but about letting systemd-resolved use some non-local name server address.
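Concretely, the manual recovery steps were along these lines (the gateway and interface names here are placeholders):

sudo ip route add default via 192.0.2.1    # placeholder uplink gateway
sudo resolvectl dns eth0 8.8.8.8           # or set DNS=8.8.8.8 in resolved.conf
sudo apt reinstall corosync
sudo pcs cluster start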

I have been using corosync/pacemaker in production for a couple of years, probably since Ubuntu Server 21.10, and never encountered such a problem until now. I wrote an ansible playbook to toggle systemd-resolved's DNS directive; however, I think this glitch SHOULD NOT exist.

I will be glad to receive comments on the above.

Regards,


On 8/20/24 21:55, Ken Gaillot wrote:
On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
> [Resending the below due to message format problem]
>
> Dear List,
>
> I have been running two different 3-node clusters for some time. I am
> having a fatal problem with corosync: After a node failure, rebooted
> node does NOT start corosync.
>
> Clusters;
>
>   * All nodes are running Ubuntu Server 24.04
>   * corosync is 3.1.7
>   * corosync-qdevice is 3.0.3
>   * pacemaker is 2.1.6
>   * The third node at both clusters is a quorum device. Cluster is on
>     ffsplit algorithm.
>   * All nodes are baremetal & attached to a dedicated kronosnet network.
>   * STONITH is enabled in one of the clusters and disabled for the other.
>
> corosync & pacemaker service starts (systemd) are disabled. I am
> starting any cluster with the command pcs cluster start.
>
> corosync NEVER starts AFTER a node failure (node is rebooted). There
Do you mean that the first time you run "pcs cluster start" after a
node reboot, corosync does not come up completely?

Try adding "debug: on" to the logging section of
/etc/corosync/corosync.conf
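A minimal logging section with that directive would look something like:

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: on
}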

> is nothing in /var/log/corosync/corosync.log, service freezes as:
>
> Aug 01 12:54:56 [3193] charon corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
> Aug 01 12:54:56 [3193] charon corosync info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim nozzle snmp pie relro bindnow
>
> corosync never starts kronosnet. I checked kronosnet interfaces, all OK,
> there is IP connectivity in between. If I do corosync -t, it is the same
> freeze.
>
> I could ONLY manage to start corosync by reinstalling it: apt reinstall
> corosync ; pcs cluster start.
>
> The above issue repeated itself at least 5-6 times. I do NOT see
> anything in syslog either. I will be glad if you lead me on how to
> solve this.
>
> Thanks,
>
> Murat

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/