On Fri, Mar 14, 2025 at 12:48 PM chenzu...@gmail.com <chenzu...@gmail.com> wrote:
>
> Background:
> There are 11 physical machines, with two virtual machines running on each
> physical machine. lustre-mds-nodexx runs the Lustre MDS service, and
> lustre-oss-nodexx runs the Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces,
> service1 and service2.
> Pacemaker is used to provide high availability for the Lustre services.
> lustre (2.15.5) + corosync (3.1.5) + pacemaker (2.1.0-8.el8) + pcs (0.10.8)
>
> Issue: During testing, the network interface service1 on lustre-oss-node30
> and lustre-oss-node40 was repeatedly brought up and down every second (to
> simulate a network failure). The corosync logs showed that heartbeats were
> lost, which triggered a fencing action that powered off the nodes that had
> lost heartbeats.
> Given that corosync is configured with redundant networks, why did the
> heartbeat loss occur? Is this a configuration issue, or is corosync simply
> not designed to handle this scenario?
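For readers unfamiliar with the setup described above: a redundant two-link
knet configuration in corosync 3 is typically declared along the lines of the
sketch below. This is illustrative only; the cluster name, node IDs, and
addresses are assumptions, and the poster's actual settings are in the
attached corosync.conf.

    totem {
        version: 2
        cluster_name: lustre-ha        # assumed name, for illustration
        transport: knet
        link_mode: passive             # one link at a time; "active" uses both
        interface {
            linknumber: 0              # first link (e.g. the service1 network)
            knet_ping_interval: 500    # ms between knet link heartbeats
            knet_ping_timeout: 1000    # ms of silence before a link is marked down
        }
        interface {
            linknumber: 1              # second link (e.g. the service2 network)
        }
    }

    nodelist {
        node {
            name: lustre-oss-node30
            nodeid: 30                 # assumed nodeid
            ring0_addr: 192.168.1.30   # address on the first network (assumed)
            ring1_addr: 192.168.2.30   # address on the second network (assumed)
        }
        # ... one node {} block per cluster node ...
    }

With such a configuration, knet should fail over between links, which is why
the question of how both links could appear lost at once is a reasonable one.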
I cannot answer this question, but the common advice on this list has been to
*not* test by bringing an interface down, but by blocking communication
instead, e.g. with netfilter (iptables/nftables); see the sketch after the
quoted text below.

> Other:
> The configuration of corosync.conf can be found in the attached file
> corosync.conf.
> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
> ________________________________
> chenzu...@gmail.com
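For reference, a netfilter-based test of the kind suggested above could look
like the following shell sketch. It is a minimal illustration, not a verified
procedure: it assumes corosync's default UDP port 5405 and reuses the
interface name service1 from the original post; both should be adjusted to
match the actual corosync.conf.

    # Simulate a link failure by dropping corosync traffic instead of
    # bringing the interface down. Assumes the default corosync/knet port
    # 5405/udp and the interface name "service1" from the original post.
    iptables -A INPUT  -i service1 -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -o service1 -p udp --dport 5405 -j DROP

    # ... observe cluster behaviour, e.g. with: corosync-cfgtool -s ...

    # Restore communication by deleting the same rules.
    iptables -D INPUT  -i service1 -p udp --dport 5405 -j DROP
    iptables -D OUTPUT -o service1 -p udp --dport 5405 -j DROP

Unlike flapping the interface, this leaves the interface and its address
configured, so only the communication path is cut, which is closer to a real
network failure from corosync's point of view.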