On Fri, Mar 14, 2025 at 12:48 PM chenzu...@gmail.com
<chenzu...@gmail.com> wrote:
>
>
> Background:
> There are 11 physical machines, with two virtual machines running on each 
> physical machine.
> lustre-mds-nodexx runs the Lustre MDS service, and lustre-oss-nodexx runs the 
> Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces, 
> service1 and service2.
> Pacemaker is used to ensure high availability of the Lustre services.
> Lustre 2.15.5 + Corosync 3.1.5 + Pacemaker 2.1.0-8.el8 + pcs 0.10.8
>
> Issue: During testing, the network interface service1 on lustre-oss-node30 
> and lustre-oss-node40 was repeatedly brought up and down every second (to 
> simulate a network failure).
> The Corosync logs showed that heartbeats were lost, which triggered a fencing 
> action that powered off the nodes whose heartbeats had been lost.
> Given that Corosync is configured with redundant networks, why did the 
> heartbeat loss occur? Is it due to a configuration issue, or is Corosync not 
> designed to handle this scenario?

I cannot answer this question, but the common advice on this list has been
to *not* test by bringing an interface down, but rather by blocking
communication, e.g. using netfilter (iptables/nftables).
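
For example, something along these lines (just a rough sketch: it assumes
the default corosync knet port 5405/udp and that service1 is the interface
carrying one of the rings; adjust to your environment and port settings):

  # drop corosync traffic on service1 instead of taking the link down
  iptables -A INPUT  -i service1 -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -o service1 -p udp --dport 5405 -j DROP

  # restore communication afterwards
  iptables -D INPUT  -i service1 -p udp --dport 5405 -j DROP
  iptables -D OUTPUT -o service1 -p udp --dport 5405 -j DROP

That keeps the link state, addresses and routes intact, which is much
closer to a real network failure than repeatedly downing the interface.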

>
> Other:
> The configuration of corosync.conf can be found in the attached file 
> corosync.conf.
> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
>
>
> ________________________________
> chenzu...@gmail.com
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
