I have not configured fencing in our setup. However, I would like to know whether split brain can be avoided when CPU usage is high.
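For reference, the two suggestions in the quoted replies below (trying UDPU unicast transport and reducing totem.netmtu) would look roughly like the following corosync.conf fragment. This is only a sketch for corosync 2.x: the cluster name, both node addresses, and the netmtu value are placeholders and do not come from this thread. On a pcs-managed cluster the file is normally generated by pcs, so the actual layout may differ.

```
totem {
    version: 2
    # placeholder name, not from this thread
    cluster_name: mycluster
    # unicast UDP instead of multicast
    transport: udpu
    # illustrative value below the 1500 default; pick one that fits the network
    netmtu: 1400
}

nodelist {
    # udpu needs an explicit node list; both addresses are placeholders
    node {
        ring0_addr: node1.example.com
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

The same file has to be present on both nodes, and corosync has to be restarted for a transport change to take effect. Neither setting is a substitute for fencing.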
With Regards
Somanath Thilak J

-----Original Message-----
From: Ken Gaillot <[email protected]>
Sent: Monday, June 24, 2019 20:28
To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>; Somanath Jeeva <[email protected]>
Subject: Re: [ClusterLabs] Two node cluster goes into split brain scenario during CPU intensive tasks

On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
> Somanath,
>
> > Hi All,
> >
> > I have a two node cluster with multicast (udp) transport . The
> > multicast IP used in 224.1.1.1 .
>
> Would you mind to give a try to UDPU (unicast)? For two node cluster
> there is going to be no difference in terms of speed/throughput.
>
> >
> > Whenever there is a CPU intensive task the pcs cluster goes into
> > split brain scenario and doesn't recover automatically . We have to

In addition to others' comments: if fencing is enabled, split brain should not be possible. Automatic recovery should work as long as fencing succeeds. With fencing disabled, split brain with no automatic recovery can definitely happen.

> > do a manual restart of services to bring both nodes online again.
> > Before the nodes goes into split brain , the corosync log shows ,
> >
> > May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>
> This is usually happening when:
> - multicast is somehow rate-limited on switch side (configuration/bad
>   switch implementation/...)
> - MTU of network is smaller than 1500 bytes and fragmentation is not
>   allowed -> try reduce totem.netmtu
>
> Regards,
> Honza
>
> > May 24 15:51:42 server1 corosync[4745]: [TOTEM ] A processor failed, forming new configuration.
> > May 24 16:41:42 server1 corosync[4745]: [TOTEM ] A new membership (10.241.31.12:29276) was formed. Members left: 1
> > May 24 16:41:42 server1 corosync[4745]: [TOTEM ] Failed to receive the leave message. failed: 1
> >
> > Is there any way we can overcome this or this may be due to any
> > multicast issues in the network side.
> >
> > With Regards
> > Somanath Thilak J

--
Ken Gaillot <[email protected]>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
