Re: [ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks
On 01/07/19 13:26 +0200, Ulrich Windl wrote:
> Jan Pokorný wrote on 27.06.2019 at 12:02 in message
> <20190627100209.gf31...@redhat.com>:
>> On 25/06/19 12:20 -0500, Ken Gaillot wrote:
>>> On Tue, 2019-06-25 at 11:06 +, Somanath Jeeva wrote:
>>> Addressing the root cause, I'd first make sure corosync is running
>>> at real-time priority (I forget the ps option, hopefully someone
>>> else can chime in).
>>
>> In a standard Linux environment, I find this ultimately convenient:
>>
>> # chrt -p $(pidof corosync)
>> pid 6789's current scheduling policy: SCHED_RR
>> pid 6789's current scheduling priority: 99
>
> To me this is like pushing a car that already has a running engine!
> If corosync does crazy things, this will make things worse (i.e.
> enhance the craziness).

Note that said invocation is a read-only fetch of a process
characteristic, so I am not sure about your parable. It should rather
be compared to briefly taking your attention off the road to check the
speedometer: the check itself presumably won't run with such an
ultimate scheduling priority, so the interference should be fairly
minimal, just as with `ps` or any other programmatic fetch of such
data.

>> (requires util-linux, procps-ng)

--
Jan (Poki)

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
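For the record, the ps option Ken could not recall, next to the chrt invocation discussed above, might look as follows. Both commands are read-only; the fallback to the current shell's PID is only there so the snippet runs on a machine without corosync:

```shell
# Inspect (not modify) the scheduler settings of corosync.
# Fall back to the current shell's PID when corosync isn't running,
# so the commands are demonstrable anywhere.
pid=$(pidof corosync 2>/dev/null || echo $$)

chrt -p "$pid"                        # expect SCHED_RR, priority 99 for corosync
ps -o pid,cls,rtprio,comm -p "$pid"   # CLS column: RR, RTPRIO column: 99
```

If CLS reads TS (SCHED_OTHER) and RTPRIO is a dash, corosync is competing with ordinary processes for CPU, which is exactly the situation where heavy load can delay token handling.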
[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks
>>> Jan Pokorný wrote on 27.06.2019 at 12:02 in message
<20190627100209.gf31...@redhat.com>:
> On 25/06/19 12:20 -0500, Ken Gaillot wrote:
>> On Tue, 2019-06-25 at 11:06 +, Somanath Jeeva wrote:
>> Addressing the root cause, I'd first make sure corosync is running
>> at real-time priority (I forget the ps option, hopefully someone
>> else can chime in).
>
> In a standard Linux environment, I find this ultimately convenient:
>
> # chrt -p $(pidof corosync)
> pid 6789's current scheduling policy: SCHED_RR
> pid 6789's current scheduling priority: 99

To me this is like pushing a car that already has a running engine! If
corosync does crazy things, this will make things worse (i.e. enhance
the craziness).

> (requires util-linux, procps-ng)
>
>> Another possibility would be to raise the corosync token timeout to
>> allow for a greater time before a split is declared.
>
> This is the unavoidable trade-off between limiting false positives
> (negligible glitches triggering the riot) and detecting actual
> node/interconnect failures in a timely manner. Just meant to note
> it's not a one-way street; deliberation given the circumstances is
> needed.
>
> --
> Jan (Poki)
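The token-timeout suggestion quoted above maps to a single totem parameter in /etc/corosync/corosync.conf. A minimal sketch; the 10000 ms value is purely illustrative and should be chosen deliberately, given the trade-off:

```
totem {
    version: 2
    # Maximum time (in ms) without a token before a new membership is
    # formed, i.e. before a node is declared lost.  The corosync 2.x
    # default is 1000; raising it tolerates longer CPU/scheduling
    # stalls at the cost of slower detection of real failures.
    token: 10000
}
```

After editing, the configuration has to be reloaded (or corosync restarted) on all nodes so both sides agree on the timeout.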
[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks
>>> Somanath Jeeva wrote on 25.06.2019 at 13:06 in message:
> I have not configured fencing in our setup. However, I would like to
> know if the split brain can be avoided when high CPU load occurs.

It seems you like to ride a bicycle with crossed arms while trying not
to fall ;-)

> With Regards
> Somanath Thilak J
>
> -----Original Message-----
> From: Ken Gaillot
> Sent: Monday, June 24, 2019 20:28
> To: Cluster Labs - All topics related to open-source clustering
> welcomed; Somanath Jeeva
> Subject: Re: [ClusterLabs] Two node cluster goes into split brain
> scenario during CPU intensive tasks
>
> On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
>> Somanath,
>>
>>> Hi All,
>>>
>>> I have a two node cluster with multicast (udp) transport. The
>>> multicast IP used is 224.1.1.1.
>>
>> Would you mind giving UDPU (unicast) a try? For a two-node cluster
>> there is going to be no difference in terms of speed/throughput.
>>
>>> Whenever there is a CPU intensive task the pcs cluster goes into a
>>> split brain scenario and doesn't recover automatically. We have to
>
> In addition to others' comments: if fencing is enabled, split brain
> should not be possible. Automatic recovery should work as long as
> fencing succeeds. With fencing disabled, split brain with no
> automatic recovery can definitely happen.
>
>>> do a manual restart of services to bring both nodes online again.
>>> Before the nodes go into split brain, the corosync log shows:
>>>
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>
>> This is usually happening when:
>> - multicast is somehow rate-limited on the switch side
>>   (configuration/bad switch implementation/...)
>> - the MTU of the network is smaller than 1500 bytes and
>>   fragmentation is not allowed -> try reducing totem.netmtu
>>
>> Regards,
>>   Honza
>>
>>> May 24 15:51:42 server1 corosync[4745]: [TOTEM ] A processor
>>> failed, forming new configuration.
>>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] A new membership
>>> (10.241.31.12:29276) was formed. Members left: 1
>>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] Failed to receive
>>> the leave message. failed: 1
>>>
>>> Is there any way we can overcome this, or may this be due to
>>> multicast issues on the network side?
>>>
>>> With Regards
>>> Somanath Thilak J
>
> --
> Ken Gaillot
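The UDPU suggestion quoted above amounts to switching the transport and listing the nodes explicitly in /etc/corosync/corosync.conf. A sketch for corosync 2.x; only 10.241.31.12 appears in the logs, so the second node's address here is made up:

```
totem {
    version: 2
    transport: udpu               # unicast UDP instead of multicast
    interface {
        ringnumber: 0
        bindnetaddr: 10.241.31.0  # network (not host) address of the ring
    }
}

nodelist {
    node {
        ring0_addr: 10.241.31.12  # node address seen in the logs
        nodeid: 1
    }
    node {
        ring0_addr: 10.241.31.13  # hypothetical address of the peer node
        nodeid: 2
    }
}
```

With udpu there is no dependency on switch-side multicast behavior, which removes one of the failure modes Honza lists below for the Retransmit List messages.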
[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks
>>> Ken Gaillot wrote on 24.06.2019 at 16:57 in message
<95f51b52283d05bbd948e4508c406d7ccb64.ca...@redhat.com>:
> On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
>> Somanath,
>>
>>> Hi All,
>>>
>>> I have a two node cluster with multicast (udp) transport. The
>>> multicast IP used is 224.1.1.1.
>>
>> Would you mind giving UDPU (unicast) a try? For a two-node cluster
>> there is going to be no difference in terms of speed/throughput.
>>
>>> Whenever there is a CPU intensive task the pcs cluster goes into a
>>> split brain scenario and doesn't recover automatically. We have to
>
> In addition to others' comments: if fencing is enabled, split brain
> should not be possible. Automatic recovery should work as long as

... unless the fencing was caused by a persistent communication
problem ...

> fencing succeeds. With fencing disabled, split brain with no
> automatic recovery can definitely happen.
>
>>> do a manual restart of services to bring both nodes online again.
>>> Before the nodes go into split brain, the corosync log shows:
>>>
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>>
>> This is usually happening when:
>> - multicast is somehow rate-limited on the switch side
>>   (configuration/bad switch implementation/...)
>> - the MTU of the network is smaller than 1500 bytes and
>>   fragmentation is not allowed -> try reducing totem.netmtu
>>
>> Regards,
>>   Honza
>>
>>> May 24 15:51:42 server1 corosync[4745]: [TOTEM ] A processor
>>> failed, forming new configuration.
>>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] A new membership
>>> (10.241.31.12:29276) was formed. Members left: 1
>>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] Failed to receive
>>> the leave message. failed: 1
>>>
>>> Is there any way we can overcome this, or may this be due to
>>> multicast issues on the network side?
>>>
>>> With Regards
>>> Somanath Thilak J
>
> --
> Ken Gaillot
[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks
>>> Jan Friesse wrote on 24.06.2019 at 08:52 in message:
> Somanath,
>
>> Hi All,
>>
>> I have a two node cluster with multicast (udp) transport. The
>> multicast IP used is 224.1.1.1.
>
> Would you mind giving UDPU (unicast) a try? For a two-node cluster
> there is going to be no difference in terms of speed/throughput.

I think a better recommendation would be raising the timeouts of the
corosync protocol. I agree that the syslog message provides very
little useful information to the user: WHY does the retransmit list
grow?

>> Whenever there is a CPU intensive task the pcs cluster goes into a
>> split brain scenario and doesn't recover automatically. We have to
>> do a manual restart of services to bring both nodes online again.
>> Before the nodes go into split brain, the corosync log shows:
>>
>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]: [TOTEM ] Retransmit List: 7c 7e
>
> This is usually happening when:
> - multicast is somehow rate-limited on the switch side
>   (configuration/bad switch implementation/...)
> - the MTU of the network is smaller than 1500 bytes and
>   fragmentation is not allowed -> try reducing totem.netmtu
>
> Regards,
>   Honza
>
>> May 24 15:51:42 server1 corosync[4745]: [TOTEM ] A processor failed,
>> forming new configuration.
>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] A new membership
>> (10.241.31.12:29276) was formed. Members left: 1
>> May 24 16:41:42 server1 corosync[4745]: [TOTEM ] Failed to receive
>> the leave message. failed: 1
>>
>> Is there any way we can overcome this, or may this be due to
>> multicast issues on the network side?
>>
>> With Regards
>> Somanath Thilak J
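Honza's second hint, reducing totem.netmtu, is also a one-line corosync.conf change. A sketch assuming the path between the nodes cannot carry full 1500-byte datagrams; the value 1200 is an arbitrary conservative choice, not a recommendation:

```
totem {
    version: 2
    # Default is 1500.  If some hop between the nodes has a smaller MTU
    # and fragmentation is not allowed, corosync's datagrams may be
    # silently dropped; lowering netmtu keeps each packet under that
    # limit, at the cost of more packets per message.
    netmtu: 1200
}
```

A quick way to test the assumption before changing anything is a don't-fragment ping between the nodes, e.g. `ping -M do -s 1472 <peer>`, which fails if the path MTU is below 1500.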