Re: [ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Jan Pokorný
On 01/07/19 13:26 +0200, Ulrich Windl wrote:
 Jan Pokorný wrote on 27.06.2019 at 12:02
 in message <20190627100209.gf31...@redhat.com>:
>> On 25/06/19 12:20 ‑0500, Ken Gaillot wrote:
>>> On Tue, 2019‑06‑25 at 11:06 +, Somanath Jeeva wrote:
>>> Addressing the root cause, I'd first make sure corosync is running at
>>> real‑time priority (I forget the ps option, hopefully someone else can
>>> chime in).
>> 
>> In a standard Linux environment, I find this ultimately convenient:
>> 
>>   # chrt -p $(pidof corosync)
>>   pid 6789's current scheduling policy: SCHED_RR
>>   pid 6789's current scheduling priority: 99
> 
> To me this is like pushing a car that already has a running engine! If
> corosync does crazy things, this will make things worse (i.e. enhance the
> craziness).

Note that said invocation is a read-only fetch of the process's
scheduling characteristics.  So I'm not sure about your parable; it
should rather be compared to briefly not paying full attention to
driving while checking the speedometer (the check itself presumably
won't run with such an elevated scheduling priority, so the
interference should be fairly minimal, just as with `ps` or any other
programmatic fetch of such data).

>> (requires util-linux, procps-ng)
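
For completeness, since Ken mentioned forgetting the `ps` option: with
procps-ng the same information should be obtainable with something along
the lines of

  # ps -C corosync -o pid,cls,rtprio,comm
    PID CLS RTPRIO COMMAND
   6789  RR     99 corosync

(`cls` prints the scheduling class, `rtprio` the real-time priority; the
sample output above simply mirrors the chrt example.)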

-- 
Jan (Poki)



[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Jan Pokorný wrote on 27.06.2019 at 12:02 in message
<20190627100209.gf31...@redhat.com>:
> On 25/06/19 12:20 ‑0500, Ken Gaillot wrote:
>> On Tue, 2019‑06‑25 at 11:06 +, Somanath Jeeva wrote:
>> Addressing the root cause, I'd first make sure corosync is running at
>> real‑time priority (I forget the ps option, hopefully someone else can
>> chime in).
> 
> In a standard Linux environment, I find this ultimately convenient:
> 
>   # chrt -p $(pidof corosync)
>   pid 6789's current scheduling policy: SCHED_RR
>   pid 6789's current scheduling priority: 99

To me this is like pushing a car that already has a running engine! If
corosync does crazy things, this will make things worse (i.e. enhance the
craziness).

> 
> (requires util-linux, procps-ng)
> 
>> Another possibility would be to raise the corosync token
>> timeout to allow for a greater time before a split is declared.
> 
> This is the unavoidable trade-off between limiting false positives
> (negligible glitches triggering the riot) and detecting the actual
> node/interconnect failures in a timely manner.  Just meant to note
> it's not a one-way street; deliberation given the circumstances is
> needed.
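
As a rough sketch of the token-timeout knob Ken mentions (the value is
only illustrative; the corosync 2.x default is 1000 ms, and tuning should
follow the deliberation above):

  totem {
      version: 2
      # how long (ms) to wait for the token before a membership change
      # is declared; larger values ride out CPU/network stalls at the
      # cost of slower detection of real failures
      token: 10000
  }

On recent corosync versions the changed configuration can be reloaded
with `corosync-cfgtool -R`; otherwise a restart of the stack is needed.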
> 
> ‑‑ 
> Jan (Poki)




[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Somanath Jeeva wrote on 25.06.2019 at 13:06 in message:

> I have not configured fencing in our setup. However, I would like to know if
> the split brain can be avoided when high CPU load occurs.

It seems you like to ride a bicycle with crossed arms while trying to avoid
falling ;-)

> 
> 
> With Regards
> Somanath Thilak J
> 
> ‑Original Message‑
> From: Ken Gaillot  
> Sent: Monday, June 24, 2019 20:28
> To: Cluster Labs ‑ All topics related to open‑source clustering welcomed 
> ; Somanath Jeeva 
> Subject: Re: [ClusterLabs] Two node cluster goes into split brain scenario 
> during CPU intensive tasks
> 
> On Mon, 2019‑06‑24 at 08:52 +0200, Jan Friesse wrote:
>> Somanath,
>> 
>> > Hi All,
>> > 
>> > I have a two node cluster with multicast (udp) transport. The
>> > multicast IP used is 224.1.1.1.
>> 
>> Would you mind giving UDPU (unicast) a try? For a two-node cluster
>> there is going to be no difference in terms of speed/throughput.
>> 
>> > 
>> > Whenever there is a CPU intensive task the pcs cluster goes into a
>> > split brain scenario and doesn't recover automatically. We have to
> 
> In addition to others' comments: if fencing is enabled, split brain should
> not be possible. Automatic recovery should work as long as fencing succeeds.
> With fencing disabled, split brain with no automatic recovery can definitely
> happen.
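
To illustrate Ken's point for the original setup: enabling fencing with
pcs boils down to creating a stonith resource and keeping stonith enabled.
A rough sketch, with a placeholder fence agent and placeholder parameters
(these depend entirely on the actual hardware and fence-agents version):

  # pcs stonith create fence-node1 fence_ipmilan \
        ip=<bmc-address> username=<user> password=<password> \
        pcmk_host_list=node1
  # pcs property set stonith-enabled=true

Only with a working stonith device can the surviving node safely recover
resources after a split.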
> 
>> > do a manual restart of services to bring both nodes online again. 
>> 
>> Before the nodes go into split brain, the corosync log shows:
>> > 
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> 
>> This is usually happening when:
>> - multicast is somehow rate-limited on the switch side (configuration/bad
>> switch implementation/...)
>> - the MTU of the network is smaller than 1500 bytes and fragmentation is not
>> allowed -> try reducing totem.netmtu
>> 
>> Regards,
>>Honza
>> 
>> 
>> > May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor 
>> > failed, forming new configuration.
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership
>> > (10.241.31.12:29276) was formed. Members left: 1
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive the
>> > leave message. failed: 1
>> > 
>> > Is there any way we can overcome this, or might this be due to multicast
>> > issues on the network side?
>> > 
>> > With Regards
>> > Somanath Thilak J
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
> ‑‑
> Ken Gaillot 
> 




[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Ken Gaillot wrote on 24.06.2019 at 16:57 in message
<95f51b52283d05bbd948e4508c406d7ccb64.ca...@redhat.com>:
> On Mon, 2019‑06‑24 at 08:52 +0200, Jan Friesse wrote:
>> Somanath,
>> 
>> > Hi All,
>> > 
>> > I have a two node cluster with multicast (udp) transport. The
>> > multicast IP used is 224.1.1.1.
>> 
>> Would you mind giving UDPU (unicast) a try? For a two-node cluster
>> there is going to be no difference in terms of speed/throughput.
>> 
>> > 
>> > Whenever there is a CPU intensive task the pcs cluster goes into a
>> > split brain scenario and doesn't recover automatically. We have to
> 
> In addition to others' comments: if fencing is enabled, split brain
> should not be possible. Automatic recovery should work as long as

---unless the fencing was caused by a persistent communication problem...

> fencing succeeds. With fencing disabled, split brain with no automatic
> recovery can definitely happen.
> 
>> > do a manual restart of services to bring both nodes online again. 
>> 
>> Before the nodes go into split brain, the corosync log shows:
>> > 
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> 
>> This is usually happening when:
>> - multicast is somehow rate-limited on the switch side (configuration/bad
>> switch implementation/...)
>> - the MTU of the network is smaller than 1500 bytes and fragmentation is not
>> allowed -> try reducing totem.netmtu
>> 
>> Regards,
>>Honza
>> 
>> 
>> > May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor
>> > failed, forming new configuration.
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership
>> > (10.241.31.12:29276) was formed. Members left: 1
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive
>> > the leave message. failed: 1
>> > 
>> > Is there any way we can overcome this, or might this be due to multicast
>> > issues on the network side?
>> > 
>> > With Regards
>> > Somanath Thilak J
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
> ‑‑ 
> Ken Gaillot 
> 




[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Jan Friesse wrote on 24.06.2019 at 08:52 in message:
> Somanath,
> 
>> Hi All,
>> 
>> I have a two node cluster with multicast (udp) transport. The multicast IP
>> used is 224.1.1.1.
> 
> Would you mind giving UDPU (unicast) a try? For a two-node cluster
> there is going to be no difference in terms of speed/throughput.

I think a better recommendation would be raising the timeouts of the corosync
protocol. I agree that the syslog message provides very little useful
information to the user: WHY does the retransmit list grow?
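
For reference, Honza's unicast suggestion corresponds roughly to the
following corosync.conf fragment (corosync 2.x syntax; 10.241.31.12 is
taken from the logs, the peer address and network below are assumptions):

  totem {
      version: 2
      transport: udpu
      interface {
          ringnumber: 0
          bindnetaddr: 10.241.31.0
      }
  }
  nodelist {
      node {
          ring0_addr: 10.241.31.11
          nodeid: 1
      }
      node {
          ring0_addr: 10.241.31.12
          nodeid: 2
      }
  }

Raising the token timeout (see the totem.token example earlier in this
thread) can of course be combined with either transport.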

> 
>> 
>> Whenever there is a CPU intensive task the pcs cluster goes into a split
>> brain scenario and doesn't recover automatically. We have to do a manual
>> restart of services to bring both nodes online again.
> Before the nodes go into split brain, the corosync log shows:
>> 
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> 
> This is usually happening when:
> - multicast is somehow rate-limited on the switch side (configuration/bad
> switch implementation/...)
> - the MTU of the network is smaller than 1500 bytes and fragmentation is not
> allowed -> try reducing totem.netmtu
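
Both causes above can be narrowed down. Multicast delivery between the
nodes can be tested with omping, run on both nodes in parallel (the
multicast address and server1 come from this thread; the second host name
is an assumption):

  # omping -m 224.1.1.1 server1 server2

And lowering the totem MTU is a one-line corosync.conf change (the value
is only illustrative):

  totem {
      netmtu: 1200
  }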
> 
> Regards,
>Honza
> 
> 
>> May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor failed,
>> forming new configuration.
>> May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership
>> (10.241.31.12:29276) was formed. Members left: 1
>> May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive the
>> leave message. failed: 1
>> 
>> Is there any way we can overcome this, or might this be due to multicast
>> issues on the network side?
>> 
>> With Regards
>> Somanath Thilak J
>> 
>> 
>> 
>> 
>> 
>> 
>> 


