On 1/13/21 3:31 PM, Ulrich Windl wrote:
Roger Zhou <zz...@suse.com> wrote on 13.01.2021 at 05:32 in message
<97ac2305-85b4-cbb0-7133-ac1372143...@suse.com>:
On 1/12/21 4:23 PM, Ulrich Windl wrote:
Hi!

Before setting up our first pacemaker cluster we thought one low-speed
redundant network would be good in addition to the normal high-speed network.
However, as it seems now (SLES15 SP2), there is NO reasonable RRP mode to
drive such a configuration with corosync.

Passive RRP mode with UDPU still sends each packet through both nets,

Indeed, packets are sent in round-robin fashion.

being throttled by the slower network.
(Originally we were using multicast, but that was even worse)
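
For reference, a corosync 2 setup of the kind described here (udpu transport, passive RRP across a fast and a slow ring) looks roughly like the fragment below; node IDs and addresses are placeholders, not the actual configuration:

    totem {
        version: 2
        transport: udpu        # unicast UDP instead of multicast
        rrp_mode: passive      # despite the name, packets alternate across both rings
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.0.16   # placeholder: fast fibre network
            ring1_addr: 10.0.0.16      # placeholder: slow copper network
        }
        node {
            nodeid: 2
            ring0_addr: 192.168.0.17
            ring1_addr: 10.0.0.17
        }
    }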

Now I realized that even under modest load, I see messages about "retransmit
list", like this:
Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2
Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2 3e4
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 60e 610 612 614
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 610 614
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 614
Jan 08 11:13:41 h16 corosync[3562]:   [TOTEM ] Retransmit List: 6ed


What's the latency of this low-speed link?

The normal net is fibre-based:
4 packets transmitted, 4 received, 0% packet loss, time 3058ms
rtt min/avg/max/mdev = 0.131/0.175/0.205/0.027 ms

The redundant net is copper-based:
5 packets transmitted, 5 received, 0% packet loss, time 4104ms
rtt min/avg/max/mdev = 0.293/0.304/0.325/0.019 ms


Aha, RTT < 1 ms, so the network is fast enough. That clears up my doubt; I had guessed the latency of the slow link might be in the tens or even hundreds of milliseconds. Then I wonder whether the corosync packets are simply unlucky and get delayed by workload on one of the links.


Questions on that:
Will the situation be much better with knet?

knet provides "link_mode: passive", which is not round-robin and may fit your
idea somewhat. But it still doesn't suit your case well, since knet again
assumes similar latency across the links. You may have to tune parameters for
the low-speed link and likely sacrifice some of the benefit of the fast link.
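
To give an idea of the per-link tuning this implies, something along these lines could relax the failure-detection timing on the slow link only; the option names are the per-interface knet settings from corosync.conf(5), but the values are purely illustrative assumptions:

    totem {
        transport: knet
        link_mode: passive

        # Loosen heartbeat timing on the slow copper link (link 1) so brief
        # congestion there does not flap the link; leave the fast link at defaults.
        interface {
            linknumber: 1
            knet_ping_interval: 1000   # ms between knet pings on this link (illustrative)
            knet_ping_timeout: 5000    # ms without a pong before the link is marked down
            knet_pong_count: 2         # pongs required before the link is marked up again
        }
    }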

Well, in the past, when using HP Service Guard, things worked quite differently:
there was a true heartbeat on each cluster net, determining its "being alive",
and when the cluster performed no actions there was no traffic on the cluster links
(except that heartbeat).
When the cluster actually had to talk, it used the link that was flagged
"alive", preferring the primary first, then the secondary when both were
available.


"link_mode: passive" together with knet_link_priority would be useful. Also, use sctp in knet could be the alternative too.

Cheers,
Roger
