
I've followed several tutorials about setting up a simple three-node
cluster, with no resources (yet), under CentOS 7.

I've discovered the cluster won't restart upon rebooting a node.

The other two nodes, however, do claim the cluster is up, as shown
with 'pcs status cluster'.

I tracked down that on the rebooted node, corosync exited with a
'0' status.  Nothing in the log looks like an outright error
message, but this was recorded:

   [MAIN  ] Corosync main process was not scheduled for 2145.7053
   ms (threshold is 1320.0000 ms). Consider token timeout increase.

This seems related:

   High Availability cluster node logs the message "Corosync main
   process was not scheduled for X ms (threshold is Y ms). Consider
   token timeout increase."

I've confirmed that corosync is running with the maximum realtime
scheduling priority:

   [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
   CMD                         RTPRIO
   corosync                        99

I am doing my testing in an admittedly underprovisioned VM environment.

I've used this same environment for CentOS 6 / heartbeat-based
solutions, and they were nowhere near as sensitive to these timing
issues.

Manually running 'pcs cluster start' does indeed fire everything
up without a hitch, and remains running for days at a crack.

The 'consider token timeout increase' message has me looking at this:

Which makes this assertion:

   RHEL 7 or 8

   If no token value is specified in the corosync configuration, the
   default is 1000 ms, or 1 second for a 2 node cluster, increasing
   by 650ms for each additional member.

I have a three-node cluster, and the arithmetic for totem.token
seems to hold:

   [root@node3 ~]# corosync-cmapctl | grep totem.token
   runtime.config.totem.token (u32) = 1650
   runtime.config.totem.token_retransmit (u32) = 392
   runtime.config.totem.token_retransmits_before_loss_const (u32) = 4

I'm confused on a number of issues:

- The 'totem.token' value of 1650 doesn't seem to be related to the
   threshold number in the diagnostic message the corosync service
   logged:

     threshold is 1320.0000 ms

   Can someone explain the relationship between these values?

Yes. The threshold is 80% of the token timeout in use: 1650 ms * 0.8 = 1320 ms.
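
As a quick check of that arithmetic, here is a minimal sketch. The function name token_threshold is made up for illustration and is not part of corosync; it only applies the 80% figure stated above to a token value in milliseconds.

```shell
# Hypothetical helper (not a corosync command): compute the
# "not scheduled" warning threshold as 80% of the token timeout.
# Argument: token timeout in whole milliseconds.
token_threshold() {
    echo $(( $1 * 80 / 100 ))
}

# Token timeout reported by corosync-cmapctl on the three-node cluster.
token_threshold 1650   # prints 1320
```

So a scheduling pause of 2145 ms is past the 1320 ms threshold, and in fact past the whole 1650 ms token timeout.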

- If I manually set 'totem.token' to a higher value, am I responsible
   for tracking the number of nodes in the cluster, to keep in
   alignment with what Red Hat's page says?

Nope. I've tried to explain what is really happening in the manpage corosync.conf(5). totem.token and totem.token_coefficient are used in the following formula:

runtime.config.totem.token = totem.token + (number_of_nodes - 2) * totem.token_coefficient

Corosync uses runtime.config.totem.token.
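
The formula can be tried out with a small sketch like the one below. The function name effective_token is hypothetical, and the values 1000 and 650 are the defaults quoted from the Red Hat page earlier in this thread. The guard for fewer than three nodes reflects my reading that the coefficient only applies from the third node on; treat that as an assumption and check corosync.conf(5).

```shell
# Hypothetical helper (not a corosync command): effective token timeout.
# Arguments: totem.token (ms), totem.token_coefficient (ms), number of nodes.
effective_token() {
    if [ "$3" -lt 3 ]; then
        # Assumption: the coefficient is not applied below three nodes.
        echo "$1"
    else
        echo $(( $1 + ($3 - 2) * $2 ))
    fi
}

# Defaults from the quoted Red Hat text, three-node cluster.
effective_token 1000 650 3   # prints 1650
```

If you set totem.token by hand, the same formula is still applied on top of your value, so there is no need to track the node count yourself.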

- Under these conditions, when corosync exits, why does it do so
   with a zero status? It seems to me that if it exited at all,
   without someone controllably stopping the service, it warrants a
   non-zero status.

That's a good question. How reproducible is the issue? Corosync shouldn't "exit" with zero status.

- Is there a recommended way to alter either pacemaker/corosync or
   systemd configuration of these services to harden against resource

Increasing the token timeout seems like the right way to go.

   I don't know if corosync's startup can be deferred until the CPU
   load settles, or if some automatic retry can be set up...

This seems more like an init system question.


Details of my environment; I'm happy to provide others, if anyone
has any specific questions:

   [root@node1 ~]# cat /etc/centos-release
   CentOS Linux release 7.6.1810 (Core)
   [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
