Jan Friesse <jfrie...@redhat.com> writes:

> wf...@niif.hu writes:
>
>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>> (in August; in May, it happened 0-2 times a day only, it's slowly
>> ramping up):
>>
>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new
>> configuration.
>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new
>> configuration.
>> vhbl07 corosync[3805]: [MAIN ] Corosync main process was not scheduled
>> for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout
>> increase.
>
> ^^^ This is main problem you have to solve. It usually means that
> machine is too overloaded. It is happening quite often when corosync
> is running inside VM where host machine is unable to schedule regular
> VM running.

Hi Honza,

Corosync isn't running in a VM here, these nodes are 2x8 core servers
hosting VMs themselves as Pacemaker resources. (Incidentally, some of
these VMs run Corosync to form a test cluster, but that should be
irrelevant now.) And they aren't overloaded in any apparent way: Munin
reports 2900% CPU idle (out of 32 hyperthreads). There's no swap, but
the corosync process is locked into memory anyway. It's also running as
SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO
prio 99 kernel threads (migration/* and watchdog/*) under Linux 4.9.

I'll try to take a closer look at the scheduling of these. Can you
recommend some indicators to check out?

Are scheduling delays expected to generate TOTEM membership "changes"
without any leaving and joining nodes?

> As a start you can try what message say = Consider token timeout
> increase. Currently you have 3 seconds, in theory 6 second should be
> enough.

OK, thanks for the tip. Can I do this on-line, without shutting down
Corosync?
-- 
Thanks,
Feri
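
P.S. To track how often the "not scheduled" warnings occur and how long the stalls are, something like the following could be run over the collected logs. This is only a minimal sketch: the regular expression is modelled on the single log line quoted above and may not match other Corosync versions, and the function names (parse_pauses, worst_by_host) are made up for illustration.

```python
import re

# Pattern modelled on the quoted log line; the exact wording is an
# assumption and may differ between Corosync versions.
PAUSE_RE = re.compile(
    r"(?P<host>\S+) corosync\[\d+\]:\s*\[MAIN\s*\]\s*"
    r"Corosync main process was not scheduled for (?P<delay>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)


def parse_pauses(lines):
    """Return (host, delay_ms, threshold_ms) for each matching log line."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append(
                (m.group("host"), float(m.group("delay")), float(m.group("threshold")))
            )
    return pauses


def worst_by_host(pauses):
    """Map each host to the longest scheduling delay seen for it (ms)."""
    worst = {}
    for host, delay, _threshold in pauses:
        worst[host] = max(delay, worst.get(host, 0.0))
    return worst


if __name__ == "__main__":
    import sys

    # Usage: python pauses.py < /path/to/collected.log
    for host, delay in sorted(worst_by_host(parse_pauses(sys.stdin)).items()):
        print(f"{host}: worst stall {delay:.1f} ms")
```

Comparing the per-host worst case against the token timeout would show whether only some nodes stall and how much headroom a larger timeout would give.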