Jan Friesse <jfrie...@redhat.com> writes:

> wf...@niif.hu writes:
>
>> Jan Friesse <jfrie...@redhat.com> writes:
>>
>>> wf...@niif.hu writes:
>>>
>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>>> ramping up):
>>>>
>>>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new
>>>> configuration.
>>>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new
>>>> configuration.
>>>> vhbl07 corosync[3805]: [MAIN ] Corosync main process was not scheduled
>>>> for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout
>>>> increase.
>>>
>>> ^^^ This is main problem you have to solve. It usually means that
>>> machine is too overloaded. It is happening quite often when corosync
>>> is running inside VM where host machine is unable to schedule regular
>>> VM running.
>>
>> Corosync isn't running in a VM here, these nodes are 2x8 core servers
>> hosting VMs themselves as Pacemaker resources. (Incidentally, some of
>> these VMs run Corosync to form a test cluster, but that should be
>> irrelevant now.) And they aren't overloaded in any apparent way: Munin
>> reports 2900% CPU idle (out of 32 hyperthreads). There's no swap, but
>> the corosync process is locked into memory anyway. It's also running as
>> SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
>> 99 kernel threads (migration/* and watchdog/*) under Linux 4.9. I'll
>> try to take a closer look at the scheduling of these. Can you recommend
>> some indicators to check out?
>
> No real hints. But one question. Are you 100% sure memory is locked?
> Because we had problem where mlockall was called in wrong place so
> corosync was actually not locked and it was causing similar issues.
>
> This behavior is fixed by
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
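For anyone who wants a second opinion on whether the process memory is really locked, one option on Linux is the VmLck field of /proc/<pid>/status, which reports the amount of locked memory in kB. The following is only a minimal sketch under that assumption: the helper names (parse_status, is_locked, read_status) and the sample text with its numbers are made up for illustration and do not come from the cluster discussed here.

```python
# Sketch: inspect the Vm* fields of /proc/<pid>/status to see whether
# any memory of a process is locked (VmLck > 0).
# SAMPLE_STATUS is invented illustration data, not real corosync output.

SAMPLE_STATUS = (
    "Name:\tcorosync\n"
    "VmSize:\t  200000 kB\n"
    "VmLck:\t  100000 kB\n"
    "VmRSS:\t  100000 kB\n"
    "Threads:\t4\n"
)


def parse_status(text):
    """Return {field: value_in_kB} for the Vm* lines of a status dump."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("Vm"):
            parts = rest.split()
            if parts and parts[0].isdigit():
                fields[key] = int(parts[0])
    return fields


def is_locked(fields):
    """Heuristic: a non-zero VmLck means at least some pages are locked."""
    return fields.get("VmLck", 0) > 0


def read_status(pid):
    """Read the live status file of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()


fields = parse_status(SAMPLE_STATUS)
print(fields, is_locked(fields))
```

On a real node one would call parse_status(read_status(pid)) with the corosync PID. A VmLck close to VmRSS would suggest the resident pages are locked, while a value of 0 would suggest that mlockall did not take effect.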

I based this assertion on the L flag in the ps STAT column.  The above
commit should not affect me because I'm running corosync with the -f
option:

$ ps l 3805
F   UID   PID  PPID PRI  NI    VSZ    RSS WCHAN  STAT TTY        TIME COMMAND
4     0  3805     1 -100  - 247464 141016 -      SLsl ?        251:10 /usr/sbin/corosync -f

By the way, are the above VSZ and RSS numbers reasonable?

One more thing: these servers run without any swap.

>>> As a start you can try what message say = Consider token timeout
>>> increase. Currently you have 3 seconds, in theory 6 second should be
>>> enough.
>>
>> OK, thanks for the tip.  Can I do this on-line, without shutting down
>> Corosync?
>
> Corosync way is to just edit/copy corosync.conf on all nodes and call
> corosync-cfgtool -R on one of the nodes (crmsh/pcs may have better
> way).

Great, that's what I wanted to know: whether -R is expected to make this
change effective.
--
Thanks,
Feri
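The arithmetic behind "in theory 6 second should be enough" can be sketched as follows. The logged threshold of 2400 ms is consistent with 80% of the current 3000 ms token timeout; the 0.8 factor is therefore an assumption inferred from these two numbers, not something stated in the thread, and the function names are hypothetical.

```python
import re

# Log line quoted from the thread (vhbl07).
LOG = ("Corosync main process was not scheduled for 4317.0054 ms "
       "(threshold is 2400.0000 ms). Consider token timeout increase.")

PATTERN = re.compile(
    r"not scheduled for ([\d.]+) ms \(threshold is ([\d.]+) ms\)")

# Assumption: the warning threshold is 80% of the token timeout,
# which matches 2400 ms for a 3000 ms token.
THRESHOLD_FACTOR = 0.8


def parse_pause(line):
    """Return (stall_ms, threshold_ms) from a scheduling warning, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def warning_threshold(token_ms, factor=THRESHOLD_FACTOR):
    """Threshold in ms that a given token timeout would imply."""
    return token_ms * factor


def stall_within(token_ms, stall_ms):
    """True if a stall of stall_ms stays below the implied threshold."""
    return stall_ms < warning_threshold(token_ms)


stall_ms, threshold_ms = parse_pause(LOG)
for token_ms in (3000, 6000):
    print(token_ms, warning_threshold(token_ms),
          stall_within(token_ms, stall_ms))
```

Under that assumption, a 6000 ms token implies a 4800 ms threshold, which the 4317 ms stall from the log would not have exceeded. The new value would go into the token setting of the totem section in corosync.conf on all nodes, followed by corosync-cfgtool -R on one of them, as described above.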