Ferenc,

wf...@niif.hu (Ferenc Wágner) writes:

Jan Friesse <jfrie...@redhat.com> writes:

wf...@niif.hu writes:

In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
(in August; in May, it happened 0-2 times a day only, it's slowly
ramping up):

vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.

^^^ This is the main problem you have to solve. It usually means that
the machine is overloaded. It happens quite often when corosync
runs inside a VM and the host machine is unable to schedule the VM
regularly.

After some extensive tracing, I think the problem lies elsewhere: my
IPMI watchdog device is slow beyond imagination.

Confirmed: setting watchdog_device: off cluster wide got rid of the
above warnings.


Yep, good you found the issue. This is perfectly possible if ioctl blocks.

Its ioctl operations can take seconds, starving all other functions.
At least, it seems to block the main thread of Corosync.  Is this a
plausible scenario?  Corosync has two threads, what are their roles?

The first (main) thread does almost everything. It runs the main loop (epoll) I described in my previous mail.

The second thread is created by libqb and is used only for logging. This prevents corosync from blocking when a syslog or file log write blocks for some reason. It means some messages may be lost, but that is still better than blocking.

Back to your problem. It's definitely a HW issue, but I'm thinking about how to work around it in software. Right now, I can see two ways:
1. Set the watchdog FD to non-blocking right at the end of setup_watchdog. This is preferred, but I'm not sure it's really going to work.
2. Create a thread which makes sure to tickle the watchdog regularly.
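
To make option 2 a bit more concrete, here is a rough sketch of such a thread. It is not Corosync code: every name in it (wd_tickler, wd_tickler_start, wd_tickler_stop, wd_count_tickle) is made up for illustration, and the tickle itself is passed in as a callback so the sketch can run without a watchdog device. A real callback would presumably issue the keepalive ioctl (WDIOC_KEEPALIVE from <linux/watchdog.h>) on the watchdog FD, and that call is exactly the one that may block for seconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback type: performs one (possibly slow) watchdog tickle. */
typedef int (*wd_tickle_fn)(void *ctx);

struct wd_tickler {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    bool stop;               /* protected by lock */
    unsigned interval_ms;
    wd_tickle_fn tickle;
    void *ctx;
};

/* Stand-in tickle that only counts calls. A real one would do something
 * like ioctl(fd, WDIOC_KEEPALIVE, 0) on the watchdog device. */
int wd_count_tickle(void *ctx)
{
    atomic_fetch_add((atomic_int *)ctx, 1);
    return 0;
}

static void *wd_tickler_run(void *arg)
{
    struct wd_tickler *t = arg;

    pthread_mutex_lock(&t->lock);
    while (!t->stop) {
        /* Drop the lock while tickling: only this thread may stall here. */
        pthread_mutex_unlock(&t->lock);
        (void)t->tickle(t->ctx);
        pthread_mutex_lock(&t->lock);

        /* Absolute deadline for the next tickle. */
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += t->interval_ms / 1000;
        deadline.tv_nsec += (long)(t->interval_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        /* Sleep until the deadline passes or a stop is requested. */
        while (!t->stop) {
            if (pthread_cond_timedwait(&t->wake, &t->lock, &deadline) == ETIMEDOUT)
                break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

/* Starts the tickler thread; returns 0 on success, an errno value otherwise. */
int wd_tickler_start(struct wd_tickler *t, unsigned interval_ms,
                     wd_tickle_fn tickle, void *ctx)
{
    assert(t != NULL && tickle != NULL);
    t->stop = false;
    t->interval_ms = interval_ms;
    t->tickle = tickle;
    t->ctx = ctx;
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->wake, NULL);

    int rc = pthread_create(&t->thread, NULL, wd_tickler_run, t);
    if (rc != 0) {
        pthread_cond_destroy(&t->wake);
        pthread_mutex_destroy(&t->lock);
    }
    return rc;
}

/* Asks the thread to stop, waits for it, and releases its resources. */
void wd_tickler_stop(struct wd_tickler *t)
{
    assert(t != NULL);
    pthread_mutex_lock(&t->lock);
    t->stop = true;
    pthread_cond_signal(&t->wake);
    pthread_mutex_unlock(&t->lock);
    pthread_join(t->thread, NULL);
    pthread_cond_destroy(&t->wake);
    pthread_mutex_destroy(&t->lock);
}
```

The main loop would call wd_tickler_start() once after the watchdog is set up and wd_tickler_stop() on shutdown. The obvious downside is that the watchdog then only proves this helper thread is alive, not the main loop, so a real implementation would probably also need the main thread to report its own progress to the tickler.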

Regards,
  Honza

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
