05.12.2017 12:59, Gao,Yan wrote:
> On 12/04/2017 07:55 PM, Andrei Borzenkov wrote:
>> 04.12.2017 14:48, Gao,Yan wrote:
>>> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>>>> 30.11.2017 13:48, Gao,Yan wrote:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with
>>>>>> VMs on vSphere using a shared VMDK as the SBD device. During basic
>>>>>> tests (killing corosync and forcing STONITH), pacemaker was not
>>>>>> started after reboot. In the logs I see during boot:
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>>>> process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
>>>>>> Pacemaker
>>>>>>
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>> stonith with SBD always takes msgwait (at least, visually the host is
>>>>>> not declared OFFLINE until 120s have passed). But the VM reboots
>>>>>> lightning fast and is up and running long before the timeout expires.
>>>>>>
>>>>>> I think I have seen a similar report already. Is this something that
>>>>>> can be fixed by SBD/pacemaker tuning?
>>>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>>>
>>>>
>>>> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it
>>>> has SBD 1.3.0), and with SBD_DELAY_START=yes sbd does not appear to
>>>> watch the disk at all.
>>> It simply waits that long on startup, before starting the rest of the
>>> cluster stack, to make sure that any fencing targeting it has
>>> completed. It intentionally doesn't watch anything during this period
>>> of time.
>>>
>>
>> Unfortunately it waits too long.
>>
>> ha1:~ # systemctl status sbd.service
>> ● sbd.service - Shared-storage based fencing daemon
>>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>> preset: disabled)
>>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>> 4min 16s ago
>>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>> status=0/SUCCESS)
>>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>> watch (code=killed, signa
>>  Main PID: 1792 (code=exited, status=0/SUCCESS)
>>
>> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
>> daemon...
>> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
>> Terminating.
>> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
>> fencing daemon.
>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result
>> 'timeout'.
>>
>> But the real problem is that, despite sbd failing to start, the whole
>> cluster stack continues to run; and because SBD blindly trusts that
>> nodes behave well, fencing appears to succeed after the timeout ...
>> without anyone taking any action on the poison pill ...
> The start of sbd reaches systemd's timeout for starting units, and
> systemd proceeds...
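
For reference, this is roughly the configuration under discussion; the
device path below is only a placeholder, and the timeouts are the
60s/120s values from the original report, which can be read back from
the SBD header on the shared disk:

  # /etc/sysconfig/sbd (excerpt)
  SBD_DEVICE="/dev/disk/by-id/<shared-vmdk>"   # placeholder path
  SBD_DELAY_START="yes"                        # delay start until a pending fence would have expired

  # verify the timeouts written to the SBD header
  ha1:~ # sbd -d /dev/disk/by-id/<shared-vmdk> dump
  ...
  Timeout (watchdog)  : 60
  Timeout (msgwait)   : 120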
Do you consider this normal and intended behavior? Again: as it stands,
the cluster stack can start without working STONITH, and because there
is no confirmation of whether stonith via SBD worked at all, we end up
in split brain.

> TimeoutStartSec should be configured in sbd.service accordingly, to be
> longer than msgwait.
>
And where is this documented? You did not mention it earlier,
/etc/sysconfig/sbd does not mention it, "man sbd" does not mention it.
How are users supposed to know about this?
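
For anyone following along, the kind of change being suggested would
normally be done as a systemd drop-in rather than by editing the
packaged unit file; a rough sketch, where 180 is just an example value
chosen to exceed the 120s msgwait with some margin:

  # systemctl edit sbd.service
  # (or create /etc/systemd/system/sbd.service.d/timeout.conf by hand)
  [Service]
  TimeoutStartSec=180

  # if the drop-in was created manually, pick it up with:
  systemctl daemon-reload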