On 04.12.2017 21:55, Andrei Borzenkov wrote:
...
>>>
>>> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
>>> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
>>> disk at all.
>> It simply waits that long on startup before starting the rest of the
>> cluster stack to make sure the fencing that targeted it has returned. It
>> intentionally doesn't watch anything during this period of time.
>>
>
> Unfortunately it waits too long.
>
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
>  Main PID: 1792 (code=exited, status=0/SUCCESS)
>
> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>
> But the real problem is - in spite of SBD failed to start, the whole
> cluster stack continues to run; and because SBD blindly trusts in well
> behaving nodes, fencing appears to succeed after timeout ... without
> anyone taking any action on poison pill ...
>
That's an sbd bug. It declares itself as RequiredBy=corosync.service but
orders itself only Before=pacemaker.service. By systemd design, service A
*MUST* have a Before= dependency on service B if failure to start A should
cause failure to start B. Without that ordering, both units are started in
parallel and the failure of A does not hold B back. *Or* use BindsTo= ...
but that sounds wrong because it would cause B to start briefly and then
be killed.

So the question is what is intended here. Should sbd.service be a
prerequisite for corosync or for pacemaker? Should failure to start SBD be
fatal for startup of the dependent service? Finally, does sbd need an
explicit dependency on pacemaker.service at all (in addition to
corosync.service)?

Adding a Before= dependency on corosync.service fixes the startup logic
for me.

ha1:~ # systemctl start pacemaker.service
A dependency job for pacemaker.service failed. See 'journalctl -xe' for details.
ha1:~ # systemctl -l --no-pager status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/etc/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html

Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
Dec 16 18:56:06 ha1 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview

Dec 16 18:56:06 ha1 systemd[1]: Dependency failed for Corosync Cluster Engine.
Dec 16 18:56:06 ha1 systemd[1]: corosync.service: Job corosync.service/start failed with result 'dependency'.
ha1:~ # systemctl -l --no-pager status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sbd.service.d
           └─before-corosync.conf
   Active: failed (Result: timeout) since Sat 2017-12-16 18:56:06 MSK; 50s ago
  Process: 3675 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signal=TERM)

Dec 16 18:54:36 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 16 18:56:06 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 16 18:56:06 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
ha1:~ # cat /etc/systemd/system/sbd.service.d/before-corosync.conf
[Unit]
Before=corosync.service
ha1:~ #
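
For anyone who wants to try the same workaround, here is a minimal sketch of how the drop-in could be created. The function name write_sbd_dropin is made up for illustration and is not part of sbd. The drop-in path and content are the ones shown in the output above. This is only the local workaround, not a statement of how the packaged unit file should finally look.

```shell
#!/bin/sh
# Sketch only: write_sbd_dropin is a hypothetical helper name, not part of sbd.
# Argument: the systemd unit directory (/etc/systemd/system on the node above).
write_sbd_dropin() {
    unit_dir="$1"
    dropin_dir="$unit_dir/sbd.service.d"
    mkdir -p "$dropin_dir" || return 1
    # Order sbd.service before corosync.service, so that a failed sbd start
    # is seen before the corosync start job runs.
    printf '[Unit]\nBefore=corosync.service\n' > "$dropin_dir/before-corosync.conf"
}

# Typical use as root (left commented out so the sketch has no side effects):
#   write_sbd_dropin /etc/systemd/system
#   systemctl daemon-reload
#   systemctl show -p Before sbd.service
```

After daemon-reload, the Before= list printed by systemctl show should include corosync.service, and starting pacemaker.service with a failing sbd should end in "Dependency failed" as in the output above.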