04.12.2017 14:48, Gao,Yan wrote:
> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>> 30.11.2017 13:48, Gao,Yan wrote:
>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster with
>>>> VMs on vSphere using a shared VMDK as SBD. During basic tests (killing
>>>> corosync to force STONITH), pacemaker was not started after reboot.
>>>> In the logs I see during boot:
>>>>
>>>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
>>>> just fenced by sapprod01p for sapprod01p
>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>>>> process (3151) can no longer be respawned,
>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>>>> Pacemaker
>>>>
>>>> SBD timeouts are 60 s for watchdog and 120 s for msgwait. It seems
>>>> that STONITH with SBD always takes msgwait (at least, visually the
>>>> host is not declared OFFLINE until 120 s have passed). But the VM
>>>> reboots lightning fast and is up and running long before the timeout
>>>> expires.
>>>>
>>>> I think I have seen a similar report already. Is this something that
>>>> can be fixed by SBD/pacemaker tuning?
>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>
>> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it
>> has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to
>> watch the disk at all.
> It simply waits that long on startup, before starting the rest of the
> cluster stack, to make sure any fencing that targeted it has completed.
> It intentionally doesn't watch anything during this period of time.
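(For reference, the knob being discussed — a minimal sketch of /etc/sysconfig/sbd with a placeholder device path, plus sbd's dump command, which prints the watchdog/msgwait timeouts actually stored on the disk:

# /etc/sysconfig/sbd -- sketch; the device path is a placeholder
SBD_DEVICE="/dev/disk/by-id/<shared-vmdk>"
SBD_DELAY_START=yes

ha1:~ # sbd -d /dev/disk/by-id/<shared-vmdk> dump

With SBD_DELAY_START=yes, sbd delays startup of the cluster stack by the msgwait timeout — 120 s in the setup quoted above.)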
Unfortunately, it waits too long.

ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
  Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
 Main PID: 1792 (code=exited, status=0/SUCCESS)

Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.

But the real problem is that even though sbd failed to start, the rest of the cluster stack continues to run; and because SBD blindly trusts that nodes behave well, fencing appears to succeed after the msgwait timeout ... without anyone actually acting on the poison pill ...

ha1:~ # systemctl show sbd.service -p RequiredBy
RequiredBy=corosync.service

but

ha1:~ # systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 2059 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 2073 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─2073 corosync

and

ha1:~ # crm_mon -1r
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Mon Dec  4 21:53:24 2017
Last change: Mon Dec  4 21:47:25 2017 by hacluster via crmd on ha1

2 nodes configured
1 resource configured

Online: [ ha1 ha2 ]

Full list of resources:

stonith-sbd     (stonith:external/sbd): Started ha1

and if I now sever the connection between the two nodes, I will get two single-node clusters, each believing it won ...
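If I read the journal right, the start was killed exactly 90 s after it began (21:45:32 -> 21:47:02), which matches systemd's default TimeoutStartSec of 90 s being shorter than the 120 s msgwait delay that SBD_DELAY_START introduces. An untested sketch of a workaround via a drop-in — the file name and the 180 s value are my own guesses, picked only to exceed msgwait:

# /etc/systemd/system/sbd.service.d/timeout.conf
[Service]
# allow more time than the msgwait-derived startup delay (120 s here)
TimeoutStartSec=180

followed by "systemctl daemon-reload". That would only paper over the first symptom, of course; the stack still should not keep running while sbd is down.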
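The second symptom looks like plain systemd semantics to me: RequiredBy=corosync.service means corosync carries Requires=sbd.service, but per systemd.unit(5), Requires= alone neither fails corosync's start when sbd fails to start (that additionally needs an After=sbd.service ordering) nor stops corosync when sbd dies later. If one really wanted corosync to be active only while sbd is active, a sketch based on BindsTo= semantics might look like this (drop-in path again my own choice, and I have not tested what orderings the stock units already carry):

# /etc/systemd/system/corosync.service.d/bind-to-sbd.conf
[Unit]
# combined with After=, corosync can only be active while sbd.service is active
BindsTo=sbd.service
After=sbd.service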