01.12.2017 22:36, Gao,Yan wrote:
> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>> 30.11.2017 16:11, Klaus Wenninger wrote:
>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:
>>>>
>>>> >>> "Gao,Yan" <y...@suse.com> wrote on 30.11.2017 at 11:48 in message
>>>> <e71afccc-06e3-97dd-c66a-1b4bac550...@suse.com>:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>>>>>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>>>>>> corosync and forcing STONITH, pacemaker was not started after reboot.
>>>>>> In logs I see during boot
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
>>>>>>
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>> stonith with SBD always takes msgwait (at least, visually the host is not
>>>>>> declared as OFFLINE until 120s have passed). But the VM reboots lightning
>>>>>> fast and is up and running long before the timeout expires.
>>>>
>>>> As msgwait was intended for the message to arrive, and not for the
>>>> reboot time (I guess), this just shows a fundamental problem in SBD
>>>> design: Receipt of the fencing command is not confirmed (other than
>>>> by seeing the consequences of its execution).
>>>
>>> The 2 x msgwait is not for confirmations but for writing the poison-pill
>>> and for having it read by the target-side.
>>
>> Yes, of course, but that's not what Ulrich likely intended to say.
>> msgwait must account for worst case storage path latency, while in
>> normal cases it happens much faster. If the fenced node could acknowledge
>> having been killed after reboot, the stonith agent could return success
>> much earlier.
> How could an alive man be sure he died before? ;)
>
It does not need to. It simply needs to write something on startup to
indicate it is back. Actually, the fenced side already does this: it clears
the pending message when sbd is started. It is the fencing side that simply
sleeps unconditionally for msgwait:

        if (mbox_write_verify(st, mbox, s_mbox) < -1) {
                rc = -1; goto out;
        }
        if (strcasecmp(cmd, "exit") != 0) {
                cl_log(LOG_INFO, "Messaging delay: %d",
                                (int)timeout_msgwait);
                sleep(timeout_msgwait);
        }

What if, instead of sleeping, we periodically checked the slot for an
acknowledgement, for at most msgwait seconds? Then we could return earlier.
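To make the idea concrete, here is a rough sketch of such a polling wait. It is not sbd code: wait_for_ack(), the slot_cleared_fn callback and the poll interval are names and parameters I made up for illustration, and cleared_after_n_polls() is only a stand-in for re-reading the target's slot from the shared disk. Whether a cleared slot is a safe enough acknowledgement is exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical callback: returns true once the target's slot no longer
 * holds the message we wrote (e.g. the restarted sbd has cleared it). */
typedef bool (*slot_cleared_fn)(void *ctx);

/*
 * Wait for at most `timeout` seconds, checking the slot every `interval`
 * seconds instead of sleeping for the whole timeout in one go.
 * Returns the number of seconds waited; *acked reports whether the
 * slot was seen cleared before the timeout expired.
 */
static unsigned wait_for_ack(slot_cleared_fn cleared, void *ctx,
                             unsigned timeout, unsigned interval,
                             bool *acked)
{
    unsigned waited = 0;

    assert(interval > 0);
    *acked = false;

    while (waited < timeout) {
        /* Never sleep past the overall timeout. */
        unsigned left = timeout - waited;
        unsigned step = left < interval ? left : interval;

        sleep(step);
        waited += step;

        if (cleared(ctx)) {
            *acked = true;
            break;
        }
    }
    return waited;
}

/* Stand-in for reading the slot from disk: reports "cleared" on the
 * N-th poll, where N is the int that ctx points to. */
static bool cleared_after_n_polls(void *ctx)
{
    int *polls_left = ctx;

    return --(*polls_left) <= 0;
}
```

In the code quoted above, the sleep(timeout_msgwait) call would become something like wait_for_ack(..., timeout_msgwait, 1, &acked). The worst case still takes the full msgwait, so the existing guarantee is unchanged when no acknowledgement shows up.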