>>> "Gao,Yan" <y...@suse.com> wrote on 01.12.2017 at 20:36 in message
<e49f3c0a-6981-3ab4-a0b0-1e5f49f34...@suse.com>:
> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>> 30.11.2017 16:11, Klaus Wenninger wrote:
>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:
>>>>
>>>>>>> "Gao,Yan" <y...@suse.com> wrote on 30.11.2017 at 11:48 in message
>>>> <e71afccc-06e3-97dd-c66a-1b4bac550...@suse.com>:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>>>>>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>>>>>> corosync and forcing STONITH pacemaker was not started after reboot.
>>>>>> In logs I see during boot
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>>>>>> stonith with SBD always takes msgwait (at least, visually host is not
>>>>>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>>>>>> and is up and running long before timeout expires.
>>>> As msgwait was intended for the message to arrive, and not for the reboot
>>>> time (I guess), this just shows a fundamental problem in SBD design: Receipt
>>>> of the fencing command is not confirmed (other than by seeing the
>>>> consequences of its execution).
>>>
>>> The 2 x msgwait is not for confirmations but for writing the poison-pill
>>> and for having it read by the target-side.
>>
>> Yes, of course, but that's not what Ulrich likely intended to say.
>> msgwait must account for worst case storage path latency, while in
>> normal cases it happens much faster. If fenced node could acknowledge
>> having been killed after reboot, stonith agent could return success much
>> earlier.
> How could an alive man be sure he died before? ;)

I meant: There are three delays:
1) The delay until the data is on the disk
2) The delay until the data is read from the disk
3) The delay until the host is killed

A confirmation before 3) could shorten the total wait that includes 2) and 3), right?

Regards,
Ulrich

>
> Regards,
>   Yan
>
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
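
P.S. To make the three delays concrete, here is a rough timing sketch in Python. It is only a model of the idea, not how SBD behaves today: the function names and the sample latencies are invented for illustration, and the only number taken from this thread is the 120s msgwait from Andrei's report. It also assumes the target could confirm receipt once it has read the message, and that the delay until the host is killed has a known upper bound.

```python
# Rough timing model for the three delays discussed in this thread.
# Everything here is illustrative: SBD has no such acknowledgement,
# and the sample latencies are invented. Only msgwait = 120 s is
# taken from the original report.


def fixed_wait(msgwait_s: float) -> float:
    """Behaviour as reported: the fencing side waits the full msgwait."""
    return msgwait_s


def acked_wait(write_s: float, read_s: float, kill_s: float,
               msgwait_s: float) -> float:
    """Hypothetical: the target confirms receipt after delay 2 (read),
    and the fencing side then only allows kill_s for delay 3.
    msgwait stays the upper bound if the confirmation is slow or missing."""
    return min(msgwait_s, write_s + read_s + kill_s)


if __name__ == "__main__":
    msgwait = 120.0                              # from the report
    print(fixed_wait(msgwait))                   # 120.0
    print(acked_wait(1.0, 1.0, 5.0, msgwait))    # 7.0 (made-up latencies)
```

With fast storage the confirmed variant finishes in seconds, while a slow storage path falls back to the full msgwait. Whether the bound on the third delay can actually be guaranteed is exactly the open question.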