Steffen, did you mean to reply to a different thread? I ask because there's another active one and you haven't been involved in this one yet :)
On Wed, Jan 6, 2021 at 11:01 PM Steffen Vinther Sørensen <svint...@gmail.com> wrote:
>
> No, I don't set that, and it's not showing up in the output of
> 'pcs stonith show --full'.
>
> On Thu, Jan 7, 2021 at 7:46 AM Reid Wahl <nw...@redhat.com> wrote:
> >
> > Do you have pcmk_monitor_timeout set? Ken mentioned that the start
> > operation includes a registration and a status call. There was a bug
> > in which the status call used the pcmk_monitor_timeout (if set), but
> > the start operation that triggered it used the default timeout. So if
> > pcmk_monitor_timeout was long enough (say, 120 seconds) and the
> > status call did not complete by then, the start operation would time
> > out without receiving an action result. This was fixed in pull 2108[1].
> >
> > Also, note that the error message said, "Node h18 did not send start
> > result (via controller) within 45000ms (action timeout plus
> > cluster-delay)". The start operation began at 09:29:29 and ended at
> > 09:31:14 (1min45sec). That's because cluster-delay is 60s by default
> > and there was a logging bug[2] at one point: the time value in the
> > log message did not add the cluster-delay.
> >
> > [1] https://github.com/ClusterLabs/pacemaker/pull/2108
> > [2] https://github.com/ClusterLabs/pacemaker/commit/b542a8f
> >
> > On Fri, Dec 18, 2020 at 8:15 AM Ken Gaillot <kgail...@redhat.com> wrote:
> > >
> > > On Fri, 2020-12-18 at 13:32 +0100, Ulrich Windl wrote:
> > > > >>> Andrei Borzenkov <arvidj...@gmail.com> wrote on 18.12.2020 at
> > > > >>> 12:17 in message <b82fc4d8-689c-4357-8f22-adc957fa6...@gmail.com>:
> > > > > On 18.12.2020 12:00, Ulrich Windl wrote:
> > > > > >
> > > > > > Maybe a related question: Do STONITH resources have special
> > > > > > rules, meaning they don't wait for successful fencing?
> > > > >
> > > > > Pacemaker resources in the CIB do not perform fencing. They only
> > > > > register fencing devices with fenced, which does the actual job.
> > > > > In particular ...
> > > > >
> > > > > > I saw this between fencing being initiated and fencing being
> > > > > > confirmed (h16 was DC, now h18 became DC):
> > > > > >
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Processing graph 0 (ref=pe_calc-dc-1608280169-21) derived from /var/lib/pacemaker/pengine/pe-warn-9.bz2
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Requesting fencing (reboot) of node h16
> > > > > > Dec 18 09:29:29 h18 pacemaker-controld[4479]: notice: Initiating start operation prm_stonith_sbd_start_0 locally on h18
> > > > >
> > > > > A "start" operation on a pacemaker stonith resource only registers
> > > > > the device with fenced. It does *not* initiate a stonith operation.
> > > >
> > > > Hi!
> > > >
> > > > Thanks, it's quite confusing: "notice: Initiating start operation"
> > > > sounds like something is to be started right now; if it's just
> > > > scheduled, "notice: Queueing start operation" or "notice: Planning
> > > > start operation" would be a better phrase IMHO.
> > >
> > > I think what Andrei is suggesting is that the start is not a fencing
> > > (reboot/off) action, just a registration of the device with the fencer.
> > > The start (registration) itself does happen at this point. Besides
> > > registration, it also involves a status call to the device.
> > >
> > > > > > ...
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: Node h18 did not send start result (via controller) within 45000ms (action timeout plus cluster-delay)
> > > > >
> > > > > I am not sure what happens here. Somehow fenced took a very long
> > > > > time to respond, or something went wrong in the communication
> > > > > between them.
> > > >
> > > > This looks new in the current pacemaker.
> > > > As explained in an earlier message, we use a rather long
> > > > fencing/stonith timeout, so the confirmation may arrive rather late
> > > > (but still before the node comes online again). I didn't see this in
> > > > comparable configurations using older pacemaker.
> > >
> > > Only the message is new ... previously it was something inscrutable
> > > like "Timer popped".
> > >
> > > Actions involve multiple steps across multiple daemons, so there are
> > > multiple places where something could time out. In this case, the
> > > controller didn't get a result from the fencer in time.
> > >
> > > Normally, this should never happen, because if the action itself times
> > > out, the fencer will report that.
> > >
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: error: [Action 22]: In-flight resource op prm_stonith_sbd_start_0 on h18 (priority: 9900, waiting: (null))
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: notice: Transition 0 aborted: Action lost
> > > > > > Dec 18 09:31:14 h18 pacemaker-controld[4479]: warning: rsc_op 22: prm_stonith_sbd_start_0 on h18 timed out
> > >
> > > The above is just the consequence of not getting the start result --
> > > the controller considers the action timed out and aborts the current
> > > transition.
> > >
> > > > > > ...
> > > > > > Dec 18 09:31:15 h18 pacemaker-controld[4479]: notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.4527: OK
> > > > >
> > > > > Here the fencing action returns success.
> > > > >
> > > > > > Dec 18 09:31:17 h18 pacemaker-execd[4476]: notice: prm_stonith_sbd start (call 164) exited with status 0 (execution time 110960ms, queue time 15001ms)
> > >
> > > The device registration already in progress does eventually complete
> > > successfully. The controller has already considered it lost and moved
> > > on, so this result will be ignored.
> > >
> > > Looking at the queue time, I'm going to guess that what happened was
> > > that the status action that's part of the start was queued behind the
> > > reboot action in progress, and the queue time was enough to push it
> > > over the total expected timeout.
> >
> > Looks plausible. Though it's strange to me that the queue time is only
> > 15 seconds, that the reboot operation completes immediately after the
> > start operation fails, and that the executor then says the start
> > completed successfully after 110s of execution time :/
>
> > > This is a known issue when fence actions are serialized. The fencer
> > > starts its timeout once the action is begun (i.e. after queueing), but
> > > the controller doesn't know about the queueing and starts its timeout
> > > once it has submitted the request to the fencer (i.e. before queueing).
> > >
> > > Red Hat BZ#1858462 is related, but there's no ClusterLabs BZ at this
> > > point. It's a tricky problem, so it will be a significant project.
> > >
> > > > > It could be related to pending fencing, but I am not familiar with
> > > > > the low-level details.
> > > >
> > > > It looks odd: first "started", then timed out with error, then
> > > > successful (without being rescheduled, it seems).
> > > >
> > > > > > ...
> > > > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Peer h16 was terminated (reboot) by h19 on behalf of pacemaker-controld.4479: OK
> > > > > > Dec 18 09:31:30 h18 pacemaker-controld[4479]: notice: Transition 0 (Complete=31, Pending=0, Fired=0, Skipped=1, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-9.bz2): Stopped
> > > >
> > > > So here's the delayed stonith confirmation.
> > > >
> > > > > > ...
> > > > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: warning: Unexpected result (error) was recorded for start of prm_stonith_sbd on h18 at Dec 18 09:31:14 2020
> > > > > > Dec 18 09:31:30 h18 pacemaker-schedulerd[4478]: notice: * Recover prm_stonith_sbd ( h18 )
> > > >
> > > > Then after the successful start, another "recovery". Isn't that very odd?
> > >
> > > The first start was considered timed out, even though it eventually
> > > completed successfully, so the device had to be recovered (stop+start)
> > > due to the start timeout.
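Ken's serialization explanation can be put into numbers using the figures quoted from the logs above (a rough sketch only; the 45s action timeout and 60s default cluster-delay come from the controller's error message, and the queue/execution times from the pacemaker-execd message):

```shell
#!/bin/sh
# Controller's budget for the start result: action timeout + cluster-delay
action_timeout_ms=45000
cluster_delay_ms=60000               # cluster-delay defaults to 60s
budget_ms=$((action_timeout_ms + cluster_delay_ms))
echo "controller budget: ${budget_ms}ms"     # 105000ms = 1min45s

# What the start actually took on the fencer side, per pacemaker-execd
queue_ms=15001                       # queued behind the in-progress reboot
exec_ms=110960
actual_ms=$((queue_ms + exec_ms))
echo "fencer-side total: ${actual_ms}ms"     # 125961ms

# The controller's timer starts before queueing, so the start overshoots
# its budget and is declared lost before the successful result arrives.
if [ "$actual_ms" -gt "$budget_ms" ]; then
    echo "start exceeds controller budget by $((actual_ms - budget_ms))ms"
fi
```

This matches the sequence in the logs: the controller gives up at 09:31:14 (105s after 09:29:29), while the execd does not report the successful start until 09:31:17.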
> > > >
> > > > Regards,
> > > > Ulrich
> > > --
> > > Ken Gaillot <kgail...@redhat.com>
> > >
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/

--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
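As a footnote, Reid's timing observation earlier in the thread (the start began at 09:29:29 and ended at 09:31:14, i.e. the action timeout plus the cluster-delay that the old log message failed to mention) checks out arithmetically. A quick sketch, assuming GNU date; the timestamps are taken from the quoted logs:

```shell
#!/bin/sh
# Elapsed wall-clock time between "Initiating start operation" and
# "did not send start result" in the quoted controller logs
start_s=$(date -u -d '2020-12-18 09:29:29' +%s)
end_s=$(date -u -d '2020-12-18 09:31:14' +%s)
elapsed_s=$((end_s - start_s))
echo "elapsed: ${elapsed_s}s"   # 105s = 45s action timeout + 60s cluster-delay
```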