Re: [ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work
On Tue, Jun 7, 2022 at 7:53 AM Ulrich Windl wrote:
>
> >>> Andrei Borzenkov wrote on 03.06.2022 at 17:04 in
> message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> > On 03.06.2022 16:51, Zoran Bošnjak wrote:
> >> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is
> >> indeed OK. I was first experimenting with "softdog", which is
> >> blacklisted. So the reasonable question is how to properly start
> >> "softdog" on ubuntu.
> >>
> >
> > blacklist prevents autoloading of modules by alias during hardware
> > detection. Neither softdog nor ipmi_watchdog has any alias, so they
> > cannot be autoloaded and blacklist is irrelevant here.
> >
> >> The reason to unload watchdog module (ipmi or softdog) is that there
> >> seems to be a difference between normal reboot and watchdog reboot.
> >> In case of ipmi watchdog timer reboot:
> >> - the system hangs at the end of reboot cycle for some time
> >> - restart seems to be harder (like power off/on cycle), BIOS runs
> >>   more diagnostics at startup
>
> maybe kdump is enabled in that case?
>
> >> - it turns on HW diagnostic indication on the server front panel
> >>   (dell server) which stays on forever
> >> - it logs the event to IDRAC, which is unnecessary, because it was
> >>   not a hardware event, but just a normal reboot
>
> If the hardware watchdog times out and fires, it is considered to be an
> exceptional event that will be logged and reported.
>
> >>
> >> In case of "sudo reboot" command, I would like to skip this... so the
> >> idea is to fully stop the watchdog just before reboot. I am not sure
> >> how to do this properly.
> >>
> >> The "softdog" is better in this respect. It does not trigger anything
> >> from the list above, but I still get the message during reboot
> >> [ ... ] watchdog: watchdog0: watchdog did not stop!
> >> ... with some small timeout.
> >>
> >
> > The first obvious question - is there only one watchdog? Some watchdog
> > drivers *are* autoloaded.
> >
> > Is there only one user of watchdog? systemd may use it too as example.
>
> Don't mix timers with a watchdog: It makes little sense to have multiple
> watchdogs enabled IMHO.

Yep, that is an issue atm. When you have multiple users of a hardware
watchdog (watchdog-daemon, sbd, corosync, systemd, ...), I'm not aware of
an implementation that would provide multiple watchdog timers with the
usual char-device interface out of one physical watchdog.
Of course this should be relatively easy to implement, even in user space.
On our embedded devices we usually had something like a service that would
offer multiple timers to other instances. The implementation of that
service itself was guarded by a hardware watchdog, so that the derived
timers would be as reliable as a hardware watchdog.
The last implementation was built into watchdog-daemon and offered a dbus
interface.

What systemd has implemented is similarly interesting.
The current systemd implementation has a suspicious loop around it that
prevents it from being fit for sbd purposes, as it doesn't guarantee a
reboot within a reasonably short time.
This is why I haven't yet implemented the systemd file-descriptor approach
in sbd (as a configurable alternative to going for the device directly).
Approaching the systemd guys and asking why it is implemented as it is has
been on my todo list for a while now.

If you are running multiple services on a host that don't offer something
like a common supervision main loop, it may make sense to offer a common
instance that provides something like a watchdog service.
For a node that has all services under pacemaker control this shouldn't be
needed, as we have sbd observing pacemakerd. Pacemakerd in turn observes
the other pacemaker subdaemons (released with RHEL-8.6 and iirc 2.1.3
upstream), guaranteeing that the monitors on the resources don't get stuck.
Klaus

> >
> >> So after some additional testing, the situation is the following:
> >>
> >> - without any watchdog and without sbd package, the server reboots
> >>   normally
> >> - with "softdog" module loaded, I only get "watchdog did not stop"
> >>   message at reboot
> >> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot
> >>   is normal again
> >> - same as above, but with "sbd" package loaded, I am getting "watchdog
> >>   did not stop" message again
> >> - switching from "softdog" to "ipmi_watchdog" gets me to the original
> >>   list of problems
> >>
> >> It looks like the "sbd" is preventing the watchdog to close, so that
> >> watchdog triggers always, even in the case of normal reboot. What am I
> >> missing here?
>
> The watchdog may have a "no way out" parameter that prevents disabling it
> after enabled once.
>
> >
> > While the only way I can reproduce it on my QEMU VM is "reboot -f"
> > (without stopping all services), there is certainly a race condition in
> > sbd.service.
> >
> > ExecStop=@bindir@/kill -TERM $MAINPID
> >
> >
> > systemd will continue as soon as "kill" completes without waiting for
> > sbd
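
The service Klaus describes above (several software timers derived from
one physical watchdog) can be sketched roughly as follows. This is only an
illustration of the idea, not the watchdog-daemon implementation he
mentions: the class name, method names and client names are made up here,
and the actual write to the hardware watchdog device is left to the caller.

```python
import time


class WatchdogMultiplexer:
    """Derive several software timers from one physical watchdog (sketch).

    Hypothetical helper: each client registers a timeout and must call
    kick() regularly. The hardware watchdog may only be petted while no
    client timer has expired, so one stuck client lets the hardware fire.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._clients = {}  # client name -> (timeout in seconds, last kick)

    def register(self, name, timeout_s):
        """Add a client timer; it starts counting from now."""
        self._clients[name] = (timeout_s, self._clock())

    def kick(self, name):
        """Record that the named client is still alive."""
        timeout_s, _ = self._clients[name]
        self._clients[name] = (timeout_s, self._clock())

    def expired(self):
        """Return the sorted names of clients that missed their deadline."""
        now = self._clock()
        return sorted(
            name
            for name, (timeout_s, last) in self._clients.items()
            if now - last > timeout_s
        )

    def may_pet_hardware(self):
        """True only while every registered client is within its timeout."""
        return not self.expired()
```

A supervising loop would call may_pet_hardware() shortly before each
hardware timeout and write to the watchdog device only if it returns True.
The loop itself stays guarded by the hardware watchdog, which is what makes
the derived timers as reliable as the physical one.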
[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work
>>> Andrei Borzenkov wrote on 03.06.2022 at 17:04 in
message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> On 03.06.2022 16:51, Zoran Bošnjak wrote:
>> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is
>> indeed OK. I was first experimenting with "softdog", which is
>> blacklisted. So the reasonable question is how to properly start
>> "softdog" on ubuntu.
>>
>
> blacklist prevents autoloading of modules by alias during hardware
> detection. Neither softdog nor ipmi_watchdog has any alias, so they
> cannot be autoloaded and blacklist is irrelevant here.
>
>> The reason to unload watchdog module (ipmi or softdog) is that there
>> seems to be a difference between normal reboot and watchdog reboot.
>> In case of ipmi watchdog timer reboot:
>> - the system hangs at the end of reboot cycle for some time
>> - restart seems to be harder (like power off/on cycle), BIOS runs more
>>   diagnostics at startup

maybe kdump is enabled in that case?

>> - it turns on HW diagnostic indication on the server front panel (dell
>>   server) which stays on forever
>> - it logs the event to IDRAC, which is unnecessary, because it was not a
>>   hardware event, but just a normal reboot

If the hardware watchdog times out and fires, it is considered to be an
exceptional event that will be logged and reported.

>>
>> In case of "sudo reboot" command, I would like to skip this... so the
>> idea is to fully stop the watchdog just before reboot. I am not sure how
>> to do this properly.
>>
>> The "softdog" is better in this respect. It does not trigger anything
>> from the list above, but I still get the message during reboot
>> [ ... ] watchdog: watchdog0: watchdog did not stop!
>> ... with some small timeout.
>>
>
> The first obvious question - is there only one watchdog? Some watchdog
> drivers *are* autoloaded.
>
> Is there only one user of watchdog? systemd may use it too as example.

Don't mix timers with a watchdog: It makes little sense to have multiple
watchdogs enabled IMHO.
>
>> So after some additional testing, the situation is the following:
>>
>> - without any watchdog and without sbd package, the server reboots
>>   normally
>> - with "softdog" module loaded, I only get "watchdog did not stop"
>>   message at reboot
>> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
>>   normal again
>> - same as above, but with "sbd" package loaded, I am getting "watchdog
>>   did not stop" message again
>> - switching from "softdog" to "ipmi_watchdog" gets me to the original
>>   list of problems
>>
>> It looks like the "sbd" is preventing the watchdog to close, so that
>> watchdog triggers always, even in the case of normal reboot. What am I
>> missing here?

The watchdog may have a "no way out" parameter that prevents disabling it
once it has been enabled.

>
> While the only way I can reproduce it on my QEMU VM is "reboot -f"
> (without stopping all services), there is certainly a race condition in
> sbd.service.
>
> ExecStop=@bindir@/kill -TERM $MAINPID
>
>
> systemd will continue as soon as "kill" completes without waiting for
> sbd to actually stop. It means systemd may complete shutdown sequence
> before sbd has had a chance to react to the signal and then simply kill
> it. Which leaves watchdog armed.
>
> For test purpose try to use script that loops until sbd is actually
> stopped for ExecStop.
>
> Note that systemd strongly recommends to use synchronous command for
> ExecStop (we may argue that this should be handled by service manager
> itself, but well ...).
>
>>
>> Zoran
>>
>> - Original Message -
>> From: "Andrei Borzenkov"
>> To: "users"
>> Sent: Friday, June 3, 2022 11:24:03 AM
>> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
>>
>> On 03.06.2022 11:18, Zoran Bošnjak wrote:
>>> Hi all,
>>> I would appreciate advice about sbd fencing (without shared storage).
>>>
>>> I am using ubuntu 20.04., with default packages from the repository
>>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
>>>
>>> HW watchdog is present on servers. The first problem was to load/unload
>>> the watchdog module. For some reason the module is blacklisted on ubuntu,
>>
>> What makes you think so?
>>
>> bor@bor-Latitude-E5450:~$ lsb_release -d
>> Description: Ubuntu 20.04.4 LTS
>> bor@bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
>> bor@bor-Latitude-E5450:~$
>>
>>> so I've created a service for this purpose.
>>>
>>
>> man modules-load.d
>>
>>> --- file: /etc/systemd/system/watchdog.service
>>> [Unit]
>>> Description=Load watchdog timer module
>>> After=syslog.target
>>>
>>
>> Without any explicit dependencies stop will be attempted as soon as
>> possible.
>>
>>> [Service]
>>> Type=oneshot
>>> RemainAfterExit=yes
>>> ExecStart=/sbin/modprobe ipmi_watchdog
>>> ExecStop=/sbin/rmmod ipmi_watchdog
>>>
>>
>> Why on earth do you need to unload kernel driver when system reboots?
>>
>>> [Install]
>>>
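
Andrei's suggestion in the message above (use a stop command that loops
until sbd has actually stopped, instead of a bare "kill") could look
roughly like this. It is a sketch for test purposes only: the function
names, the timeout and the polling interval are assumptions, and nothing
here is taken from the sbd package.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID still exists.

    Signal 0 performs only the existence and permission check. Note that
    a process which has exited but has not yet been reaped by its parent
    still counts as alive.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def stop_and_wait(pid, timeout_s=30.0, poll_s=0.1):
    """Send SIGTERM, then block until the process is gone or time is up.

    Returns True if the process has exited, False if it is still running
    after timeout_s seconds (both values are arbitrary assumptions).
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll_s)
    return not pid_alive(pid)
```

Wrapped in a small script that passes $MAINPID to stop_and_wait(), this
makes the stop command synchronous: it returns only after sbd has had the
chance to react to the signal, so systemd does not move on while the
watchdog device is still held open.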
[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work
>>> Klaus Wenninger wrote on 03.06.2022 at 11:03:
> On Fri, Jun 3, 2022 at 10:19 AM Zoran Bošnjak wrote:
...
> still opened by sbd. In general I don't see why the watchdog-module should
> be unloaded upon shutdown. So as a first try you just might remove that

Specifically if the actual watchdog is a hardware timer that isn't stopped
when the module is unloaded.

> part.
>
> Klaus
>
>>
>> regards,
>> Zoran

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
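
For background on the "watchdog did not stop!" message discussed in this
thread: with the Linux watchdog character device, a driver that supports
"magic close" only disarms the timer if the character 'V' is written just
before the device is closed, and with "nowayout" set it will not disarm at
all. A minimal sketch of that close sequence follows. The function name is
made up, it takes an already opened file descriptor, and it is not how sbd
itself is implemented.

```python
import os

# Character defined by the Linux watchdog API for "magic close".
MAGIC_CLOSE = b"V"


def close_watchdog(fd, disarm=True):
    """Close a watchdog file descriptor, optionally disarming it first.

    With disarm=True the magic character is written right before close(),
    which asks the driver to stop the timer (ignored if nowayout is set).
    With disarm=False the descriptor is simply closed and the timer keeps
    running, which is the case the kernel reports as "did not stop".
    """
    if disarm:
        os.write(fd, MAGIC_CLOSE)
    os.close(fd)
```

If a process holding the device is killed before it can run this sequence,
the timer stays armed, which matches the race described earlier in the
thread.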