
Re: [ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work

2022-06-07 Thread Klaus Wenninger

On Tue, Jun 7, 2022 at 7:53 AM Ulrich Windl wrote:
>
> >>> Andrei Borzenkov wrote on 03.06.2022 at 17:04 in
> message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> > On 03.06.2022 16:51, Zoran Bošnjak wrote:
> >> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
> >> OK. I was first experimenting with "softdog", which is blacklisted. So the
> >> reasonable question is how to properly start "softdog" on ubuntu.
> >>
> >
> > blacklist prevents autoloading of modules by alias during hardware
> > detection. Neither softdog nor ipmi_watchdog has any alias, so they
> > cannot be autoloaded and blacklist is irrelevant here.
> >
> >> The reason to unload watchdog module (ipmi or softdog) is that there seems
> >> to be a difference between normal reboot and watchdog reboot.
> >> In case of ipmi watchdog timer reboot:
> >> - the system hangs at the end of reboot cycle for some time
> >> - restart seems to be harder (like power off/on cycle), BIOS runs more
> >>   diagnostics at startup
>
> maybe kdump is enabled in that case?
>
> >> - it turns on HW diagnostic indication on the server front panel (dell
> >>   server) which stays on forever
> >> - it logs the event to IDRAC, which is unnecessary, because it was not a
> >>   hardware event, but just a normal reboot
>
> If the hardware watchdog times out and fires, it is considered to be an
> exceptional event that will be logged and reported.
>
> >>
> >> In case of "sudo reboot" command, I would like to skip this... so the idea
> >> is to fully stop the watchdog just before reboot. I am not sure how to do
> >> this properly.
> >>
> >> The "softdog" is better in this respect. It does not trigger anything from
> >> the list above, but I still get the message during reboot
> >> [ ... ] watchdog: watchdog0: watchdog did not stop!
> >> ... with some small timeout.
> >>
> >
> > The first obvious question - is there only one watchdog? Some watchdog
> > drivers *are* autoloaded.
> >
> > Is there only one user of watchdog? systemd may use it too as example.
>
> Don't mix timers with a watchdog: It makes little sense to have multiple
> watchdogs enabled IMHO.

Yep, that is an issue at the moment when you have multiple users of a
hardware watchdog, such as watchdog-daemon, sbd, corosync, systemd, ...

I'm not aware of an implementation that would provide multiple watchdog timers
with the usual char-device interface out of one physical watchdog.
Of course this should be relatively easy to implement, even in user space.
On our embedded devices we usually had something like a service that
would offer multiple timers to other instances.
That service itself was guarded by a hardware watchdog
so that the derived timers would be as reliable as a hardware watchdog.
The last implementation was built into watchdog-daemon and offered a D-Bus
interface.
What systemd has implemented is similarly interesting.
The current systemd implementation has a suspicious loop around it that
prevents it from being fit for sbd purposes, as it doesn't guarantee a reboot
within a reasonably short time.
This is why I haven't yet implemented the systemd file-descriptor approach
in sbd (as a configurable alternative to going for the device directly).
Approaching the systemd guys and asking why it is implemented as it is has
been on my todo list for a while now.
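
To illustrate the idea of deriving several timers from one physical watchdog,
here is a minimal user-space sketch. It is not the watchdog-daemon or systemd
implementation; the class name, the method names and the injected
pet_hardware callback are all hypothetical. The service keeps petting the
hardware only while every registered client has kicked its own timer in time.

```python
import time
from typing import Callable, Dict, List


class WatchdogMultiplexer:
    """Derive several software timers from one hardware watchdog (sketch)."""

    def __init__(self, pet_hardware: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # pet_hardware is an assumption: e.g. a function that writes to the
        # already opened watchdog device to reset the hardware timer.
        self._pet_hardware = pet_hardware
        self._clock = clock
        self._timeouts: Dict[str, float] = {}
        self._deadlines: Dict[str, float] = {}

    def register(self, client: str, timeout: float) -> None:
        """Add a client timer; it starts counting immediately."""
        self._timeouts[client] = timeout
        self._deadlines[client] = self._clock() + timeout

    def kick(self, client: str) -> None:
        """Called by a client to signal that it is still healthy."""
        self._deadlines[client] = self._clock() + self._timeouts[client]

    def expired(self) -> List[str]:
        """Return the names of all clients whose timer has run out."""
        now = self._clock()
        return sorted(c for c, d in self._deadlines.items() if d <= now)

    def tick(self) -> bool:
        """Pet the hardware only while no client timer has expired."""
        if self.expired():
            return False  # stop petting: the hardware watchdog will fire
        self._pet_hardware()
        return True
```

The main loop of such a service would call tick() at an interval well below
the hardware timeout. Because the loop itself is what pets the hardware, a
hang of the service is caught by the hardware watchdog as well.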

If you are running multiple services on a host that don't offer something
like a common supervision main loop, it may make sense to offer a common
instance that provides something like a watchdog service.
For a node that has all services under pacemaker control this shouldn't be
needed, as we have sbd observing pacemakerd. Pacemakerd in turn
observes the other pacemaker subdaemons (released with RHEL-8.6 and,
iirc, 2.1.3 upstream), guaranteeing that the monitors on the resources don't
get stuck.

Klaus
>
> >
> >> So after some additional testing, the situation is the following:
> >>
> >> - without any watchdog and without sbd package, the server reboots
> >>   normally
> >> - with "softdog" module loaded, I only get the "watchdog did not stop"
> >>   message at reboot
> >> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
> >>   normal again
> >> - same as above, but with "sbd" package loaded, I am getting the "watchdog
> >>   did not stop" message again
> >> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
> >>   of problems
> >>
> >> It looks like the "sbd" is preventing the watchdog from closing, so that the
> >> watchdog always triggers, even in the case of normal reboot. What am I
> >> missing here?
>
> The watchdog may have a "no way out" parameter that prevents disabling it
> once it has been enabled.
>
> >
> > While the only way I can reproduce it on my QEMU VM is "reboot -f"
> > (without stopping all services), there is certainly a race condition in
> > sbd.service.
> >
> > ExecStop=@bindir@/kill -TERM $MAINPID
> >
> >
> > systemd will continue as soon as "kill" completes without waiting for
> > sbd

[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work

2022-06-06 Thread Ulrich Windl

>>> Andrei Borzenkov wrote on 03.06.2022 at 17:04 in
message <99f7746a-c962-33bb-6737-f88ba0128...@gmail.com>:
> On 03.06.2022 16:51, Zoran Bošnjak wrote:
>> Thanks for all your answers. Sorry, my mistake. The ipmi_watchdog is indeed
>> OK. I was first experimenting with "softdog", which is blacklisted. So the
>> reasonable question is how to properly start "softdog" on ubuntu.
>> 
> 
> blacklist prevents autoloading of modules by alias during hardware
> detection. Neither softdog nor ipmi_watchdog has any alias, so they
> cannot be autoloaded and blacklist is irrelevant here.
> 
>> The reason to unload watchdog module (ipmi or softdog) is that there seems
>> to be a difference between normal reboot and watchdog reboot.
>> In case of ipmi watchdog timer reboot:
>> - the system hangs at the end of reboot cycle for some time
>> - restart seems to be harder (like power off/on cycle), BIOS runs more
>>   diagnostics at startup

maybe kdump is enabled in that case?

>> - it turns on HW diagnostic indication on the server front panel (dell
>>   server) which stays on forever
>> - it logs the event to IDRAC, which is unnecessary, because it was not a
>>   hardware event, but just a normal reboot

If the hardware watchdog times out and fires, it is considered to be an
exceptional event that will be logged and reported.

>> 
>> In case of "sudo reboot" command, I would like to skip this... so the idea
>> is to fully stop the watchdog just before reboot. I am not sure how to do
>> this properly.
>>
>> The "softdog" is better in this respect. It does not trigger anything from
>> the list above, but I still get the message during reboot
>> [ ... ] watchdog: watchdog0: watchdog did not stop!
>> ... with some small timeout.
>> 
> 
> The first obvious question - is there only one watchdog? Some watchdog
> drivers *are* autoloaded.
> 
> Is there only one user of watchdog? systemd may use it too as example.

Don't mix timers with a watchdog: It makes little sense to have multiple
watchdogs enabled IMHO.
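
To answer the "is there only one user" question on a running system, one
option is to look for processes that hold a watchdog device open. The sketch
below walks the fd symlinks under /proc; the function name and both parameters
are made up for illustration, and it needs to run as root to see the file
descriptors of other users' processes.

```python
import os
from typing import Dict, List


def watchdog_users(proc_root: str = "/proc",
                   prefix: str = "/dev/watchdog") -> Dict[int, List[str]]:
    """Map PID -> watchdog device paths that the process has open."""
    users: Dict[int, List[str]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed in the meantime
            if target.startswith(prefix):
                users.setdefault(int(entry), []).append(target)
    return users
```

If systemd's own watchdog support is in use, PID 1 would be expected to show
up in the result next to sbd.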

> 
>> So after some additional testing, the situation is the following:
>>
>> - without any watchdog and without sbd package, the server reboots
>>   normally
>> - with "softdog" module loaded, I only get the "watchdog did not stop"
>>   message at reboot
>> - with "softdog" loaded, but unloaded with "ExecStop=...rmmod", reboot is
>>   normal again
>> - same as above, but with "sbd" package loaded, I am getting the "watchdog
>>   did not stop" message again
>> - switching from "softdog" to "ipmi_watchdog" gets me to the original list
>>   of problems
>>
>> It looks like the "sbd" is preventing the watchdog from closing, so that the
>> watchdog always triggers, even in the case of normal reboot. What am I
>> missing here?

The watchdog may have a "no way out" parameter that prevents disabling it
once it has been enabled.
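
A quick way to check this, as a sketch: on kernels built with watchdog sysfs
support, devices handled by the watchdog core expose a "nowayout" attribute
under /sys/class/watchdog/watchdogN/. Whether a given driver (ipmi_watchdog in
particular) shows up there is an assumption to verify on the machine; the
helper name and the sys_root parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_nowayout(device: str = "watchdog0",
                  sys_root: str = "/sys") -> Optional[bool]:
    """Return the nowayout state of a watchdog device, or None if unknown."""
    attr = Path(sys_root) / "class" / "watchdog" / device / "nowayout"
    try:
        value = attr.read_text().strip()
    except OSError:
        return None  # attribute missing: no sysfs support or no such device
    return value == "1"
```

A result of True would mean the timer keeps running after the device is
closed, so stopping sbd cleanly would not be enough to avoid the reset.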

> 
> While the only way I can reproduce it on my QEMU VM is "reboot -f"
> (without stopping all services), there is certainly a race condition in
> sbd.service.
> 
> ExecStop=@bindir@/kill -TERM $MAINPID
> 
> 
> systemd will continue as soon as "kill" completes without waiting for
> sbd to actually stop. It means systemd may complete the shutdown sequence
> before sbd had a chance to react to the signal and then simply kill it. Which
> leaves the watchdog armed.
> 
> For test purposes, try using a script that loops until sbd has actually
> stopped as ExecStop.
> 
> Note that systemd strongly recommends using a synchronous command for
> ExecStop (we may argue that this should be handled by the service manager
> itself, but well ...).
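
The suggested test script could look roughly like the sketch below: send
SIGTERM, then poll until the process is gone or a timeout expires. This is not
part of sbd; the function names, the default timeout and the injectable
helpers (there only to make the loop testable) are assumptions.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Return True if the process exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def stop_and_wait(pid: int, timeout: float = 30.0, poll: float = 0.2,
                  send: Callable[[int, int], None] = os.kill,
                  alive: Callable[[int], bool] = pid_alive,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Send SIGTERM and block until the process has exited.

    Returns True if the process went away, False on timeout.
    """
    send(pid, signal.SIGTERM)
    deadline = clock() + timeout
    while alive(pid):
        if clock() >= deadline:
            return False  # still running: caller decides what to do next
        sleep(poll)
    return True
```

A small wrapper that passes the value of $MAINPID to stop_and_wait() and is
used as ExecStop would make the stop step synchronous, so systemd would not
move on while sbd still holds the watchdog device open.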
> 
>> 
>> Zoran
>> 
>> - Original Message -
>> From: "Andrei Borzenkov" 
>> To: "users" 
>> Sent: Friday, June 3, 2022 11:24:03 AM
>> Subject: Re: [ClusterLabs] normal reboot with active sbd does not work
>> 
>> On 03.06.2022 11:18, Zoran Bošnjak wrote:
>>> Hi all,
>>> I would appreciate advice about sbd fencing (without shared storage).
>>>
>>> I am using ubuntu 20.04., with default packages from the repository
>>> (pacemaker, corosync, fence-agents, ipmitool, pcs...).
>>>
>>> HW watchdog is present on servers. The first problem was to load/unload the
>>> watchdog module. For some reason the module is blacklisted on ubuntu,
>> 
>> What makes you think so?
>> 
>> bor@bor-Latitude-E5450:~$ lsb_release  -d
>> Description: Ubuntu 20.04.4 LTS
>> bor@bor-Latitude-E5450:~$ modprobe -c | grep ipmi_watchdog
>> bor@bor-Latitude-E5450:~$
>> 
>>> so I've created a service for this purpose.
>>>
>> 
>> man modules-load.d
>> 
>> 
>>> --- file: /etc/systemd/system/watchdog.service
>>> [Unit]
>>> Description=Load watchdog timer module
>>> After=syslog.target
>>>
>> 
>> Without any explicit dependencies stop will be attempted as soon as
>> possible.
>> 
>>> [Service]
>>> Type=oneshot
>>> RemainAfterExit=yes
>>> ExecStart=/sbin/modprobe ipmi_watchdog
>>> ExecStop=/sbin/rmmod ipmi_watchdog
>>>
>> 
>> Why on earth do you need to unload kernel driver when system reboots?
>> 
>>> [Install]
>>>

[ClusterLabs] Antw: [EXT] Re: normal reboot with active sbd does not work

2022-06-03 Thread Ulrich Windl

>>> Klaus Wenninger wrote on 03.06.2022 at 11:03:
> On Fri, Jun 3, 2022 at 10:19 AM Zoran Bošnjak  wrote:
...
> still opened by sbd. In general I don't see why the watchdog-module should
> be unloaded upon shutdown. So as a first try you just might remove that 

Specifically if the actual watchdog is a hardware timer that isn't stopped when
the module is unloaded.

> part.
> 
> Klaus
> 
>>
>> regards,
>> Zoran
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>
