Hi Bertrand,

On 3/13/24 11:07, Bertrand Marquis wrote:
> Hi,
> 
>> On 8 Mar 2024, at 15:04, Julien Grall <jul...@xen.org> wrote:
>>
>> Hi John,
>>
>> Thank you for the reply.
>>
>> On 08/03/2024 13:40, John Ernberg wrote:
>>> On 3/7/24 00:07, Julien Grall wrote:
>>>>   > Ping on the watchdog discussion bits.
>>>>
>>>> Sorry for the late reply.
>>>>
>>>> On 06/03/2024 13:13, John Ernberg wrote:
>>>>> On 2/9/24 14:14, John Ernberg wrote:
>>>>>>
>>>>>>>      * IMX_SIP_TIMER_*:  This seems to be related to the watchdog.
>>>>>>> Shouldn't dom0 rely on the watchdog provided by Xen instead? So those
>>>>>>> call will be used by Xen.
>>>>>>
>>>>>> That is indeed a watchdog SIP, and also for setting the SoC internal RTC
>>>>>> if it is being used.
>>>>>>
>>>>>> I looked around if there was previous discussion and only really
>>>>>> found [3].
>>>>>> Is the xen/xen/include/watchdog.h header meant to be for this kind of
>>>>>> watchdog support or is that more for the VM watchdog? Looking at the x86
>>>>>> ACPI NMI watchdog it seems like the former, but I have never worked with
>>>>>> x86 nor ACPI.
>>>>
>>>> include/watchdog.h contains helper to configure the watchdog for Xen. We
>>>> also have per-VM watchdog and this is configured by the hypercall
>>>> SCHEDOP_watchdog.
>>>>
>>>>>>
>>>>>> Currently forwarding it to Dom0 has not caused any watchdog resets with
>>>>>> our watchdog timeout settings, our specific Dom0 setup and VM count.
>>>>
>>>> IIUC, the SMC API for the watchdog would be similar to the ACPI NMI
>>>> watchdog. So I think it would make more sense if this is not exposed to
>>>> dom0 (even if Xen is doing nothing with it).
>>>>
>>>> Can you try to hide the SMCs and check if dom0 still behave properly?
>>>>
>>>> Cheers,
>>>>
>>> This SMC manages a hardware watchdog, if it's not pinged within a
>>> specific interval the entire board resets.
>>
>> Do you know what's the default interval? Is it large enough so Xen + dom0 
>> can boot (at least up to when the watchdog driver is initialized)?
>>
>>> If I block the SMCs the watchdog driver in Dom0 will fail to ping the
>>> watchdog, triggering a board reset because the system looks to have
>>> become unresponsive. The reason this patch set started is because we
>>> couldn't ping the watchdog when running with Xen.
>>> In our specific system the bootloader enables the watchdog as early as
>>> possible so that we can get watchdog protection for as much of the boot
>>> as we possibly can.
>>> So, if we are to block the SMC from Dom0, then Xen needs to take over
>>> the pinging. It could be implemented similarly to the NMI watchdog,
>>> except that the system will reset if the ping is missed rather than
>>> backtrace.
>>> It would also mean that Xen will get a whole watchdog driver-category
>>> due to the watchdog being vendor and sometimes even SoC specific when it
>>> comes to Arm.
>>> My understanding of the domain watchdog code is that today the domain
>>> needs to call SCHEDOP_watchdog at least once to start the watchdog
>>> timer. Since watchdog protection through the whole boot process is
>>> desirable we'd need some core changes, such as an option to start the
>>> domain watchdog on init. >
>>> It's quite a big change to make
>>
>> For clarification, above you seem to mention two changes:
>>
>> 1) Allow Xen to use the HW watchdog
>> 2) Allow the domain to use the watchdog early
>>
>> I am assuming by big change, you are referring to 2?
>>
>> , while I am not against doing it if it
>>> makes sense, I now wonder if Xen should manage hardware watchdogs.
>>> Looking at the domain watchdog code it looks like if a domain does not
>>> get enough execution time, the watchdog will not be pinged enough and
>>> the guest will be reset. So either watchdog approach requires Dom0 to
>>> get execution time. Dom0 also needs to service all the PV backends it's
>>> responsible for. I'm not sure it's valuable to add another layer of
>>> watchdog for this scenario as the end result (checking that the entire
>>> system works) is achieved without it as well.
>>> So, before I try to find the time to make a proposal for moving the
>>> hardware watchdog bit to Xen, do we really want it?
>>
>> Thanks for the details. Given that the watchdog is enabled by the 
>> bootloader, I think we want Xen to drive the watchdog for two reasons:
>> 1) In true dom0less environment, dom0 would not exist
>> 2) You are relying on Xen + Dom0 to boot (or at least enough to get the 
>> watchdog working) within the watchdog interval.
> 
> Definitely we need to consider the case where there is no Dom0.
> 
> I think there are in fact 3 different use cases here:
> - watchdog fully driven in a domain (dom0 or another): would work if it is 
> expected
>    that Xen + Domain boot time is under the watchdog initial refresh rate. I 
> think this
>    could make sense on some applications where your system depends on a 
> specific
>    domain to be properly booted to work. This would require an initial 
> refresh time
>    configurable in the boot loader probably.

This is our use-case. ^

Our dom0 is monitoring and managing the other domains in our system.
Without dom0 working the system isn't really working as a whole.

@Julien: Would you be ok with the patch set continuing in the direction 
of the
original proposal, letting another party (or me at a later time) implement
the fully driven by Xen option?

> - watchdog fully driven by Xen. One might want to just make sure the 
> hypervisor is alive.
> - hybrid model where the watchdog is driven by Xen until a domain comes up to 
> drive it.
>    This could make sense to relax the stress on boot time but would raise the 
> question of
>     what should be done if the domain dies. This is also kind of complex as 
> Xen should stop
>     refreshing the watchdog when a domain starts doing it (might require a 
> trap and emulate
>     initially that is then mapped directly to a domain). I am not completely 
> sure this makes sense.
> 
> Regards
> Bertrand
> 

Thanks and best regards // John Ernberg

Reply via email to