On Wed, Aug 13, 2025 at 09:26:09AM +0200, Roger Pau Monné wrote: > On Wed, Aug 13, 2025 at 04:53:53AM +0200, Marek Marczykowski-Górecki wrote: > > > > > > On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper > > <andrew.coop...@citrix.com> wrote: > > >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote: > > >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki > > >> wrote: > > >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: > > >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > > >>>>> We've got several reports that S3 reliability recently regressed. We > > >>>>> identified it's definitely related to XSA-471 patches, and bisection > > >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have > > >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this > > >>>>> patch, or maybe an earlier one - but it's definitely already broken at > > >>>>> this point in the series. Most reports are about Xen 4.17 (as that's > > >>>>> what stable Qubes OS version currently use), but I think I've seen > > >>>>> somebody reporting the issue on 4.19 too (but I don't have clear > > >>>>> evidence, especially if it's the same issue). > > >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ > > >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo > > >>>> shouldn't cause problems only after resume, but then it may have > > >>>> covered > > >>>> for some omission during resume. As a far-fetched experiment, could you > > >>>> try putting that back (including the calculation of the "expires" local > > >>>> variable)? > > >>> Sure, I'll try. > > >> It appears this fixes the issue, at least in ~10 attempts so far > > >> (usually I could reproduce the issue after 2-3 attempts). > > >> > > > > > >Can you show the exact code which seems to have made this stable? > > > > This patch: > > https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch > > Hello, > > Can you test if the patch below in isolation makes any difference?
Seems to help too. At least a test similar as before did not hit the issue anymore. > Thanks, Roger. > --- > diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c > index 2ac162c997fe..27d672ad5dbb 100644 > --- a/xen/arch/x86/acpi/power.c > +++ b/xen/arch/x86/acpi/power.c > @@ -19,6 +19,7 @@ > #include <xen/iommu.h> > #include <xen/param.h> > #include <xen/sched.h> > +#include <xen/softirq.h> > #include <xen/spinlock.h> > #include <xen/watchdog.h> > > @@ -310,6 +311,7 @@ static int enter_state(u32 state) > thaw_domains(); > system_state = SYS_STATE_active; > spin_unlock(&pm_lock); > + raise_softirq(TIMER_SOFTIRQ); > return error; > } > > diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c > index 0fd8bdba7067..9d66db861b74 100644 > --- a/xen/arch/x86/apic.c > +++ b/xen/arch/x86/apic.c > @@ -65,7 +65,6 @@ static struct { > unsigned int apic_lvt0; > unsigned int apic_lvt1; > unsigned int apic_lvterr; > - unsigned int apic_tmict; > unsigned int apic_tdcr; > unsigned int apic_thmr; > } apic_pm_state; > @@ -658,7 +657,6 @@ int lapic_suspend(void) > apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); > apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); > apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); > - apic_pm_state.apic_tmict = apic_read(APIC_TMICT); > apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); > if (maxlvt >= 5) > apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); > @@ -718,7 +716,7 @@ int lapic_resume(void) > apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); > apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); > apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); > - apic_write(APIC_TMICT, apic_pm_state.apic_tmict); > + apic_write(APIC_TMICT, 0); > apic_write(APIC_ESR, 0); > apic_read(APIC_ESR); > apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab
signature.asc
Description: PGP signature