On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote:
> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
>>>> We've got several reports that S3 reliability recently regressed. We
>>>> identified it's definitely related to XSA-471 patches, and bisection
>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have
>>>> reliable reproduction steps, so I'm not 100% sure if it's really this
>>>> patch, or maybe an earlier one - but it's definitely already broken at
>>>> this point in the series. Most reports are about Xen 4.17 (as that's
>>>> what stable Qubes OS version currently use), but I think I've seen
>>>> somebody reporting the issue on 4.19 too (but I don't have clear
>>>> evidence, especially if it's the same issue).
>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
>>> shouldn't cause problems only after resume, but then it may have covered
>>> for some omission during resume. As a far-fetched experiment, could you
>>> try putting that back (including the calculation of the "expires" local
>>> variable)?
>> Sure, I'll try.
> It appears this fixes the issue, at least in ~10 attempts so far
> (usually I could reproduce the issue after 2-3 attempts).
>
Can you show the exact code which seems to have made this stable?

We discussed this in the x86 maintainers meeting, and our best guess is
a timer that's not torn down or recreated properly on S3, but this is
largely speculation.

~Andrew