On 17.03.2025 18:38, Roger Pau Monné wrote:
> On Mon, Mar 17, 2025 at 05:11:56PM +0100, Jan Beulich wrote:
>> On 17.03.2025 16:56, Roger Pau Monné wrote:
>>> On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
>>>> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
>>>>> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki 
>>>>> wrote:
>>>>>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
>>>>>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
>>>>>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
>>>>>>>>
>>>>>>>> This one has working S3, so add a test for it here.
>>>>>>>>
>>>>>>>> Signed-off-by: Marek Marczykowski-Górecki 
>>>>>>>> <marma...@invisiblethingslab.com>
>>>>>>>> ---
>>>>>>>> Cc: Jan Beulich <jbeul...@suse.com>
>>>>>>>> Cc: Andrew Cooper <andrew.coop...@citrix.com>
>>>>>>>>
>>>>>>>> The suspend test added here currently fails on staging[1], but passes 
>>>>>>>> on
>>>>>>>> staging-4.19[2]. So the regression wants fixing before committing this
>>>>>>>> patch.
>>>>>>>>
>>>>>>>> [1] 
>>>>>>>> https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
>>>>>>>> [2] 
>>>>>>>> https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
>>>>>>> We could commit the patch now without the s3 test.
>>>>>>>
>>>>>>> I don't know what the x86 maintainers think about fixing the suspend
>>>>>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
>>>>>> I'm on it already, but it's annoying. Lets convert this thread to
>>>>>> discussion about the issue:
>>>>>>
>>>>>> So, I bisected it between staging-4.19 and master. The breakage is
>>>>>> somewhere between (inclusive):
>>>>>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
>>>>>> and
>>>>>> 47990ecef286 x86/boot: Improve MBI2 structure check
>>>>>>
>>>>>> But, the first one breaks booting on this system and it remains broken
>>>>>> until the second commit (or its parent) - at which point S3 is already
>>>>>> broken. So, there is a range of 71 commits that may be responsible...
>>>>>>
>>>>>> But then, based on a matrix chat and Jan's observation I've tried
>>>>>> reverting f75780d26b2f "xen: move per-cpu area management into common
>>>>>> code" just on top of 47990ecef286, and that fixed suspend.
>>>>>> Applying "xen/percpu: don't initialize percpu on resume" on top of
>>>>>> 47990ecef286 fixes suspend too.
>>>>>> But applying it on top of master
>>>>>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
>>>>>> but the failure mode is different than without the patch - system resets
>>>>>> on S3 resume, with no crash message on the serial console (even with
>>>>>> sync_console), instead of hanging.
>>>>>> And one more data point: reverting f75780d26b2f on top of master is the
>>>>>> same as applying "xen/percpu: don't initialize percpu on resume" on
>>>>>> master - system reset on S3 resume.
>>>>>> So, it looks like there are more issues...
>>>>> Another bisection round and I have the second culprit:
>>>>>
>>>>>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT 
>>>>> index hasn't changed
>>>>>
>>>>> With master+"xen/percpu: don't initialize percpu on resume"+revert of
>>>>> 8e60d47cf011 suspend works again on this AMD system.
>>>>
>>>> That's not surprising in the slightest.
>>>>
>>>> Caching hardware values in Xen isn't safe across S3, which QubesOS has
>>>> found time and time again, and for which we still have outstanding bugs.
>>>>
>>>> S3 turns most of the system off.  RAM gets preserved, but devices and
>>>> plenty of internal registers don't.
>>>
>>> I think I've spotted the issue.  enable_iommu() called on resume
>>> (ab)uses set_msi_affinity() to force an MSI register write, as it's
>>> previous behavior was to unconditionally propagate the values.  With
>>> my change it would no longer perform such writes on resume.
>>>
>>> I think the patch below should help.
>>>
>>> I wonder if we should unconditionally propagate the write from
>>> __setup_msi_irq(), as it's also unlikely to make any difference to
>>> skip that write, and would further keep the previous behavior.
>>
>> That might be better. In fact I wonder whether it wouldn't better be
>> callers of write_msi_msg() to use ...
>>
>>> ---
>>> commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
>>> Author: Roger Pau Monne <roger....@citrix.com>
>>> Date:   Mon Mar 17 15:40:11 2025 +0100
>>>
>>>     x86/msi: always propagate MSI writes when not in active system mode
>>>     
>>>     Relax the limitation on MSI register writes, and only apply it when the
>>>     system is in active state.  For example AMD IOMMU drivers rely on using
>>>     set_msi_affinity() to force an MSI register write on resume from
>>>     suspension.
>>>     
>>>     The original patch intention was to reduce the number of MSI register
>>>     writes when the system is in active state.  Leave the other states to
>>>     always perform the writes, as it's safer given the existing code, and 
>>> it's
>>>     expected to not make a difference performance wise.
>>>     
>>>     Reported-by: Marek Marczykowski-Górecki 
>>> <marma...@invisiblethingslab.com>
>>>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if 
>>> IRT index hasn't changed')
>>>     Signed-off-by: Roger Pau Monné <roger....@citrix.com>
>>>
>>> --- a/xen/arch/x86/msi.c
>>> +++ b/xen/arch/x86/msi.c
>>> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, 
>>> struct msi_msg *msg,
>>>  {
>>>      entry->msg = *msg;
>>>  
>>> +    if ( unlikely(system_state != SYS_STATE_active) )
>>
>> ... this condition as needed. Conceivably there might be cases where even
>> during resume writes aren't always necessary to propagate to hardware.
> 
> Isn't this going to be quite more cumbersome, and I don't think the
> cases outside of active mode are very interesting or relevant
> performance wise?

Certainly. As to cumbersome - having the conditions at the call sites
would also serve as kind of documentation. That said, this was just a
suggestion; I'm not going to insist. I see from other traffic on this
thread the updated patch is now fully addressing the regression. The
secondary change, if not broken out to a separate patch, would want
mentioning in the description, though (which may have been the plan
anyway).

Jan

> The main purpose of the original change was the reduce the number of
> PCI accesses during 'active' state, as every time an interrupt is
> migrated to a different CPU such a (possibly redundant) PCI config
> write would happen.
> 
> Thanks, Roger.


Reply via email to