On 01/11/2021 10:53, Ian Jackson wrote:
> Andrew Cooper writes ("[PATCH] x86/kexec: Fix crash on transition to a 32bit 
> kernel on AMD hardware"):
>> The `ljmp *mem` instruction is (famously?) not binary compatible between 
>> Intel
>> and AMD CPUS.  The AMD-compatible version would require .long to be .quad in
>> the second hunk.
>>
>> Switch to using lretq, which is compatible between Intel and AMD, as well as
>> being less logic overall.
>>
>> Fixes: 5a82d5cf352d ("kexec: extend hypercall with improved load/unload ops")
>> Signed-off-by: Andrew Cooper <[email protected]>
>> ---
>> CC: Jan Beulich <[email protected]>
>> CC: Roger Pau Monné <[email protected]>
>> CC: Wei Liu <[email protected]>
>> CC: Ian Jackson <[email protected]>
>>
>> For 4.16.  This is a bugfix for rare (so rare it has probably never been
>> exercised) but plain-broken usecase.
>>
>> One argument against taking it says that this has been broken for 8 years
>> already, so what's a few extra weeks.  Another is that this patch is only
>> compile tested because I don't have a suitable setup to repro, nor the time 
>> to
>> try organising one.
> Thanks for being frank about testing.
>
> The bug is a ?race? ?  Which hardly ever happens ?  Or it only affects
> some strange configurations ?  Or ... ?

Strange configuration.

On AMD hardware, if you try to use a 32bit crash kernel, then Xen will
unconditionally crash when trying to transition to it.

Any other scenario (Intel hardware, or a 64bit crash kernel) will work
fine and without incident.

>> On the other hand, I specifically used the point of binary incompatibility to
>> persuade Intel to drop Call Gates out of the architecture in the forthcoming
>> FRED spec.
> I'm afraid I can't make head or tail of this.  What are the
> implications ?

I managed to get some CPU architects to agree that there was a binary
incompatibility here.

>> The lretq pattern used here matches x86_32_switch() in
>> xen/arch/x86/boot/head.S, and this codepath is executed on every MB2+EFI
>> xen.gz boot, which from XenServer alone is a very wide set of testing.
> AIUI this is an argument saying that the basic principle of this
> change is good.  Good.
>
> However: is there some risk of a non-catastrophic breakage here, for
> example, if there was a slip in the actual implementation ?
> (Catastrophic breakage would break all our tests, I think.)

This path is only taken for a 32bit crash kernel.  It is not taken for
64bit crash kernels, or they wouldn't work on AMD either, and this is
something we test routinely in XenServer.

The worst that can happen is that I've messed the lretq pattern up, and
broken transition to all 32bit crash kernels, irrespective of hardware
vendor.

It will either function correctly, or explode.  If it is broken, it
won't be subtle, or dependent on the phase of the moon/etc.

~Andrew


Reply via email to