Trying to pick this thread up from the discussion "Debugging Windows HVM
crashes on Ryzen 3xxx series CPUs."  I'm trying to summarize what I see
claimed, and my understanding of things, and am not necessarily speaking
as an authority, so please correct me where I'm wrong.

Modern Windows guests (at least Windows 10 and Windows Server 2016)
crash when running under Xen on AMD Ryzen 3xxx desktop-class cpus (but
not the corresponding server cpus).  This is true for all upstream
releases of Xen (i.e., has nothing to do with Xen 4.13 in particular).

Linux systems seem to work just fine.

It seems that reverting patch ca2eee92df44 (from Xen 3.4!) fixes the
issue for Steven.  This would seem to indicate that Windows running on
such systems is confused by the topology information presented by Xen.

A "proper fix" for this would involve presenting a coherent, rational
topology to guests, which in turn relies on the long-awaited CPUID
infrastructure, all of which is way out of scope for being fixed by the
4.13 release.

The revert in question *only* touches code in libxc; no Xen-side changes
are required.

One issue that was raised concerns migration.  But as upstream Xen
doesn't have cpu leveling (?), it's already not possible to migrate *to*
such a system from any other systems.  The main worry then would be
making sure that we deal properly with migrating *away* from such a
system to future systems in future versions.  But it's absolutely clear
that we can't simply apply the change across the board; it must only be
done on Ryzen 3xxx systems.

Given that, we have a couple of approaches we could take:

1. Document that Windows guests don't work under Xen 4.13 on Ryzen 3xxx,
and punt the issue to 4.14.

2. Try to figure out exactly which changes allow Windows to work, and
document that users should add those (temporarily) to xl.cfg files.  (If
setting these values is broken, this can be fixed.)

3. Add a libxl / xl flag that applies the changes here as-is (or with
the minimal changes necessary).

4. Have an environment variable the user can set which will cause the
toolstack to do the above on versions that don't have a "proper" fix.

5. If we can make outgoing migrations forward-compatible, then we could
think about automatically applying this feature only on the affected cpus.

Thoughts?

I think the first step should be to identify the minimum set of changes
that allow Windows to boot, and then see if we can't automatically apply
the changes in a forward-compatible manner (#5).  If we can't, then
trying to get existing configurations so that you can specify the right
bits is the next best option (#2); and having an environment variable
would be the final fall-back.

It shouldn't be terribly difficult, given the patch, to "bisect" the
minimum changes required to enable Windows guests to boot.  Who wants to
pick that up?
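
To make the "bisect" idea concrete, here is a rough sketch of one way
to do it: start from the full revert, drop one change at a time, and
keep the drop whenever the guest still boots.  Everything named here is
hypothetical: the hunk labels are placeholders, and boots() stands in
for whatever manual or scripted test rebuilds libxc with just those
changes and tries to start a Windows guest.  It also assumes that
adding more of the revert never breaks a configuration that already
boots, which may not hold in practice.

```python
from typing import Callable, FrozenSet, List, Sequence


def minimize(changes: Sequence[str],
             boots: Callable[[FrozenSet[str]], bool]) -> List[str]:
    """Greedily drop changes whose removal still lets the guest boot.

    `changes` are labels for the individual pieces of the revert
    (hypothetical names).  `boots` is a caller-supplied test that
    returns True if a Windows guest boots with exactly that subset
    applied.
    """
    kept = list(changes)
    if not boots(frozenset(kept)):
        raise ValueError("the full set of changes does not boot the guest")

    for change in list(changes):
        # Try the current set without this one change.
        trial = [c for c in kept if c != change]
        if boots(frozenset(trial)):
            # Still boots, so this change is not needed.
            kept = trial
    return kept
```

This needs one boot test per change plus one for the full set, so for a
patch with a handful of hunks it is a modest amount of work even if
each test is done by hand.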

 -George
