On 11/18/22 16:10, Jason Andryuk wrote:
> On Fri, Nov 18, 2022 at 12:22 PM Andrew Cooper <andrew.coop...@citrix.com> wrote:
>> On 18/11/2022 14:39, Roger Pau Monne wrote:
>>> Nov 18 01:55:11.753936 (XEN) arch/x86/mm/hap/hap.c:304: d1 failed to allocate from HAP pool
>>> Nov 18 01:55:18.633799 (XEN) Failed to shatter gfn 7ed37: -12
>>> Nov 18 01:55:18.633866 (XEN) d1v0 EPT violation 0x19c (--x/rw-) gpa 0x0000007ed373a1 mfn 0x33ed37 type 0
>>> Nov 18 01:55:18.645790 (XEN) d1v0 Walking EPT tables for GFN 7ed37:
>>> Nov 18 01:55:18.645850 (XEN) d1v0  epte 9c0000047eba3107
>>> Nov 18 01:55:18.645893 (XEN) d1v0  epte 9c000003000003f3
>>> Nov 18 01:55:18.645935 (XEN) d1v0  --- GLA 0x7ed373a1
>>> Nov 18 01:55:18.657783 (XEN) domain_crash called from arch/x86/hvm/vmx/vmx.c:3758
>>> Nov 18 01:55:18.657844 (XEN) Domain 1 (vcpu#0) crashed on cpu#8:
>>> Nov 18 01:55:18.669781 (XEN) ----[ Xen-4.17-rc  x86_64  debug=y  Not tainted ]----
>>> Nov 18 01:55:18.669843 (XEN) CPU:    8
>>> Nov 18 01:55:18.669884 (XEN) RIP:    0020:[<000000007ed373a1>]
>>> Nov 18 01:55:18.681711 (XEN) RFLAGS: 0000000000010002   CONTEXT: hvm guest (d1v0)
>>> Nov 18 01:55:18.681772 (XEN) rax: 000000007ed373a1   rbx: 000000007ed3726c   rcx: 0000000000000000
>>> Nov 18 01:55:18.693713 (XEN) rdx: 000000007ed2e610   rsi: 0000000000008e38   rdi: 000000007ed37448
>>> Nov 18 01:55:18.693775 (XEN) rbp: 0000000001b410a0   rsp: 0000000000320880   r8:  0000000000000000
>>> Nov 18 01:55:18.705725 (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
>>> Nov 18 01:55:18.717733 (XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
>>> Nov 18 01:55:18.717794 (XEN) r15: 0000000000000000   cr0: 0000000000000011   cr4: 0000000000000000
>>> Nov 18 01:55:18.729713 (XEN) cr3: 0000000000400000   cr2: 0000000000000000
>>> Nov 18 01:55:18.729771 (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000002
>>> Nov 18 01:55:18.741711 (XEN) ds: 0028   es: 0028   fs: 0000   gs: 0000   ss: 0028   cs: 0020
>>> It seems to be related to the paging pool.  Adding Andrew and Henry so that they are aware.
>> Summary of what I've just given on IRC/Matrix.
>>
>> This crash is caused by two things.  First,
>>
>>   (XEN) FLASK: Denying unknown domctl: 86.
>>
>> because I completely forgot to wire up Flask for the new hypercalls.  But so did the original XSA-409 fix (as SECCLASS_SHADOW is behind CONFIG_X86), so I don't feel quite as bad about this.
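
For context, a minimal sketch of how the denial above comes about (assumed shape, not a verbatim excerpt of xen/xsm/flask/hooks.c): a domctl with no explicit case in flask_domctl() falls through to the default branch, which logs the "Denying unknown domctl" message and fails the hypercall that the toolstack issued.

    /* Sketch only -- assumed shape of the hook, abridged. */
    static int flask_domctl(struct domain *d, int cmd)
    {
        switch ( cmd )
        {
        /* ... one case per known domctl, each returning an access check ... */
        case XEN_DOMCTL_max_vcpus:
            return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS);

        /* No case (yet) for XEN_DOMCTL_{get,set}_paging_mempool_size. */

        default:
            /* Logs "FLASK: Denying unknown domctl: <cmd>" and denies. */
            return avc_unknown_permission("domctl", cmd);
        }
    }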
> Broken for ARM, but not for x86, right?  I think SECCLASS_SHADOW is available in the policy bits - it's just a question of whether the hook functions are available?
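
For illustration, an abridged sketch of the existing arrangement as I understand it (not a verbatim excerpt): the SHADOW class and its SHADOW__* bits are generated from the common policy for every architecture, but the hook that consumes them is only compiled on x86.

    /* Sketch only -- assumed layout of the x86-only shadow hook. */
    #ifdef CONFIG_X86
    static int flask_shadow_control(struct domain *d, uint32_t op)
    {
        uint32_t perm;

        switch ( op )
        {
        case XEN_DOMCTL_SHADOW_OP_OFF:
            perm = SHADOW__DISABLE;
            break;
        case XEN_DOMCTL_SHADOW_OP_ENABLE:
        case XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION:
            perm = SHADOW__ENABLE;
            break;
        default:
            perm = SHADOW__LOGDIRTY;
            break;
        }

        return current_has_perm(d, SECCLASS_SHADOW, perm);
    }
    #endif /* CONFIG_X86 */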
>> And second because libxl ignores the error it gets back, and blindly continues onward.  Anthony has posted "libs/light: Propagate libxl__arch_domain_create() return code" to fix the libxl half of the bug, and I posted a second libxl bugfix to fix an error message.  Both are very simple.
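
Schematically, the libxl half of the bug and its fix look something like the following (argument list approximate, and this is not Anthony's actual patch -- just the usual libxl error-propagation pattern):

    /* Before (roughly): the arch hook's return value is dropped, so a
     * failed paging-pool setup does not abort domain creation. */
    libxl__arch_domain_create(gc, d_config, state, domid);

    /* After: capture and propagate the failure up the creation path. */
    rc = libxl__arch_domain_create(gc, d_config, state, domid);
    if (rc)
        goto out;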
>> For Flask, we need new access vectors because this is a common hypercall, but I'm unsure how to interlink it with x86's shadow control.  This will require a bit of pondering, but it is probably easier to just leave them unlinked.
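
Concretely, the missing wiring would be a pair of cases ahead of the default branch in flask_domctl(), plus matching entries in the policy.  A sketch -- DOMAIN2__GET_MEMPOOL and DOMAIN2__SET_MEMPOOL are placeholder names, and which class they belong in is exactly the open question:

        /* Placeholder vector names and class -- sketch only. */
        case XEN_DOMCTL_get_paging_mempool_size:
            return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__GET_MEMPOOL);

        case XEN_DOMCTL_set_paging_mempool_size:
            return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SET_MEMPOOL);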
> It sort of seems like it could go under domain2, since domain/domain2 have most of the memory stuff, but it is non-PV.  shadow has its own set of hooks.  It could go in hvm, which already has some memory stuff.
Since the new hypercall is for managing a memory pool for any domain, though HVM is the only type supported today, IMHO it belongs under domain/domain2.
Something to consider is that there is another managed guest memory pool, the PoD pool, which has a dedicated privilege for it.  This leads me to the question of whether managing the PoD pool and managing the paging pool size should be separate accesses or fall under the same access.  IMHO it should be the latter, as I can see no benefit in disaggregating access to the PoD pool from access to the paging pool.  In fact, I find myself thinking in terms of whether the managing domain should have control over the size of any backing memory pool for the target domain; I am not seeing any benefit to discriminating between which backing memory pool a managing domain is able to manage.  With that said, I am open to being convinced otherwise.
Since this is an XSA fix that will be backported, moving the get/set PoD target hypercalls under a new permission would be too disruptive.  I would recommend introducing the set/getmempools permissions under the domain access vector, which for now would only control access to the paging pool.  Planning can then occur for 4.18 to look at transitioning get/set PoD target to being controlled via set/getmempools.
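
As a sketch of that recommendation (hypothetical names, and whether they end up in the domain class is not settled), the policy side would be a pair of new permissions in xen/xsm/flask/policy/access_vectors, with flask_domctl() checking them for the two new domctls:

    # Hypothetical additions to the existing 'domain' class in
    # xen/xsm/flask/policy/access_vectors (names not settled):
    #   control the size of the target domain's backing memory pools
        setmempool
        getmempool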
>> Flask is listed as experimental, which means it doesn't technically matter if we break it, but it is used by OpenXT, so not fixing it for 4.17 would be rather rude.
> It's definitely nicer to have functional Flask in the release.  OpenXT can use a backport if necessary, so it doesn't need to be a release blocker.  Having said that, Flask is a nice feature of Xen, so it would be good to have it functioning in 4.17.
As maintainer I would really prefer not to see 4.17 go out with any part of XSM broken.  While it is considered experimental, which I hope to rectify, it is a long-standing feature that has been kept stable and for which there is a sizeable user base.  IMHO it deserves a proper fix before release.
V/r,
Daniel P. Smith