Re: [PATCH 1/3] KVM: arm64: Fix S1PTW handling on RO memslots
On Sat, 24 Dec 2022 at 13:19, Marc Zyngier wrote: > > On Thu, 22 Dec 2022 13:01:55 +, > Ard Biesheuvel wrote: > > > > On Tue, 20 Dec 2022 at 21:09, Marc Zyngier wrote: > > > > > > A recent development on the EFI front has resulted in guests having > > > their page tables baked in the firmware binary, and mapped into > > > the IPA space as part as a read-only memslot. > > > > > > Not only this is legitimate, but it also results in added security, > > > so thumbs up. However, this clashes mildly with our handling of a S1PTW > > > as a write to correctly handle AF/DB updates to the S1 PTs, and results > > > in the guest taking an abort it won't recover from (the PTs mapping the > > > vectors will suffer freom the same problem...). > > > > > > So clearly our handling is... wrong. > > > > > > Instead, switch to a two-pronged approach: > > > > > > - On S1PTW translation fault, handle the fault as a read > > > > > > - On S1PTW permission fault, handle the fault as a write > > > > > > This is of no consequence to SW that *writes* to its PTs (the write > > > will trigger a non-S1PTW fault), and SW that uses RO PTs will not > > > use AF/DB anyway, as that'd be wrong. > > > > > > Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write > > > fault on S1PTW permission fault on instruction fetch") do we end-up > > > with two back-to-back faults (page being evicted and faulted back). > > > I don't think this is a case worth optimising for. > > > > > > Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission > > > fault on instruction fetch") > > > Signed-off-by: Marc Zyngier > > > Cc: sta...@vger.kernel.org > > > > Reviewed-by: Ard Biesheuvel > > > > I have tested this patch on my TX2 with one of the EFI builds in > > question, and everything works as before (I never observed the issue > > itself) > > If you get the chance, could you try with non-4kB page sizes? Here, I > could only reproduce it with 16kB pages. It was firing like clockwork > on Cortex-A55 with that. > I'll try on 64k but I don't have access to a 16k capable machine that runs KVM atm (I'm still enjoying working wifi and GPU etc on my M1 Macbook Air) > > > > Regression-tested-by: Ard Biesheuvel > > > > For the record, the EFI build in question targets QEMU/mach-virt and > > switches to a set of read-only page tables in emulated NOR flash > > straight out of reset, so it can create and populate the real page > > tables with MMU and caches enabled. EFI does not use virtual memory or > > paging so managing access flags or dirty bits in hardware is unlikely > > to add any value, and it is not being used at the moment. And given > > that this is emulated NOR flash, any ordinary write to it tears down > > the r/o memslot altogether, and kicks the NOR flash emulation in QEMU > > into programming mode, which is fully based on MMIO emulation and does > > not use a memslot at all. IOW, even if we could figure out what store > > the PTW was attempting to do, it is always going to be rejected since > > the r/o page tables can only be modified by 'programming' the NOR > > flash sector. > > Indeed, and this would be a pretty dodgy setup anyway. > > Thanks for having had a look, > > M. > > -- > Without deviation from the norm, progress is not possible. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 1/3] KVM: arm64: Fix S1PTW handling on RO memslots
On Thu, 22 Dec 2022 13:01:55 +, Ard Biesheuvel wrote: > > On Tue, 20 Dec 2022 at 21:09, Marc Zyngier wrote: > > > > A recent development on the EFI front has resulted in guests having > > their page tables baked in the firmware binary, and mapped into > > the IPA space as part as a read-only memslot. > > > > Not only this is legitimate, but it also results in added security, > > so thumbs up. However, this clashes mildly with our handling of a S1PTW > > as a write to correctly handle AF/DB updates to the S1 PTs, and results > > in the guest taking an abort it won't recover from (the PTs mapping the > > vectors will suffer freom the same problem...). > > > > So clearly our handling is... wrong. > > > > Instead, switch to a two-pronged approach: > > > > - On S1PTW translation fault, handle the fault as a read > > > > - On S1PTW permission fault, handle the fault as a write > > > > This is of no consequence to SW that *writes* to its PTs (the write > > will trigger a non-S1PTW fault), and SW that uses RO PTs will not > > use AF/DB anyway, as that'd be wrong. > > > > Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write > > fault on S1PTW permission fault on instruction fetch") do we end-up > > with two back-to-back faults (page being evicted and faulted back). > > I don't think this is a case worth optimising for. > > > > Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission > > fault on instruction fetch") > > Signed-off-by: Marc Zyngier > > Cc: sta...@vger.kernel.org > > Reviewed-by: Ard Biesheuvel > > I have tested this patch on my TX2 with one of the EFI builds in > question, and everything works as before (I never observed the issue > itself) If you get the chance, could you try with non-4kB page sizes? Here, I could only reproduce it with 16kB pages. It was firing like clockwork on Cortex-A55 with that. > > Regression-tested-by: Ard Biesheuvel > > For the record, the EFI build in question targets QEMU/mach-virt and > switches to a set of read-only page tables in emulated NOR flash > straight out of reset, so it can create and populate the real page > tables with MMU and caches enabled. EFI does not use virtual memory or > paging so managing access flags or dirty bits in hardware is unlikely > to add any value, and it is not being used at the moment. And given > that this is emulated NOR flash, any ordinary write to it tears down > the r/o memslot altogether, and kicks the NOR flash emulation in QEMU > into programming mode, which is fully based on MMIO emulation and does > not use a memslot at all. IOW, even if we could figure out what store > the PTW was attempting to do, it is always going to be rejected since > the r/o page tables can only be modified by 'programming' the NOR > flash sector. Indeed, and this would be a pretty dodgy setup anyway. Thanks for having had a look, M. -- Without deviation from the norm, progress is not possible. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 2/3] KVM: arm64: Handle S1PTW translation with TCR_HA set as a write
On Thu, 22 Dec 2022 20:58:40 +, Oliver Upton wrote: > > On Thu, Dec 22, 2022 at 09:01:15AM +, Marc Zyngier wrote: > > On Wed, 21 Dec 2022 17:46:24 +, Oliver Upton > > wrote: > > > - When UFFD is in use, translation faults are reported to userspace as > > >writes when from a RW memslot and reads when from an RO memslot. > > > > Not quite: translation faults are reported as reads if TCR_EL1.HA > > isn't set, and as writes if it is. Ignoring TCR_EL1.HD for a moment, > > this matches exactly the behaviour of the page-table walker, which > > will update the S1 PTs only if this bit is set. > > My bad, yes you're right. I conflated the use case here with the > architectural state. > > I'm probably being way too pedantic, but I just wanted to make sure we > agree about the ensuing subtlety. More below: > > > Or is it what userfaultfd does on its own? That'd be confusing... > > > > > > > > - S1 page table memory is spuriously marked as dirty, as we presume a > > >write immediately follows the translation fault. That isn't entirely > > >senseless, as it would mean both the target page and the S1 PT that > > >maps it are both old. This is nothing new I suppose, just weird. > > > > s/old/young/ ? > > > > I think you're confusing the PT access with the access that caused the > > PT access (I'll have that printed on a t-shirt, thank you very much). > > I'd buy it! > > > Here, we're not considering the cause of the PT access anymore. If > > TCR_EL1.HA is set, the S1 PT page will be marked as accessed even on a > > read, and only that page. > > I think this is where the disconnect might be. TCR_EL1.HA == 1 suggests > a write could possibly follow, but I don't think it requires it. The > page table walker must first load the S1 PTE before writing to it. Ah, you're talking of the write to the PTE. Too many writes! My reasoning is based on Rule LFTXR in DDI0487I.a, which says: "When the PE performs a hardware update of the AF, it sets the AF to 1 in the corresponding descriptor in memory, in a coherent manner, using an atomic read-modify-write of that descriptor." An atomic-or operation fits this description, and I cannot see anything in the architecture that would prevent the write of a PTE even if AF is already set, such as mandating something like a test-and-set or compare-and-swap. I'm not saying this is the only possible implementation, or even a good one. But I don't think this is incompatible with what the architecture mandates. > > From AArch64.S1Translate() (DDI0487H.a): > > (fault, descaddress, walkstate, descriptor) = AArch64.S1Walk(fault, > walkparams, va, regime, >ss, acctype, > iswrite, ispriv); > > [...] > > new_desc = descriptor; > if walkparams.ha == '1' && AArch64.FaultAllowsSetAccessFlag(fault) then > // Set descriptor AF bit > new_desc<10> = '1'; > > [...] > > // Either the access flag was clear or AP<2> is set > if new_desc != descriptor then > if regime == Regime_EL10 && EL2Enabled() then > s1aarch64 = TRUE; > s2fs1walk = TRUE; > aligned = TRUE; > iswrite = TRUE; > (s2fault, descupdateaddress) = AArch64.S2Translate(fault, descaddress, > s1aarch64, > ss, s2fs1walk, > AccType_ATOMICRW, > aligned, iswrite, > ispriv); > > if s2fault.statuscode != Fault_None then > return (s2fault, AddressDescriptor UNKNOWN); > else > descupdateaddress = descaddress; > > (fault, mem_desc) = AArch64.MemSwapTableDesc(fault, descriptor, new_desc, >walkparams.ee, > descupdateaddress) > > Buried in AArch64.S1Walk() is a stage-2 walk for a read to fetch the > descriptor. The second stage-2 walk for write is conditioned on having > already fetched the stage-1 descriptor and determining the AF needs > to be set. The question is whether this is one possible implementation, or the only possible implementation. My bet is on the former. > Relating back to UFFD: if we expect KVM to do exactly what hardware > does, UFFD should see an attempted read when the first walk fails > because of an S2 translation fault. Based on this patch, though, we'd > promote it to a write if TCR_EL1.HA == 1. > > This has the additional nuance of marking the S1 PT's IPA as dirty, even > though it might not actually have been written to. Having said that, > the false positive rate should be negligible given that S1 PTs ought to > account for a small amount of guest memory. > > Like I said before, I'm probably being unnecessarily pedantic :) It just > seems to me that the view we're giving userspace of S1PTW aborts isn't > exactly architectural and I want to make sure that is explicitly > intentional. I think it is perfectly fine to be pedantic about these things,
Re: [PATCH 12/14] KVM: selftests: Use wildcards to find library source files
On 12/13/22 01:16, Sean Christopherson wrote: Use $(wildcard ...) to find the library source files instead of manually defining the inputs, which is a maintenance burden and error prone. No, please don't. This leads to weird errors, for example when "git checkout" is interrupted with ^C. I have applied patches 1-11. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 10/14] KVM: selftests: Include lib.mk before consuming $(CC)
On 12/13/22 01:16, Sean Christopherson wrote: Include lib.mk before consuming $(CC) and document that lib.mk overwrites $(CC) unless make was invoked with -e or $(CC) was specified after make (which apparently makes the environment override the Makefile?!?!). Yes, it does. In projects that don't use configure or similar, you might have seen CFLAGS = -O2 -g to be overridden with "make CFLAGS=-g" if optimization is undesirable. Paolo Including lib.mk after using it for probing, e.g. for -no-pie, can lead to weirdness. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 09/14] KVM: selftests: Explicitly disable builtins for mem*() overrides
On 12/13/22 01:16, Sean Christopherson wrote: Explicitly disable the compiler's builtin memcmp(), memcpy(), and memset(). Because only lib/string_override.c is built with -ffreestanding, the compiler reserves the right to do what it wants and can try to link the non-freestanding code to its own crud. /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc': (.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o: tools/testing/selftests/kvm/lib/string_override.c:15: first defined here clang: error: linker command failed with exit code 1 (use -v to see invocation) Hmm, that's weird though. I think it's an effect of ifunc and maybe even a linker bug. The patch makes sense anyway. Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 06/14] KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86
On 12/13/22 01:16, Sean Christopherson wrote: -ifeq ($(ARCH),riscv) - UNAME_M := riscv +ifeq ($(ARCH),x86) + ARCH_DIR := x86_64 +else ifeq ($(ARCH),arm64) + ARCH_DIR := aarch64 +else ifeq ($(ARCH),s390) + ARCH_DIR := s390x +else ifeq ($(ARCH),riscv) + ARCH_DIR := riscv +else +$(error Unknown architecture '$(ARCH)') endif $(error) would break compiling via tools/testing/selftests/Makefile, so I am squashing this: diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index d761a77c3a80..59f3eb53c932 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -13,10 +13,8 @@ else ifeq ($(ARCH),arm64) ARCH_DIR := aarch64 else ifeq ($(ARCH),s390) ARCH_DIR := s390x -else ifeq ($(ARCH),riscv) - ARCH_DIR := riscv else -$(error Unknown architecture '$(ARCH)') + ARCH_DIR := $(ARCH) endif LIBKVM += lib/assert.c Then the aarch64 and s390x directories can be renamed---x86 too, but the ifeq needs to stay (just changed to do x86_64->x86 instead of the other way round). Paolo ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm