Re: [PATCH 1/3] KVM: arm64: Fix S1PTW handling on RO memslots

2022-12-24 Thread Ard Biesheuvel
On Sat, 24 Dec 2022 at 13:19, Marc Zyngier  wrote:
>
> On Thu, 22 Dec 2022 13:01:55 +,
> Ard Biesheuvel  wrote:
> >
> > On Tue, 20 Dec 2022 at 21:09, Marc Zyngier  wrote:
> > >
> > > A recent development on the EFI front has resulted in guests having
> > > their page tables baked in the firmware binary, and mapped into
> > > the IPA space as part as a read-only memslot.
> > >
> > > Not only this is legitimate, but it also results in added security,
> > > so thumbs up. However, this clashes mildly with our handling of a S1PTW
> > > as a write to correctly handle AF/DB updates to the S1 PTs, and results
> > > in the guest taking an abort it won't recover from (the PTs mapping the
> > > vectors will suffer freom the same problem...).
> > >
> > > So clearly our handling is... wrong.
> > >
> > > Instead, switch to a two-pronged approach:
> > >
> > > - On S1PTW translation fault, handle the fault as a read
> > >
> > > - On S1PTW permission fault, handle the fault as a write
> > >
> > > This is of no consequence to SW that *writes* to its PTs (the write
> > > will trigger a non-S1PTW fault), and SW that uses RO PTs will not
> > > use AF/DB anyway, as that'd be wrong.
> > >
> > > Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write
> > > fault on S1PTW permission fault on instruction fetch") do we end-up
> > > with two back-to-back faults (page being evicted and faulted back).
> > > I don't think this is a case worth optimising for.
> > >
> > > Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission 
> > > fault on instruction fetch")
> > > Signed-off-by: Marc Zyngier 
> > > Cc: sta...@vger.kernel.org
> >
> > Reviewed-by: Ard Biesheuvel 
> >
> > I have tested this patch on my TX2 with one of the EFI builds in
> > question, and everything works as before (I never observed the issue
> > itself)
>
> If you get the chance, could you try with non-4kB page sizes? Here, I
> could only reproduce it with 16kB pages. It was firing like clockwork
> on Cortex-A55 with that.
>

I'll try on 64k but I don't have access to a 16k capable machine that
runs KVM atm (I'm still enjoying working wifi and GPU etc on my M1
Macbook Air)

> >
> > Regression-tested-by: Ard Biesheuvel 
> >
> > For the record, the EFI build in question targets QEMU/mach-virt and
> > switches to a set of read-only page tables in emulated NOR flash
> > straight out of reset, so it can create and populate the real page
> > tables with MMU and caches enabled. EFI does not use virtual memory or
> > paging so managing access flags or dirty bits in hardware is unlikely
> > to add any value, and it is not being used at the moment. And given
> > that this is emulated NOR flash, any ordinary write to it tears down
> > the r/o memslot altogether, and kicks the NOR flash emulation in QEMU
> > into programming mode, which is fully based on MMIO emulation and does
> > not use a memslot at all. IOW, even if we could figure out what store
> > the PTW was attempting to do, it is always going to be rejected since
> > the r/o page tables can only be modified by 'programming' the NOR
> > flash sector.
>
> Indeed, and this would be a pretty dodgy setup anyway.
>
> Thanks for having had a look,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 1/3] KVM: arm64: Fix S1PTW handling on RO memslots

2022-12-24 Thread Marc Zyngier
On Thu, 22 Dec 2022 13:01:55 +,
Ard Biesheuvel  wrote:
> 
> On Tue, 20 Dec 2022 at 21:09, Marc Zyngier  wrote:
> >
> > A recent development on the EFI front has resulted in guests having
> > their page tables baked in the firmware binary, and mapped into
> > the IPA space as part as a read-only memslot.
> >
> > Not only this is legitimate, but it also results in added security,
> > so thumbs up. However, this clashes mildly with our handling of a S1PTW
> > as a write to correctly handle AF/DB updates to the S1 PTs, and results
> > in the guest taking an abort it won't recover from (the PTs mapping the
> > vectors will suffer freom the same problem...).
> >
> > So clearly our handling is... wrong.
> >
> > Instead, switch to a two-pronged approach:
> >
> > - On S1PTW translation fault, handle the fault as a read
> >
> > - On S1PTW permission fault, handle the fault as a write
> >
> > This is of no consequence to SW that *writes* to its PTs (the write
> > will trigger a non-S1PTW fault), and SW that uses RO PTs will not
> > use AF/DB anyway, as that'd be wrong.
> >
> > Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write
> > fault on S1PTW permission fault on instruction fetch") do we end-up
> > with two back-to-back faults (page being evicted and faulted back).
> > I don't think this is a case worth optimising for.
> >
> > Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission 
> > fault on instruction fetch")
> > Signed-off-by: Marc Zyngier 
> > Cc: sta...@vger.kernel.org
> 
> Reviewed-by: Ard Biesheuvel 
> 
> I have tested this patch on my TX2 with one of the EFI builds in
> question, and everything works as before (I never observed the issue
> itself)

If you get the chance, could you try with non-4kB page sizes? Here, I
could only reproduce it with 16kB pages. It was firing like clockwork
on Cortex-A55 with that.

> 
> Regression-tested-by: Ard Biesheuvel 
> 
> For the record, the EFI build in question targets QEMU/mach-virt and
> switches to a set of read-only page tables in emulated NOR flash
> straight out of reset, so it can create and populate the real page
> tables with MMU and caches enabled. EFI does not use virtual memory or
> paging so managing access flags or dirty bits in hardware is unlikely
> to add any value, and it is not being used at the moment. And given
> that this is emulated NOR flash, any ordinary write to it tears down
> the r/o memslot altogether, and kicks the NOR flash emulation in QEMU
> into programming mode, which is fully based on MMIO emulation and does
> not use a memslot at all. IOW, even if we could figure out what store
> the PTW was attempting to do, it is always going to be rejected since
> the r/o page tables can only be modified by 'programming' the NOR
> flash sector.

Indeed, and this would be a pretty dodgy setup anyway.

Thanks for having had a look,

M.

-- 
Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 2/3] KVM: arm64: Handle S1PTW translation with TCR_HA set as a write

2022-12-24 Thread Marc Zyngier
On Thu, 22 Dec 2022 20:58:40 +,
Oliver Upton  wrote:
> 
> On Thu, Dec 22, 2022 at 09:01:15AM +, Marc Zyngier wrote:
> > On Wed, 21 Dec 2022 17:46:24 +, Oliver Upton  
> > wrote:
> > >  - When UFFD is in use, translation faults are reported to userspace as
> > >writes when from a RW memslot and reads when from an RO memslot.
> > 
> > Not quite: translation faults are reported as reads if TCR_EL1.HA
> > isn't set, and as writes if it is. Ignoring TCR_EL1.HD for a moment,
> > this matches exactly the behaviour of the page-table walker, which
> > will update the S1 PTs only if this bit is set.
> 
> My bad, yes you're right. I conflated the use case here with the
> architectural state.
> 
> I'm probably being way too pedantic, but I just wanted to make sure we
> agree about the ensuing subtlety. More below:
> 
> > Or is it what userfaultfd does on its own? That'd be confusing...
> > 
> > > 
> > >  - S1 page table memory is spuriously marked as dirty, as we presume a
> > >write immediately follows the translation fault. That isn't entirely
> > >senseless, as it would mean both the target page and the S1 PT that
> > >maps it are both old. This is nothing new I suppose, just weird.
> > 
> > s/old/young/ ?
> > 
> > I think you're confusing the PT access with the access that caused the
> > PT access (I'll have that printed on a t-shirt, thank you very much).
> 
> I'd buy it!
> 
> > Here, we're not considering the cause of the PT access anymore. If
> > TCR_EL1.HA is set, the S1 PT page will be marked as accessed even on a
> > read, and only that page.
> 
> I think this is where the disconnect might be. TCR_EL1.HA == 1 suggests
> a write could possibly follow, but I don't think it requires it. The
> page table walker must first load the S1 PTE before writing to it.

Ah, you're talking of the write to the PTE. Too many writes!

My reasoning is based on Rule LFTXR in DDI0487I.a, which says:

"When the PE performs a hardware update of the AF, it sets the AF to 1
 in the corresponding descriptor in memory, in a coherent manner,
 using an atomic read-modify-write of that descriptor."

An atomic-or operation fits this description, and I cannot see
anything in the architecture that would prevent the write of a PTE
even if AF is already set, such as mandating something like a
test-and-set or compare-and-swap.

I'm not saying this is the only possible implementation, or even a
good one. But I don't think this is incompatible with what the
architecture mandates.

> 
> From AArch64.S1Translate() (DDI0487H.a):
> 
> (fault, descaddress, walkstate, descriptor) = AArch64.S1Walk(fault, 
> walkparams, va, regime,
>ss, acctype, 
> iswrite, ispriv);
> 
> [...]
> 
> new_desc = descriptor;
> if walkparams.ha == '1' && AArch64.FaultAllowsSetAccessFlag(fault) then
>   // Set descriptor AF bit
>   new_desc<10> = '1';
> 
> [...]
> 
> // Either the access flag was clear or AP<2> is set
> if new_desc != descriptor then
>   if regime == Regime_EL10 && EL2Enabled() then
> s1aarch64 = TRUE;
>   s2fs1walk = TRUE;
>   aligned = TRUE;
>   iswrite = TRUE;
>   (s2fault, descupdateaddress) = AArch64.S2Translate(fault, descaddress, 
> s1aarch64,
>  ss, s2fs1walk, 
> AccType_ATOMICRW,
>  aligned, iswrite, 
> ispriv);
> 
> if s2fault.statuscode != Fault_None then
>   return (s2fault, AddressDescriptor UNKNOWN);
> else
>   descupdateaddress = descaddress;
> 
> (fault, mem_desc) = AArch64.MemSwapTableDesc(fault, descriptor, new_desc,
>walkparams.ee, 
> descupdateaddress)
> 
> Buried in AArch64.S1Walk() is a stage-2 walk for a read to fetch the
> descriptor. The second stage-2 walk for write is conditioned on having
> already fetched the stage-1 descriptor and determining the AF needs
> to be set.

The question is whether this is one possible implementation, or the
only possible implementation. My bet is on the former.

> Relating back to UFFD: if we expect KVM to do exactly what hardware
> does, UFFD should see an attempted read when the first walk fails
> because of an S2 translation fault. Based on this patch, though, we'd
> promote it to a write if TCR_EL1.HA == 1.
> 
> This has the additional nuance of marking the S1 PT's IPA as dirty, even
> though it might not actually have been written to. Having said that,
> the false positive rate should be negligible given that S1 PTs ought to
> account for a small amount of guest memory.
> 
> Like I said before, I'm probably being unnecessarily pedantic :) It just
> seems to me that the view we're giving userspace of S1PTW aborts isn't
> exactly architectural and I want to make sure that is explicitly
> intentional.

I think it is perfectly fine to be pedantic about these things,

Re: [PATCH 12/14] KVM: selftests: Use wildcards to find library source files

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Use $(wildcard ...) to find the library source files instead of manually
defining the inputs, which is a maintenance burden and error prone.


No, please don't.  This leads to weird errors, for example when "git 
checkout" is interrupted with ^C.   I have applied patches 1-11.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 10/14] KVM: selftests: Include lib.mk before consuming $(CC)

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which apparently makes the environment override the Makefile?!?!).


Yes, it does.  In projects that don't use configure or similar, you 
might have seen


CFLAGS = -O2 -g

to be overridden with "make CFLAGS=-g" if optimization is undesirable.

Paolo


Including lib.mk after using it for probing, e.g. for -no-pie, can lead
to weirdness.


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 09/14] KVM: selftests: Explicitly disable builtins for mem*() overrides

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset().  Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.

   /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in 
function `memcmp_ifunc':
   (.text+0x0): multiple definition of `memcmp'; 
tools/testing/selftests/kvm/lib/string_override.o:
   tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
   clang: error: linker command failed with exit code 1 (use -v to see 
invocation)


Hmm, that's weird though.  I think it's an effect of ifunc and maybe 
even a linker bug.  The patch makes sense anyway.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 06/14] KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86

2022-12-24 Thread Paolo Bonzini

On 12/13/22 01:16, Sean Christopherson wrote:

-ifeq ($(ARCH),riscv)
-   UNAME_M := riscv
+ifeq ($(ARCH),x86)
+   ARCH_DIR := x86_64
+else ifeq ($(ARCH),arm64)
+   ARCH_DIR := aarch64
+else ifeq ($(ARCH),s390)
+   ARCH_DIR := s390x
+else ifeq ($(ARCH),riscv)
+   ARCH_DIR := riscv
+else
+$(error Unknown architecture '$(ARCH)')
  endif


$(error) would break compiling via tools/testing/selftests/Makefile, so 
I am squashing this:


diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile

index d761a77c3a80..59f3eb53c932 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -13,10 +13,8 @@ else ifeq ($(ARCH),arm64)
ARCH_DIR := aarch64
 else ifeq ($(ARCH),s390)
ARCH_DIR := s390x
-else ifeq ($(ARCH),riscv)
-   ARCH_DIR := riscv
 else
-$(error Unknown architecture '$(ARCH)')
+   ARCH_DIR := $(ARCH)
 endif

 LIBKVM += lib/assert.c

Then the aarch64 and s390x directories can be renamed---x86 too, but the 
ifeq needs to stay (just changed to do x86_64->x86 instead of the other 
way round).


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm