RE: [PATCH] doc/arm: take care restore order of GICR_* in ITS restore

2021-07-21 Thread Jianyong Wu
Hello Marc,

> -Original Message-
> From: Marc Zyngier 
> Sent: Wednesday, July 21, 2021 5:54 PM
> To: Jianyong Wu 
> Cc: James Morse ; Andre Przywara
> ; lushenm...@huawei.com;
> k...@vger.kernel.org; kvmarm@lists.cs.columbia.edu; linux-
> d...@vger.kernel.org; linux-ker...@vger.kernel.org; Justin He
> 
> Subject: Re: [PATCH] doc/arm: take care restore order of GICR_* in ITS
> restore
>
> On Wed, 21 Jul 2021 10:20:19 +0100,
> Jianyong Wu  wrote:
> >
> > When restoring the GIC/ITS, GICR_CTLR must be restored after
> > GICR_PROPBASER and GICR_PENDBASER. This matters because writes to both
> > GICR_PROPBASER and GICR_PENDBASER are rejected once LPIs have been
> > enabled in GICR_CTLR. Keeping the restore order above avoids that issue.
> > Spelling it out in the doc is very helpful and may save a lot of debugging work.
>
> But that's something that is already mandated by the architecture, isn't it?
> See "5.1 LPIs" in the architecture spec:
>
> 
>
> If GICR_PROPBASER is updated when GICR_CTLR.EnableLPIs == 1, the effects
> are UNPREDICTABLE.
>
> [...]
>
> If GICR_PENDBASER is updated when GICR_CTLR.EnableLPIs == 1, the effects
> are UNPREDICTABLE.
>

I think this "UNPREDICTABLE" relates to the physical machine. Am I right?
In a virtualization environment, the kernel gives a definite answer: we must
not enable GICR_CTLR.EnableLPIs before restoring GICR_PROPBASER (and
GICR_PENDBASER) when restoring the GIC ITS in the VMM, see [1]. Thus, should
we consider the virtualization environment a special case?

[1] linux/arch/arm64/kvm/vgic/vgic-mmio-v3.c
static void vgic_mmio_write_propbase(struct kvm_vcpu *vcpu,
				     gpa_t addr, unsigned int len,
				     unsigned long val)
{
	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
	u64 old_propbaser, propbaser;

	/* Storing a value with LPIs already enabled is undefined */
	if (vgic_cpu->lpis_enabled)
		return;
	...
}
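
For illustration, a minimal VMM-side sketch of the required ordering. The
gicr_write() helper and the saved-state struct are hypothetical; in practice
these writes go through KVM_DEV_ARM_VGIC_GRP_REDIST_REGS on the vGIC device fd.

/*
 * Sketch only: restore the redistributor base registers before GICR_CTLR,
 * because KVM rejects GICR_PROPBASER/GICR_PENDBASER writes once
 * GICR_CTLR.EnableLPIs is set (see vgic_mmio_write_propbase above).
 */
struct redist_state {			/* hypothetical saved state */
	uint64_t propbaser;
	uint64_t pendbaser;
	uint32_t ctlr;
};

static int restore_redist(int vcpu, const struct redist_state *s)
{
	int ret;

	/* 1. Base registers first, while GICR_CTLR.EnableLPIs is still clear */
	ret = gicr_write(vcpu, GICR_PROPBASER, s->propbaser);
	if (ret)
		return ret;
	ret = gicr_write(vcpu, GICR_PENDBASER, s->pendbaser);
	if (ret)
		return ret;

	/* 2. Only now restore GICR_CTLR, which may set EnableLPIs */
	return gicr_write(vcpu, GICR_CTLR, s->ctlr);
}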

Thanks
Jianyong

> 
>
> The point of this documentation is to make it explicit what is *not* covered
> by the architecture. Anything that is in the architecture still applies, and
> shouldn't be overlooked.
>
> Thanks,
>
>   M.
>
> --
> Without deviation from the norm, progress is not possible.


Re: [PATCH 00/16] KVM: arm64: MMIO guard PV services

2021-07-21 Thread Andrew Jones
On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> KVM/arm64 currently considers that any memory access outside of a
> memslot is a MMIO access. This so far has served us very well, but
> obviously relies on the guest trusting the host, and especially
> userspace to do the right thing.
> 
> As we keep on hacking away at pKVM, it becomes obvious that this trust
> model is not really fit for a confidential computing environment, and
> that the guest would require some guarantees that emulation only
> occurs on portions of the address space that have clearly been
> identified for this purpose.

This trust model is hard for me to reason about. userspace is trusted to
control the life cycle of the VM, to prepare the memslots for the VM,
and [presumably] identify what MMIO ranges are valid, yet it's not
trusted to handle invalid MMIO accesses. I'd like to learn more about
this model and the userspace involved.

> 
> This series aims at providing the two sides of the above coin:
> 
> - a set of PV services (collectively called 'MMIO guard' -- better
>   name required!) where the guest can flag portion of its address
>   space that it considers as MMIO, with map/unmap semantics. Any
>   attempt to access a MMIO range outside of these regions will result
>   in an external abort being injected.
> 
> - a set of hooks into the ioremap code allowing a Linux guest to tell
>   KVM about things it want to consider as MMIO. I definitely hate this
>   part of the series, as it feels clumsy and brittle.
> 
> For now, the enrolment in this scheme is controlled by a guest kernel
> command-line parameters, but it is expected that KVM will enforce this
> for protected VMs.
> 
> Note that this crucially misses a save/restore interface for non
> protected VMs, and I currently don't have a good solution for
> that. Ideas welcome.
> 
> I also plan to use this series as a base for some other purposes,
> namely to trick the guest in telling us how it maps things like
> prefetchable BARs (see the discussion at [1]). That part is not
> implemented yet, but there is already some provision to pass the MAIR
> index across.
> 
> Patches on top of 5.14-rc1, branch pushed at the usual location.
> 
> [1] 20210429162906.32742-1-sdonthin...@nvidia.com

The fun never stops.

Thanks,
drew



Re: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

2021-07-21 Thread Andrew Jones
On Thu, Jul 15, 2021 at 05:31:53PM +0100, Marc Zyngier wrote:
> Document the hypercalls user for the MMIO guard infrastructure.
> 
> Signed-off-by: Marc Zyngier 
> ---
>  Documentation/virt/kvm/arm/index.rst  |  1 +
>  Documentation/virt/kvm/arm/mmio-guard.rst | 73 +++
>  2 files changed, 74 insertions(+)
>  create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst
> 
> diff --git a/Documentation/virt/kvm/arm/index.rst 
> b/Documentation/virt/kvm/arm/index.rst
> index 78a9b670aafe..e77a0ee2e2d4 100644
> --- a/Documentation/virt/kvm/arm/index.rst
> +++ b/Documentation/virt/kvm/arm/index.rst
> @@ -11,3 +11,4 @@ ARM
> psci
> pvtime
> ptp_kvm
> +   mmio-guard
> diff --git a/Documentation/virt/kvm/arm/mmio-guard.rst 
> b/Documentation/virt/kvm/arm/mmio-guard.rst
> new file mode 100644
> index ..a5563a3e12cc
> --- /dev/null
> +++ b/Documentation/virt/kvm/arm/mmio-guard.rst
> @@ -0,0 +1,73 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============
> +KVM MMIO guard
> +==============
> +
> +KVM implements device emulation by handling translation faults to any
> +IPA range that is not contained a memory slot. Such translation fault
s/contained a/contained in a/, s/Such translation/Such a translation/

> +is in most cases passed on to userspace (or in rare cases to the host
> +kernel) with the address, size and possibly data of the access for
> +emulation.
> +
> +Should the guest exit with an address that is not one that corresponds
> +to an emulatable device, userspace may take measures that are not the
> +most graceful as far as the guest is concerned (such as terminating it
> +or delivering a fatal exception).
> +
> +There is also an element of trust: by forwarding the request to
> +userspace, the kernel asumes that the guest trusts userspace to do the

assumes
  
> +right thing.
> +
> +The KVM MMIO guard offers a way to mitigate this last point: a guest
> +can request that only certainly regions of the IPA space are valid as

certain

> +MMIO. Only these regions will be handled as an MMIO, and any other
> +will result in an exception being delivered to the guest.
> +
> +This relies on a set of hypercalls defined in the KVM-specific range,
> +using the HVC64 calling convention.
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO
> +
> +==
> +Function ID:  (uint32)0xC602
> +Arguments:none
> +Return Values:(int64) NOT_SUPPORTED(-1) on error, or
> +  (uint64)Protection Granule (PG) size in
> +   bytes (r0)
> +==
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL
> +
> +====
> +Function ID:  (uint32)0xC603
> +Arguments:none
> +Return Values:(int64) NOT_SUPPORTED(-1) on error, or
> +  RET_SUCCESS(0) (r0)
> +====
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP
> +
> +====
> +Function ID:  (uint32)0xC604
> +Arguments:(uint64)The base of the PG-sized IPA range
> +  that is allowed to be accessed as
> +   MMIO. Must aligned to the PG size (r1)

align

> +  (uint64)Index in the MAIR_EL1 register
> +   providing the memory attribute that
> +   is used by the guest (r2)
> +Return Values:(int64) NOT_SUPPORTED(-1) on error, or
> +  RET_SUCCESS(0) (r0)
> +====
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP
> +
> +====
> +Function ID:  (uint32)0xC604

copy+paste error, should be 0xC605

> +Arguments:(uint64)The base of the PG-sized IPA range
> +  that is forbidden to be accessed as

is now forbidden

or

was allowed

or just drop that part of the sentence because its covered by the "and
have been previously mapped" part. Something like

PG-sized IPA range aligned to the PG size which has been previously mapped
(r1)

> +   MMIO. Must aligned to the PG size

align

> +   and have been previously mapped (r1)
> +Return Values:(int64) NOT_SUPPORTED(-1) on error, or
> +  RET_SUCCESS(0) (r0)
> +====
> -- 
> 2.30.2
> 
> ___
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> 
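
For reference, a hedged sketch of what a guest-side wrapper for the MAP
hypercall above could look like. The function-ID macro is a placeholder taken
from the table; only the HVC64 calling convention and the argument layout
(base IPA in r1, MAIR_EL1 index in r2) come from the document.

#include <linux/arm-smccc.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Placeholder for the MAP entry (listed as 0xC604 in the table above) */
#define MMIO_GUARD_MAP_FUNC_ID	0xC604

static int mmio_guard_map(phys_addr_t ipa, u8 mair_idx)
{
	struct arm_smccc_res res;

	/* base IPA of the PG-sized range in x1, MAIR_EL1 index in x2 */
	arm_smccc_1_1_invoke(MMIO_GUARD_MAP_FUNC_ID, ipa, mair_idx, &res);

	return res.a0 == SMCCC_RET_SUCCESS ? 0 : -EINVAL;
}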

Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size

2021-07-21 Thread Sean Christopherson
On Wed, Jul 21, 2021, Will Deacon wrote:
> > For the page tables liveliness, KVM implements mmu_notifier_ops.release, 
> > which is
> > invoked at the beginning of exit_mmap(), before the page tables are freed.  
> > In
> > its implementation, KVM takes mmu_lock and zaps all its shadow page tables, 
> > a.k.a.
> > the stage2 tables in KVM arm64.  The flow in question, 
> > get_user_mapping_size(),
> > also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> > guaranteed to run with live userspace tables.
> 
> Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
> to zero, right?

Yep.

> The vCPU tasks should hold references to that afaict, so I don't think it
> should be possible for exit_mmap() to run while there are vCPUs running with
> the corresponding page-table.

Ah, right, I was thinking of non-KVM code that operated on the page tables 
without
holding a reference to mm_users.

> > Looking at the arm64 code, one thing I'm not clear on is whether arm64 
> > correctly
> > handles the case where exit_mmap() wins the race.  The invalidate_range 
> > hooks will
> > still be called, so userspace page tables aren't a problem, but
> > kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt 
> > without
> > any additional notifications that I see.  x86 deals with this by ensuring 
> > its
> > top-level TDP entry (stage2 equivalent) is valid while the page fault 
> > handler is
> > running.
> 
> But the fact that x86 handles this race has me worried. What am I missing?

I don't think you're missing anything.  I forgot that KVM_RUN would require an
elevated mm_users.  x86 does handle the impossible race, but that's 
coincidental.
The extra protections in x86 are to deal with other cases where a vCPU's 
top-level
SPTE can be invalidated while the vCPU is running.


Re: [PATCH v3 12/12] selftests: KVM: Add counter emulation benchmark

2021-07-21 Thread Andrew Jones
On Mon, Jul 19, 2021 at 06:49:49PM +, Oliver Upton wrote:
> Add a test case for counter emulation on arm64. A side effect of how KVM
> handles physical counter offsetting on non-ECV systems is that the
> virtual counter will always hit hardware and the physical could be
> emulated. Force emulation by writing a nonzero offset to the physical
> counter and compare the elapsed cycles to a direct read of the hardware
> register.
> 
> Reviewed-by: Ricardo Koller 
> Signed-off-by: Oliver Upton 
> ---
>  tools/testing/selftests/kvm/.gitignore|   1 +
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../kvm/aarch64/counter_emulation_benchmark.c | 215 ++
>  3 files changed, 217 insertions(+)
>  create mode 100644 
> tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore 
> b/tools/testing/selftests/kvm/.gitignore
> index 2752813d5090..1d811c6a769b 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  /aarch64/debug-exceptions
> +/aarch64/counter_emulation_benchmark

alphabetic order please

>  /aarch64/get-reg-list
>  /aarch64/vgic_init
>  /s390x/memop
> diff --git a/tools/testing/selftests/kvm/Makefile 
> b/tools/testing/selftests/kvm/Makefile
> index d89908108c97..e560a3e74bc2 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -86,6 +86,7 @@ TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
>  TEST_GEN_PROGS_x86_64 += system_counter_offset_test
>  
>  TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
> +TEST_GEN_PROGS_aarch64 += aarch64/counter_emulation_benchmark

alphabetic order please

>  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
>  TEST_GEN_PROGS_aarch64 += aarch64/vgic_init
>  TEST_GEN_PROGS_aarch64 += demand_paging_test
> diff --git 
> a/tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c 
> b/tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
> new file mode 100644
> index ..73aeb6cdebfe
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * counter_emulation_benchmark.c -- test to measure the effects of counter
> + * emulation on guest reads of the physical counter.
> + *
> + * Copyright (c) 2021, Google LLC.
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "kvm_util.h"
> +#include "processor.h"
> +#include "test_util.h"
> +
> +#define VCPU_ID 0
> +
> +static struct counter_values {
> + uint64_t cntvct_start;
> + uint64_t cntpct;
> + uint64_t cntvct_end;
> +} counter_values;
> +
> +static uint64_t nr_iterations = 1000;
> +
> +static void do_test(void)
> +{
> + /*
> +  * Open-coded approach instead of using helper methods to keep a tight
> +  * interval around the physical counter read.
> +  */
> + asm volatile("isb\n\t"
> +  "mrs %[cntvct_start], cntvct_el0\n\t"
> +  "isb\n\t"
> +  "mrs %[cntpct], cntpct_el0\n\t"
> +  "isb\n\t"
> +  "mrs %[cntvct_end], cntvct_el0\n\t"
> +  "isb\n\t"
> +  : [cntvct_start] "=r"(counter_values.cntvct_start),
> +  [cntpct] "=r"(counter_values.cntpct),
> +  [cntvct_end] "=r"(counter_values.cntvct_end));
> +}
> +
> +static void guest_main(void)
> +{
> + int i;
> +
> + for (i = 0; i < nr_iterations; i++) {
> + do_test();
> + GUEST_SYNC(i);
> + }
> +
> + for (i = 0; i < nr_iterations; i++) {
> + do_test();
> + GUEST_SYNC(i);
> + }
> +
> + GUEST_DONE();
> +}
> +
> +static bool enter_guest(struct kvm_vm *vm)
> +{
> + struct ucall uc;
> +
> + vcpu_ioctl(vm, VCPU_ID, KVM_RUN, NULL);
> +
> + switch (get_ucall(vm, VCPU_ID, &uc)) {
> + case UCALL_DONE:
> + return true;
> + case UCALL_SYNC:
> + break;
> + case UCALL_ABORT:
> + TEST_ASSERT(false, "%s at %s:%ld", (const char *)uc.args[0],
> + __FILE__, uc.args[1]);
> + break;
> + default:
> + TEST_ASSERT(false, "unexpected exit: %s",
> + exit_reason_str(vcpu_state(vm, 
> VCPU_ID)->exit_reason));
> + break;
> + }
> +
> + /* more work to do in the guest */
> + return false;
> +}
> +
> +static double counter_frequency(void)
> +{
> + uint32_t freq;
> +
> + asm volatile("mrs %0, cntfrq_el0"
> +  : "=r" (freq));
> +
> + return freq / 100.0;
> +}
> +
> +static void log_csv(FILE *csv, bool trapped)
> +{
> + double freq = counter_frequency();
> +
> + fprintf(csv, "%s,%.02f,%lu,%lu,%lu\n",
> + trapped ? "true" : 

Re: [RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-07-21 Thread Jean-Philippe Brucker
On Tue, Jun 08, 2021 at 05:48:01PM +0200, Jean-Philippe Brucker wrote:
> Allow userspace to request handling PSCI calls from guests. Our goal is
> to enable a vCPU hot-add solution for Arm where the VMM presents
> possible resources to the guest at boot, and controls which vCPUs can be
> brought up by allowing or denying PSCI CPU_ON calls.

Since it looks like vCPU hot-add will be implemented differently, I don't
intend to resend this series at the moment. But some of it could be
useful for other projects and to avoid the helpful review effort going to
waste, I fixed it up and will leave it on branch
https://jpbrucker.net/git/linux/log/?h=kvm/psci-to-userspace
It now only uses KVM_CAP_EXIT_HYPERCALL introduced in v5.14.

Thanks,
Jean


Re: [RFC PATCH 0/5] KVM: arm64: Pass PSCI to userspace

2021-07-21 Thread Jean-Philippe Brucker
On Mon, Jul 19, 2021 at 12:37:52PM -0700, Oliver Upton wrote:
> On Mon, Jul 19, 2021 at 11:02 AM Jean-Philippe Brucker
>  wrote:
> > We forward the whole PSCI function range, so it's either KVM or userspace.
> > If KVM manages PSCI and the guest calls an unimplemented function, that
> > returns directly to the guest without going to userspace.
> >
> > The concern is valid for any other range, though. If userspace enables the
> > HVC cap it receives function calls that at some point KVM might need to
> > handle itself. So we need some negotiation between user and KVM about the
> > specific HVC ranges that userspace can and will handle.
> 
> Are we going to use KVM_CAPs for every interesting HVC range that
> userspace may want to trap? I wonder if a more generic interface for
> hypercall filtering would have merit to handle the aforementioned
> cases, and whatever else a VMM will want to intercept down the line.
> 
> For example, x86 has the concept of 'MSR filtering', wherein userspace
> can specify a set of registers that it wants to intercept. Doing
> something similar for HVCs would avoid the need for a kernel change
> each time a VMM wishes to intercept a new hypercall.

Yes we could introduce a VM device group for this:
* User reads attribute KVM_ARM_VM_HVC_NR_SLOTS, which defines the number
  of available HVC ranges.
* User writes attribute KVM_ARM_VM_HVC_SET_RANGE with one range
  struct kvm_arm_hvc_range {
  __u32 slot;
  #define KVM_ARM_HVC_USER (1 << 0) /* Enable range. 0 disables it */
  __u16 flags;
  __u16 imm;
  __u32 fn_start;
  __u32 fn_end;
  };
* KVM forwards any HVC within this range to userspace.
* If one of the ranges is PSCI functions, disable KVM PSCI.

Since it's more work for KVM to keep track of ranges, I didn't include it
in the RFC, and I'm going to leave it to the next person dealing with this
stuff :)
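
As a rough sketch of the userspace side (none of this exists in mainline;
the group constant, attribute names and the PSCI function IDs below are
illustrative only, reusing the struct from the proposal above):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: program one HVC range via the proposed VM device group */
static void set_psci_hvc_range(int vm_fd)
{
	struct kvm_arm_hvc_range range = {
		.slot     = 0,
		.flags    = KVM_ARM_HVC_USER,	/* enable the range */
		.imm      = 0,
		.fn_start = 0x84000000,		/* example: PSCI 32-bit calls */
		.fn_end   = 0x8400001f,
	};
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VM_HVC_CTRL,	/* made-up group name */
		.attr  = KVM_ARM_VM_HVC_SET_RANGE,
		.addr  = (__u64)&range,
	};

	if (ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &attr))
		err(1, "KVM_ARM_VM_HVC_SET_RANGE");
}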

Thanks,
Jean


Re: [PATCH 13/14] KVM: arm64: Restrict hyp stage-1 manipulation in protected mode

2021-07-21 Thread Quentin Perret
Hi Fuad,

On Wednesday 21 Jul 2021 at 11:45:53 (+0100), Fuad Tabba wrote:
> > +static int hyp_range_is_shared_walker(u64 addr, u64 end, u32 level,
> > + kvm_pte_t *ptep,
> > + enum kvm_pgtable_walk_flags flag,
> > + void * const arg)
> > +{
> > +   enum kvm_pgtable_prot prot;
> > +   kvm_pte_t pte = *ptep;
> > +
> > +   if (!kvm_pte_valid(pte))
> > +   return -EPERM;
> > +
> > +   prot = kvm_pgtable_hyp_pte_prot(pte);
> > +   if (!prot)
> > +   return -EPERM;
> nit: is this check necessary?

I guess not, the next one should do, I'll remove :)

> > +   /* Check that the page has been shared with the hypervisor before */
> > +   if (prot != (PAGE_HYP | KVM_PGTABLE_STATE_SHARED | 
> > KVM_PGTABLE_STATE_BORROWED))
> > +   return -EPERM;
> > +
> > +   return 0;
> > +}
> > +
> > +static int hyp_range_is_shared(phys_addr_t start, phys_addr_t end)
> > +{
> > +   struct kvm_pgtable_walker walker = {
> > +   .cb = hyp_range_is_shared_walker,
> > +   .flags = KVM_PGTABLE_WALK_LEAF,
> > +   };
> > +
> > +   return kvm_pgtable_walk(&pkvm_pgtable, (u64)__hyp_va(start),
> > +   end - start, &walker);
> > +}
> > +
> > +static int check_host_share_hyp_walker(u64 addr, u64 end, u32 level,
> 
> nit: It seems the convention is usually addr,size or start,end. Here
> you're using addr,end.

Well for some reason all the walkers in pgtable.c follow the addr,end
convention, so I followed that. But in fact, as per [1] I might actually
get rid of this walker in v2, so hopefully that'll make the issue go
away.

[1] https://lore.kernel.org/kvmarm/ypbwmvk1yd9+y...@google.com/

> > +  kvm_pte_t *ptep,
> > +  enum kvm_pgtable_walk_flags flag,
> > +  void * const arg)
> > +{
> > +   enum kvm_pgtable_prot prot;
> > +   kvm_pte_t pte = *ptep;
> > +
> > +   /* If invalid, only allow to share pristine pages */
> > +   if (!kvm_pte_valid(pte))
> > +   return pte ? -EPERM : 0;
> > +
> > +   prot = kvm_pgtable_stage2_pte_prot(pte);
> > +   if (!prot)
> > +   return -EPERM;
> > +
> > +   /* Cannot share a page that is not owned */
> > +   if (prot & KVM_PGTABLE_STATE_BORROWED)
> > +   return -EPERM;
> > +
> > +   /* Cannot share a page with restricted access */
> > +   if ((prot & KVM_PGTABLE_PROT_RWX) ^ KVM_PGTABLE_PROT_RWX)
> nit: isn't this clearer as
> 
> if ((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RWX)

I guess it would be, I'll fix it up.

> > +   return -EPERM;
> > +
> > +   /* Allow double-sharing (requires cross-checking the hyp stage-1) */
> > +   if (prot & KVM_PGTABLE_STATE_SHARED)
> > +   return hyp_range_is_shared(addr, addr + 1);
> 
> Why addr+1, rather than end?

Because 'end' here is the 'end' that was passed to kvm_pgtable_walk()
IIRC. What I want to do here is check if the page I'm currently visiting
is already shared and if so, that it is shared with the hypervisor. But
it's possible that only one page in the range of pages passed to
__pkvm_host_share_hyp is already shared, so I need to check them one by
one.

Anyways, as per the discussion with Marc on [2], I'll probably switch to
only accept sharing one page at a time, so all these issues should just
go away as well!

[2] https://lore.kernel.org/kvmarm/ypa6bguufjw8d...@google.com/

> > +
> > +   return 0;
> > +}
> > +
> > +static int check_host_share_hyp(phys_addr_t start, phys_addr_t end)
> > +{
> > +   struct kvm_pgtable_walker walker = {
> > +   .cb = check_host_share_hyp_walker,
> > +   .flags = KVM_PGTABLE_WALK_LEAF,
> > +   };
> > +
> > +   return kvm_pgtable_walk(&host_kvm.pgt, start, end - start, &walker);
> > +}
> > +
> > +int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end)
> > +{
> > +   enum kvm_pgtable_prot prot;
> > +   int ret;
> > +
> > +   if (!range_is_memory(start, end))
> > +   return -EINVAL;
> > +
> > +   hyp_spin_lock(&host_kvm.lock);
> > +   hyp_spin_lock(&pkvm_pgd_lock);
> > +
> > +   ret = check_host_share_hyp(start, end);
> > +   if (ret)
> > +   goto unlock;
> > +
> > +   prot = KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_SHARED;
> > +   ret = host_stage2_idmap_locked(start, end, prot);
> 
> Just for me to understand this better. The id mapping here, which
> wasn't taking place before this patch, is for tracking, right?

Yes, exactly, I want to make sure to mark the page as shared (and
borrowed) in the relevant page-tables to not forget about it :)

Cheers,
Quentin


Re: [PATCH v3 11/12] selftests: KVM: Test physical counter offsetting

2021-07-21 Thread Andrew Jones
On Mon, Jul 19, 2021 at 06:49:48PM +, Oliver Upton wrote:
> Test that userspace adjustment of the guest physical counter-timer
> results in the correct view of the counter within the guest.
> 
> Signed-off-by: Oliver Upton 
> ---
>  .../selftests/kvm/include/aarch64/processor.h | 12 
>  .../kvm/system_counter_offset_test.c  | 29 ---
>  2 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/aarch64/processor.h 
> b/tools/testing/selftests/kvm/include/aarch64/processor.h
> index 3168cdbae6ee..7f53d90e9512 100644
> --- a/tools/testing/selftests/kvm/include/aarch64/processor.h
> +++ b/tools/testing/selftests/kvm/include/aarch64/processor.h
> @@ -141,4 +141,16 @@ static inline uint64_t read_cntvct_ordered(void)
>   return r;
>  }
>  
> +static inline uint64_t read_cntpct_ordered(void)
> +{
> + uint64_t r;
> +
> + __asm__ __volatile__("isb\n\t"
> +  "mrs %0, cntpct_el0\n\t"
> +  "isb\n\t"
> +  : "=r"(r));
> +
> + return r;
> +}
> +
>  #endif /* SELFTEST_KVM_PROCESSOR_H */
> diff --git a/tools/testing/selftests/kvm/system_counter_offset_test.c 
> b/tools/testing/selftests/kvm/system_counter_offset_test.c
> index 88ad997f5b69..3eed9dcb7693 100644
> --- a/tools/testing/selftests/kvm/system_counter_offset_test.c
> +++ b/tools/testing/selftests/kvm/system_counter_offset_test.c
> @@ -57,6 +57,7 @@ static uint64_t host_read_guest_system_counter(struct 
> test_case *test)
>  
>  enum arch_counter {
>   VIRTUAL,
> + PHYSICAL,
>  };
>  
>  struct test_case {
> @@ -68,23 +69,41 @@ static struct test_case test_cases[] = {
>   { .counter = VIRTUAL, .offset = 0 },
>   { .counter = VIRTUAL, .offset = 180 * NSEC_PER_SEC },
>   { .counter = VIRTUAL, .offset = -180 * NSEC_PER_SEC },
> + { .counter = PHYSICAL, .offset = 0 },
> + { .counter = PHYSICAL, .offset = 180 * NSEC_PER_SEC },
> + { .counter = PHYSICAL, .offset = -180 * NSEC_PER_SEC },
>  };
>  
>  static void check_preconditions(struct kvm_vm *vm)
>  {
>   if (!_vcpu_has_device_attr(vm, VCPU_ID, KVM_ARM_VCPU_TIMER_CTRL,
> -KVM_ARM_VCPU_TIMER_OFFSET_VTIMER))
> +KVM_ARM_VCPU_TIMER_OFFSET_VTIMER) &&
> + !_vcpu_has_device_attr(vm, VCPU_ID, KVM_ARM_VCPU_TIMER_CTRL,
> +KVM_ARM_VCPU_TIMER_OFFSET_PTIMER))
>   return;
>  
> - print_skip("KVM_ARM_VCPU_TIMER_OFFSET_VTIMER not supported; skipping 
> test");
> + print_skip("KVM_ARM_VCPU_TIMER_OFFSET_{VTIMER,PTIMER} not supported; 
> skipping test");
>   exit(KSFT_SKIP);
>  }
>  
>  static void setup_system_counter(struct kvm_vm *vm, struct test_case *test)
>  {
> + u64 attr = 0;
> +
> + switch (test->counter) {
> + case VIRTUAL:
> + attr = KVM_ARM_VCPU_TIMER_OFFSET_VTIMER;
> + break;
> + case PHYSICAL:
> + attr = KVM_ARM_VCPU_TIMER_OFFSET_PTIMER;
> + break;
> + default:
> + TEST_ASSERT(false, "unrecognized counter index %u",
> + test->counter);
> + }
> +
>   vcpu_access_device_attr(vm, VCPU_ID, KVM_ARM_VCPU_TIMER_CTRL,
> - KVM_ARM_VCPU_TIMER_OFFSET_VTIMER, &test->offset,
> - true);
> + attr, &test->offset, true);
>  }
>  
>  static uint64_t guest_read_system_counter(struct test_case *test)
> @@ -92,6 +111,8 @@ static uint64_t guest_read_system_counter(struct test_case 
> *test)
>   switch (test->counter) {
>   case VIRTUAL:
>   return read_cntvct_ordered();
> + case PHYSICAL:
> + return read_cntpct_ordered();
>   default:
>   GUEST_ASSERT(0);
>   }
> -- 
> 2.32.0.402.g57bb445576-goog
> 
> ___
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

 
Reviewed-by: Andrew Jones 



Re: [PATCH v3 10/12] KVM: arm64: Provide userspace access to the physical counter offset

2021-07-21 Thread Andrew Jones
On Mon, Jul 19, 2021 at 06:49:47PM +, Oliver Upton wrote:
> Presently, KVM provides no facilities for correctly migrating a guest
> that depends on the physical counter-timer. While most guests (barring
> NV, of course) should not depend on the physical counter-timer, an
> operator may still wish to provide a consistent view of the physical
> counter-timer across migrations.
> 
> Provide userspace with a new vCPU attribute to modify the guest physical
> counter-timer offset. Since the base architecture doesn't provide a
> physical counter-timer offset register, emulate the correct behavior by
> trapping accesses to the physical counter-timer whenever the offset
> value is non-zero.
> 
> Uphold the same behavior as CNTVOFF_EL2 and broadcast the physical
> offset to all vCPUs whenever written. This guarantees that the
> counter-timer we provide the guest remains architectural, wherein all
> views of the counter-timer are consistent across vCPUs. Reconfigure
> timer traps for VHE on every guest entry, as different VMs will now have
> different traps enabled. Enable physical counter traps for nVHE whenever
> the offset is nonzero (we already trap physical timer registers in
> nVHE).
> 
> FEAT_ECV provides a guest physical counter-timer offset register
> (CNTPOFF_EL2), but ECV-enabled hardware is nonexistent at the time of
> writing so support for it was elided for the sake of the author :)
> 
> Signed-off-by: Oliver Upton 
> ---
>  Documentation/virt/kvm/devices/vcpu.rst   | 22 ++
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/include/asm/kvm_hyp.h  |  2 -
>  arch/arm64/include/asm/sysreg.h   |  1 +
>  arch/arm64/include/uapi/asm/kvm.h |  1 +
>  arch/arm64/kvm/arch_timer.c   | 50 ---
>  arch/arm64/kvm/arm.c  |  4 +-
>  arch/arm64/kvm/hyp/include/hyp/switch.h   | 23 +++
>  arch/arm64/kvm/hyp/include/hyp/timer-sr.h | 26 
>  arch/arm64/kvm/hyp/nvhe/switch.c  |  2 -
>  arch/arm64/kvm/hyp/nvhe/timer-sr.c| 21 +-
>  arch/arm64/kvm/hyp/vhe/timer-sr.c | 27 
>  include/kvm/arm_arch_timer.h  |  2 -
>  13 files changed, 158 insertions(+), 24 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/hyp/timer-sr.h
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst 
> b/Documentation/virt/kvm/devices/vcpu.rst
> index 7b57cba3416a..a8547ce09b47 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,6 +161,28 @@ the following equation:
>  KVM does not allow the use of varying offset values for different vCPUs;
>  the last written offset value will be broadcasted to all vCPUs in a VM.
>  
> +2.3. ATTRIBUTE: KVM_ARM_VCPU_TIMER_OFFSET_PTIMER
> +
> +
> +:Parameters: Pointer to a 64-bit unsigned counter-timer offset.
> +
> +Returns:
> +
> + === ==
> + -EFAULT Error reading/writing the provided
> + parameter address
> + -ENXIO  Attribute not supported
> + === ==
> +
> +Specifies the guest's physical counter-timer offset from the host's
> +virtual counter. The guest's physical counter is then derived by
> +the following equation:
> +
> +  guest_cntpct = host_cntvct - KVM_ARM_VCPU_TIMER_OFFSET_PTIMER
> +
> +KVM does not allow the use of varying offset values for different vCPUs;
> +the last written offset value will be broadcasted to all vCPUs in a VM.
> +
>  3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
>  ==
>  
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 41911585ae0c..de92fa678924 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -204,6 +204,7 @@ enum vcpu_sysreg {
>   SP_EL1,
>   SPSR_EL1,
>  
> + CNTPOFF_EL2,
>   CNTVOFF_EL2,
>   CNTV_CVAL_EL0,
>   CNTV_CTL_EL0,
> diff --git a/arch/arm64/include/asm/kvm_hyp.h 
> b/arch/arm64/include/asm/kvm_hyp.h
> index 9d60b3006efc..01eb3864e50f 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -65,10 +65,8 @@ void __vgic_v3_save_aprs(struct vgic_v3_cpu_if *cpu_if);
>  void __vgic_v3_restore_aprs(struct vgic_v3_cpu_if *cpu_if);
>  int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
>  
> -#ifdef __KVM_NVHE_HYPERVISOR__
>  void __timer_enable_traps(struct kvm_vcpu *vcpu);
>  void __timer_disable_traps(struct kvm_vcpu *vcpu);
> -#endif
>  
>  #ifdef __KVM_NVHE_HYPERVISOR__
>  void __sysreg_save_state_nvhe(struct kvm_cpu_context *ctxt);
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index 347ccac2341e..243e36c088e7 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -505,6 +505,7 @@
>  #define 
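
To make the intended use concrete, a hedged sketch of the VMM side during
restore. The group and attribute names follow this patch (the PTIMER offset
attribute is proposed here, not in mainline headers); the helper shape, the
saved counter value and the direct cntvct_el0 read are assumptions about the
surrounding VMM code.

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch only: solve the document's equation
 *   guest_cntpct = host_cntvct - offset
 * for the offset, then program it through the proposed attribute.
 */
static void restore_guest_cntpct(int vcpu_fd, uint64_t saved_guest_cntpct)
{
	uint64_t host_cntvct, offset;
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_TIMER_CTRL,
		.attr  = KVM_ARM_VCPU_TIMER_OFFSET_PTIMER,
		.addr  = (uint64_t)&offset,
	};

	/* Host's virtual counter, as visible to EL0 userspace */
	asm volatile("isb\n\tmrs %0, cntvct_el0\n\tisb" : "=r"(host_cntvct));

	offset = host_cntvct - saved_guest_cntpct;

	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
		perror("KVM_ARM_VCPU_TIMER_OFFSET_PTIMER");
}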

Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size

2021-07-21 Thread Alexandru Elisei
Hi Sean,

Thank you for writing this, it explains exactly what I wanted to know.

On 7/20/21 9:33 PM, Sean Christopherson wrote:
> On Tue, Jul 20, 2021, Alexandru Elisei wrote:
>> Hi Marc,
>>
>> I just can't figure out why having the mmap lock is not needed to walk the
>> userspace page tables. Any hints? Or am I not seeing where it's taken?
> Disclaimer: I'm not super familiar with arm64's page tables, but the relevant 
> KVM
> functionality is common across x86 and arm64.
>
> KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
> associated with the VM, and disallows calling ioctls from a different process,
> i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for 
> which
> KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
> does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures 
> the
> mm_struct itself is live.
>
> For the page tables liveliness, KVM implements mmu_notifier_ops.release, 
> which is
> invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> its implementation, KVM takes mmu_lock and zaps all its shadow page tables, 
> a.k.a.
> the stage2 tables in KVM arm64.  The flow in question, 
> get_user_mapping_size(),
> also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> guaranteed to run with live userspace tables.
>
> Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.  
> KVM's
> invalidate_range implementations also take mmu_lock, and also update a 
> sequence
> counter and a flag stating that there's an invalidation in progress.  When
> installing a stage2 entry, KVM snapshots the sequence counter before taking
> mmu_lock, and then checks it again after acquiring mmu_lock.  If the counter
> mismatches, or an invalidation is in-progress, then KVM bails and resumes the
> guest without fixing the fault.
>
> E.g. if the host zaps userspace page tables and KVM "wins" the race, the 
> subsequent
> kvm_mmu_notifier_invalidate_range_start() will zap the recently installed 
> stage2
> entries.  And if the host zap "wins" the race, KVM will resume the guest, 
> which
> in normal operation will hit the exception again and go back through the 
> entire
> process of installing stage2 entries.
>
> Looking at the arm64 code, one thing I'm not clear on is whether arm64 
> correctly
> handles the case where exit_mmap() wins the race.  The invalidate_range hooks 
> will
> still be called, so userspace page tables aren't a problem, but
> kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt 
> without
> any additional notifications that I see.  x86 deals with this by ensuring its
> top-level TDP entry (stage2 equivalent) is valid while the page fault handler 
> is
> running.
>
>   void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>   {
>   struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>   struct kvm_pgtable *pgt = NULL;
>
>   spin_lock(&kvm->mmu_lock);
>   pgt = mmu->pgt;
>   if (pgt) {
>   mmu->pgd_phys = 0;
>   mmu->pgt = NULL;
>   free_percpu(mmu->last_vcpu_ran);
>   }
>   spin_unlock(&kvm->mmu_lock);
>
>   ...
>   }
>
> AFAICT, nothing in user_mem_abort() would prevent consuming that null mmu->pgt
> if exit_mmap() collidied with user_mem_abort().
>
>   static int user_mem_abort(...)
>   {
>
>   ...
>
>   spin_lock(&kvm->mmu_lock);
>   pgt = vcpu->arch.hw_mmu->pgt; <-- hw_mmu->pgt may be NULL 
> (hw_mmu points at vcpu->kvm->arch.mmu)
>   if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to 
> change
>   goto out_unlock;
>
>   ...
>
>   if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
>   ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
>   } else {
>   ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
>__pfn_to_phys(pfn), prot,
>memcache);
>   }
>   }


Re: [PATCH v2 3/3] kvm/arm: Align the VMID allocation with the arm64 ASID one

2021-07-21 Thread Will Deacon
[+Quentin]

On Wed, Jun 16, 2021 at 04:56:06PM +0100, Shameer Kolothum wrote:
> From: Julien Grall 
> 
> At the moment, the VMID algorithm will send an SGI to all the CPUs to
> force an exit and then broadcast a full TLB flush and I-Cache
> invalidation.
> 
> This patch use the new VMID allocator. The
> benefits are:
> - CPUs are not forced to exit at roll-over. Instead the VMID will be
> marked reserved and the context will be flushed at next exit. This
> will reduce the IPIs traffic.
> - Context invalidation is now per-CPU rather than broadcasted.
> - Catalin has a formal model of the ASID allocator.
> 
> With the new algo, the code is now adapted:
> - The function __kvm_flush_vm_context() has been renamed to
> __kvm_tlb_flush_local_all() and now only flushing the current CPU
> context.
> - The call to update_vmid() will be done with preemption disabled
> as the new algo requires to store information per-CPU.
> - The TLBs associated to EL1 will be flushed when booting a CPU to
> deal with stale information. This was previously done on the
> allocation of the first VMID of a new generation.
> 
> Signed-off-by: Julien Grall 
> Signed-off-by: Shameer Kolothum 
> ---
>  arch/arm64/include/asm/kvm_asm.h  |   4 +-
>  arch/arm64/include/asm/kvm_host.h |   6 +-
>  arch/arm64/include/asm/kvm_mmu.h  |   3 +-
>  arch/arm64/kvm/Makefile   |   2 +-
>  arch/arm64/kvm/arm.c  | 115 +++---
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c|   6 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |   3 +-
>  arch/arm64/kvm/hyp/nvhe/tlb.c |  10 +--
>  arch/arm64/kvm/hyp/vhe/tlb.c  |  10 +--
>  arch/arm64/kvm/mmu.c  |   1 -
>  10 files changed, 52 insertions(+), 108 deletions(-)

[...]

> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 75a7e8071012..d96284da8571 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -70,9 +70,7 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>  void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
>  
>  struct kvm_vmid {
> - /* The VMID generation used for the virt. memory system */
> - u64vmid_gen;
> - u32vmid;
> + atomic64_t id;

Maybe a typedef would be better if this is the only member of the structure?

> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c 
> b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 4b60c0056c04..a02c4877a055 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -106,8 +106,7 @@ int kvm_host_prepare_stage2(void *mem_pgt_pool, void 
> *dev_pgt_pool)
>   mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd);
>   mmu->arch = &host_kvm.arch;
>   mmu->pgt = &host_kvm.pgt;
> - mmu->vmid.vmid_gen = 0;
> - mmu->vmid.vmid = 0;
> + atomic64_set(&mmu->vmid.id, 0);

I think this is the first atomic64 use in the EL2 object, which may pull in
some fatal KCSAN instrumentation. Quentin, have you run into this before?

Might be simple just to zero-initialise mmu for now, if it isn't already.

> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index 83dc3b271bc5..42df9931ed9a 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -140,10 +140,10 @@ void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu)
>   __tlb_switch_to_host();
>  }
>  
> -void __kvm_flush_vm_context(void)
> +void __kvm_tlb_flush_local_all(void)
>  {
> - dsb(ishst);
> - __tlbi(alle1is);
> + dsb(nshst);
> + __tlbi(alle1);
>  
>   /*
>* VIPT and PIPT caches are not affected by VMID, so no maintenance
> @@ -155,7 +155,7 @@ void __kvm_flush_vm_context(void)
>*
>*/
>   if (icache_is_vpipt())
> - asm volatile("ic ialluis");
> + asm volatile("ic iallu" : : );
>  
> - dsb(ish);
> + dsb(nsh);

Hmm, I'm wondering whether having this local stuff really makes sense for
VMIDs. For ASIDs, where rollover can be frequent and TLBI could result in
IPI on 32-bit, the local option was important, but here rollover is less
frequent, DVM is relied upon to work and the cost of a hypercall is
significant with nVHE.

So I do think you could simplify patch 2 slightly to drop the
flush_pending and just issue inner-shareable invalidation on rollover.
With that, it might also make it straightforward to clear active_asids
when scheduling out a vCPU, which would solve the other problem I mentioned
about unnecessarily reserving a bunch of the VMID space.

Will


Re: [PATCH v3 09/12] selftests: KVM: Add support for aarch64 to system_counter_offset_test

2021-07-21 Thread Andrew Jones
On Mon, Jul 19, 2021 at 06:49:46PM +, Oliver Upton wrote:
> KVM/arm64 now allows userspace to adjust the guest virtual counter-timer
> via a vCPU device attribute. Test that changes to the virtual
> counter-timer offset result in the correct view being presented to the
> guest.
> 
> Signed-off-by: Oliver Upton 
> ---
>  tools/testing/selftests/kvm/Makefile  |  1 +
>  .../selftests/kvm/include/aarch64/processor.h | 12 +
>  .../kvm/system_counter_offset_test.c  | 54 ++-
>  3 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/kvm/Makefile 
> b/tools/testing/selftests/kvm/Makefile
> index 7bf2e5fb1d5a..d89908108c97 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -96,6 +96,7 @@ TEST_GEN_PROGS_aarch64 += kvm_page_table_test
>  TEST_GEN_PROGS_aarch64 += set_memory_region_test
>  TEST_GEN_PROGS_aarch64 += steal_time
>  TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
> +TEST_GEN_PROGS_aarch64 += system_counter_offset_test
>  
>  TEST_GEN_PROGS_s390x = s390x/memop
>  TEST_GEN_PROGS_s390x += s390x/resets
> diff --git a/tools/testing/selftests/kvm/include/aarch64/processor.h 
> b/tools/testing/selftests/kvm/include/aarch64/processor.h
> index 27dc5c2e56b9..3168cdbae6ee 100644
> --- a/tools/testing/selftests/kvm/include/aarch64/processor.h
> +++ b/tools/testing/selftests/kvm/include/aarch64/processor.h
> @@ -129,4 +129,16 @@ void vm_install_sync_handler(struct kvm_vm *vm,
>  
>  #define isb()		asm volatile("isb" : : : "memory")
>  
> +static inline uint64_t read_cntvct_ordered(void)
> +{
> + uint64_t r;
> +
> + __asm__ __volatile__("isb\n\t"
> +  "mrs %0, cntvct_el0\n\t"
> +  "isb\n\t"
> +  : "=r"(r));
> +
> + return r;
> +}
> +
>  #endif /* SELFTEST_KVM_PROCESSOR_H */
> diff --git a/tools/testing/selftests/kvm/system_counter_offset_test.c 
> b/tools/testing/selftests/kvm/system_counter_offset_test.c
> index 7e9015770759..88ad997f5b69 100644
> --- a/tools/testing/selftests/kvm/system_counter_offset_test.c
> +++ b/tools/testing/selftests/kvm/system_counter_offset_test.c
> @@ -53,7 +53,59 @@ static uint64_t host_read_guest_system_counter(struct 
> test_case *test)
>   return rdtsc() + test->tsc_offset;
>  }
>  
> -#else /* __x86_64__ */
> +#elif __aarch64__ /* __x86_64__ */
> +
> +enum arch_counter {
> + VIRTUAL,
> +};
> +
> +struct test_case {
> + enum arch_counter counter;
> + uint64_t offset;
> +};
> +
> +static struct test_case test_cases[] = {
> + { .counter = VIRTUAL, .offset = 0 },
> + { .counter = VIRTUAL, .offset = 180 * NSEC_PER_SEC },
> + { .counter = VIRTUAL, .offset = -180 * NSEC_PER_SEC },
> +};
> +
> +static void check_preconditions(struct kvm_vm *vm)
> +{
> + if (!_vcpu_has_device_attr(vm, VCPU_ID, KVM_ARM_VCPU_TIMER_CTRL,
> +KVM_ARM_VCPU_TIMER_OFFSET_VTIMER))
> + return;
> +
> + print_skip("KVM_ARM_VCPU_TIMER_OFFSET_VTIMER not supported; skipping 
> test");
> + exit(KSFT_SKIP);
> +}
> +
> +static void setup_system_counter(struct kvm_vm *vm, struct test_case *test)
> +{
> + vcpu_access_device_attr(vm, VCPU_ID, KVM_ARM_VCPU_TIMER_CTRL,
> + KVM_ARM_VCPU_TIMER_OFFSET_VTIMER, &test->offset,
> + true);
> +}
> +
> +static uint64_t guest_read_system_counter(struct test_case *test)
> +{
> + switch (test->counter) {
> + case VIRTUAL:
> + return read_cntvct_ordered();
> + default:
> + GUEST_ASSERT(0);
> + }
> +
> + /* unreachable */
> + return 0;
> +}
> +
> +static uint64_t host_read_guest_system_counter(struct test_case *test)
> +{
> + return read_cntvct_ordered() - test->offset;
> +}
> +
> +#else /* __aarch64__ */
>  
>  #error test not implemented for this architecture!
>  
> -- 
> 2.32.0.402.g57bb445576-goog
> 
> ___
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

Reviewed-by: Andrew Jones 



Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Joel Fernandes
On Mon, Jul 12, 2021 at 12:24 PM Marc Zyngier  wrote:
>
[...]
> > +}
> > +
> > +static inline bool
> > +kvm_arm_is_vcpu_state_enabled(struct kvm_vcpu_arch *vcpu_arch)
> > +{
> > + return (vcpu_arch->vcpu_state.base != GPA_INVALID);
> > +}
> > +
> > +void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted);
> > +
> >  void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
> >
> >  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
> > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > index 989bb5dad2c8..2a3ee82c6d90 100644
> > --- a/arch/arm64/kvm/Makefile
> > +++ b/arch/arm64/kvm/Makefile
> > @@ -12,7 +12,8 @@ obj-$(CONFIG_KVM) += hyp/
> >
> >  kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
> >$(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
> > -  arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
> > +  arm.o mmu.o mmio.o psci.o perf.o hypercalls.o \
> > +  pvtime.o pv-vcpu-state.o \
> >inject_fault.o va_layout.o handle_exit.o \
> >guest.o debug.o reset.o sys_regs.o \
> >vgic-sys-reg-v3.o fpsimd.o pmu.o \
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index e9a2b8f27792..43e995c9fddb 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -332,6 +332,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >   kvm_arm_reset_debug_ptr(vcpu);
> >
> >   kvm_arm_pvtime_vcpu_init(&vcpu->arch);
> > + kvm_arm_vcpu_state_init(&vcpu->arch);
> >
> >   vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
> >
> > @@ -429,10 +430,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int 
> > cpu)
> >   if (vcpu_has_ptrauth(vcpu))
> >   vcpu_ptrauth_disable(vcpu);
> >   kvm_arch_vcpu_load_debug_state_flags(vcpu);
> > + kvm_update_vcpu_preempted(vcpu, false);
> >  }
> >
> >  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  {
> > + kvm_update_vcpu_preempted(vcpu, true);
>
> This doesn't look right. With this, you are now telling the guest that
> a vcpu that is blocked on WFI is preempted. This really isn't the
> case, as it has voluntarily entered a low-power mode while waiting for
> an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
> be running either.

Can that be cured by just checking vcpu->preempted before calling
kvm_update_vcpu_preempted() ?
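
Something along these lines, assuming the hooks from the patch above (a
sketch, not the actual patch code):

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	/*
	 * kvm_sched_out() only marks the vCPU preempted when the task is
	 * still runnable, so a vCPU blocked in WFI would not be reported
	 * as preempted here.
	 */
	if (vcpu->preempted)
		kvm_update_vcpu_preempted(vcpu, true);

	/* ... existing state save elided ... */
}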

- Joel


Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size

2021-07-21 Thread Sean Christopherson
On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> Hi Marc,
> 
> I just can't figure out why having the mmap lock is not needed to walk the
> userspace page tables. Any hints? Or am I not seeing where it's taken?

Disclaimer: I'm not super familiar with arm64's page tables, but the relevant 
KVM
functionality is common across x86 and arm64.

KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
associated with the VM, and disallows calling ioctls from a different process,
i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for 
which
KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures the
mm_struct itself is live.

For the page tables liveliness, KVM implements mmu_notifier_ops.release, which 
is
invoked at the beginning of exit_mmap(), before the page tables are freed.  In
its implementation, KVM takes mmu_lock and zaps all its shadow page tables, 
a.k.a.
the stage2 tables in KVM arm64.  The flow in question, get_user_mapping_size(),
also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
guaranteed to run with live userspace tables.

Lastly, KVM also implements mmu_notifier_ops.invalidate_range_{start,end}.  
KVM's
invalidate_range implementations also take mmu_lock, and also update a sequence
counter and a flag stating that there's an invalidation in progress.  When
installing a stage2 entry, KVM snapshots the sequence counter before taking
mmu_lock, and then checks it again after acquiring mmu_lock.  If the counter
mismatches, or an invalidation is in-progress, then KVM bails and resumes the
guest without fixing the fault.

E.g. if the host zaps userspace page tables and KVM "wins" the race, the 
subsequent
kvm_mmu_notifier_invalidate_range_start() will zap the recently installed stage2
entries.  And if the host zap "wins" the race, KVM will resume the guest, which
in normal operation will hit the exception again and go back through the entire
process of installing stage2 entries.

Looking at the arm64 code, one thing I'm not clear on is whether arm64 correctly
handles the case where exit_mmap() wins the race.  The invalidate_range hooks 
will
still be called, so userspace page tables aren't a problem, but
kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt without
any additional notifications that I see.  x86 deals with this by ensuring its
top-level TDP entry (stage2 equivalent) is valid while the page fault handler is
running.

  void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
  {
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;

	spin_lock(&kvm->mmu_lock);
pgt = mmu->pgt;
if (pgt) {
mmu->pgd_phys = 0;
mmu->pgt = NULL;
free_percpu(mmu->last_vcpu_ran);
}
	spin_unlock(&kvm->mmu_lock);

...
  }

AFAICT, nothing in user_mem_abort() would prevent consuming that null mmu->pgt
if exit_mmap() collidied with user_mem_abort().

  static int user_mem_abort(...)
  {

...

	spin_lock(&kvm->mmu_lock);
pgt = vcpu->arch.hw_mmu->pgt; <-- hw_mmu->pgt may be NULL 
(hw_mmu points at vcpu->kvm->arch.mmu)
if (mmu_notifier_retry(kvm, mmu_seq)) <-- mmu_seq not guaranteed to 
change
goto out_unlock;

...

if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
} else {
ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 __pfn_to_phys(pfn), prot,
 memcache);
}
  }


Re: [PATCH v3 08/12] KVM: arm64: Allow userspace to configure a vCPU's virtual offset

2021-07-21 Thread Andrew Jones
On Mon, Jul 19, 2021 at 06:49:45PM +, Oliver Upton wrote:
> Add a new vCPU attribute that allows userspace to directly manipulate
> the virtual counter-timer offset. Exposing such an interface allows for
> the precise migration of guest virtual counter-timers, as it is an
> idempotent interface.
> 
> Uphold the existing behavior of writes to CNTVOFF_EL2 for this new
> interface, wherein a write to a single vCPU is broadcasted to all vCPUs
> within a VM.
> 
> Signed-off-by: Oliver Upton 
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 22 
>  arch/arm64/include/uapi/asm/kvm.h   |  1 +
>  arch/arm64/kvm/arch_timer.c | 68 -
>  3 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst 
> b/Documentation/virt/kvm/devices/vcpu.rst
> index b46d5f742e69..7b57cba3416a 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -139,6 +139,28 @@ configured values on other VCPUs.  Userspace should 
> configure the interrupt
>  numbers on at least one VCPU after creating all VCPUs and before running any
>  VCPUs.
>  
> +2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_OFFSET_VTIMER
> +
> +
> +:Parameters: Pointer to a 64-bit unsigned counter-timer offset.
> +
> +Returns:
> +
> +  === ==
> +  -EFAULT Error reading/writing the provided
> +  parameter address
> +  -ENXIO  Attribute not supported
> +  === ==
> +
> +Specifies the guest's virtual counter-timer offset from the host's
> +virtual counter. The guest's virtual counter is then derived by
> +the following equation:
> +
> +  guest_cntvct = host_cntvct - KVM_ARM_VCPU_TIMER_OFFSET_VTIMER
> +
> +KVM does not allow the use of varying offset values for different vCPUs;
> +the last written offset value will be broadcasted to all vCPUs in a VM.
> +
>  3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
>  ==
>  
> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> b/arch/arm64/include/uapi/asm/kvm.h
> index b3edde68bc3e..008d0518d2b1 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -365,6 +365,7 @@ struct kvm_arm_copy_mte_tags {
>  #define KVM_ARM_VCPU_TIMER_CTRL  1
>  #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER  0
>  #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER  1
> +#define   KVM_ARM_VCPU_TIMER_OFFSET_VTIMER   2
>  #define KVM_ARM_VCPU_PVTIME_CTRL 2
>  #define   KVM_ARM_VCPU_PVTIME_IPA0
>  
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..d2b1b13af658 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -1305,7 +1305,7 @@ static void set_timer_irqs(struct kvm *kvm, int 
> vtimer_irq, int ptimer_irq)
>   }
>  }
>  
> -int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr 
> *attr)
> +int kvm_arm_timer_set_attr_irq(struct kvm_vcpu *vcpu, struct kvm_device_attr 
> *attr)
>  {
>   int __user *uaddr = (int __user *)(long)attr->addr;
>   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> @@ -1338,7 +1338,39 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, 
> struct kvm_device_attr *attr)
>   return 0;
>  }
>  
> -int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr 
> *attr)
> +int kvm_arm_timer_set_attr_offset(struct kvm_vcpu *vcpu, struct 
> kvm_device_attr *attr)
> +{
> + u64 __user *uaddr = (u64 __user *)(long)attr->addr;
> + u64 offset;
> +
> + if (get_user(offset, uaddr))
> + return -EFAULT;
> +
> + switch (attr->attr) {
> + case KVM_ARM_VCPU_TIMER_OFFSET_VTIMER:
> + update_vtimer_cntvoff(vcpu, offset);
> + break;
> + default:
> + return -ENXIO;
> + }
> +
> + return 0;
> +}
> +
> +int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr 
> *attr)
> +{
> + switch (attr->attr) {
> + case KVM_ARM_VCPU_TIMER_IRQ_VTIMER:
> + case KVM_ARM_VCPU_TIMER_IRQ_PTIMER:
> + return kvm_arm_timer_set_attr_irq(vcpu, attr);
> + case KVM_ARM_VCPU_TIMER_OFFSET_VTIMER:
> + return kvm_arm_timer_set_attr_offset(vcpu, attr);
> + }
> +
> + return -ENXIO;
> +}
> +
> +int kvm_arm_timer_get_attr_irq(struct kvm_vcpu *vcpu, struct kvm_device_attr 
> *attr)
>  {
>   int __user *uaddr = (int __user *)(long)attr->addr;
>   struct arch_timer_context *timer;
> @@ -1359,11 +1391,43 @@ int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, 
> struct kvm_device_attr *attr)
>   return put_user(irq, uaddr);
>  }
>  
> +int kvm_arm_timer_get_attr_offset(struct kvm_vcpu *vcpu, struct 
> kvm_device_attr *attr)
> +{
> + u64 __user *uaddr = (u64 __user *)(long)attr->addr;
> + struct arch_timer_context *timer;
> + u64 offset;
> +
> + 

Re: [PATCH v2 2/3] kvm/arm: Introduce a new vmid allocator for KVM

2021-07-21 Thread Will Deacon
On Wed, Jun 16, 2021 at 04:56:05PM +0100, Shameer Kolothum wrote:
> A new VMID allocator for arm64 KVM use. This is based on
> arm64 asid allocator algorithm.
> 
> Signed-off-by: Shameer Kolothum 
> ---
>  arch/arm64/include/asm/kvm_host.h |   4 +
>  arch/arm64/kvm/vmid.c | 206 ++
>  2 files changed, 210 insertions(+)
>  create mode 100644 arch/arm64/kvm/vmid.c

Generally, I prefer this to the alternative of creating a library. However,
I'd probably remove all the duplicated comments in favour of a reference
to the ASID allocator. That way, we can just comment any VMID-specific
behaviour in here.

Some comments below...

> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 7cd7d5c8c4bc..75a7e8071012 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -680,6 +680,10 @@ int kvm_arm_pvtime_get_attr(struct kvm_vcpu *vcpu,
>  int kvm_arm_pvtime_has_attr(struct kvm_vcpu *vcpu,
>   struct kvm_device_attr *attr);
>  
> +int kvm_arm_vmid_alloc_init(void);
> +void kvm_arm_vmid_alloc_free(void);
> +void kvm_arm_update_vmid(atomic64_t *id);
> +
>  static inline void kvm_arm_pvtime_vcpu_init(struct kvm_vcpu_arch *vcpu_arch)
>  {
>   vcpu_arch->steal.base = GPA_INVALID;
> diff --git a/arch/arm64/kvm/vmid.c b/arch/arm64/kvm/vmid.c
> new file mode 100644
> index ..687e18d33130
> --- /dev/null
> +++ b/arch/arm64/kvm/vmid.c
> @@ -0,0 +1,206 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * VMID allocator.
> + *
> + * Based on arch/arm64/mm/context.c
> + *
> + * Copyright (C) 2002-2003 Deep Blue Solutions Ltd, all rights reserved.
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +static u32 vmid_bits;
> +static DEFINE_RAW_SPINLOCK(cpu_vmid_lock);
> +
> +static atomic64_t vmid_generation;
> +static unsigned long *vmid_map;
> +
> +static DEFINE_PER_CPU(atomic64_t, active_vmids);
> +static DEFINE_PER_CPU(u64, reserved_vmids);
> +static cpumask_t tlb_flush_pending;
> +
> +#define VMID_MASK   (~GENMASK(vmid_bits - 1, 0))
> +#define VMID_FIRST_VERSION   (1UL << vmid_bits)
> +
> +#define NUM_USER_VMIDS   VMID_FIRST_VERSION
> +#define vmid2idx(vmid)   ((vmid) & ~VMID_MASK)
> +#define idx2vmid(idx)   vmid2idx(idx)
> +
> +#define vmid_gen_match(vmid) \
> + (!(((vmid) ^ atomic64_read(&vmid_generation)) >> vmid_bits))
> +
> +static void flush_context(void)
> +{
> + int cpu;
> + u64 vmid;
> +
> + bitmap_clear(vmid_map, 0, NUM_USER_VMIDS);
> +
> + for_each_possible_cpu(cpu) {
> + vmid = atomic64_xchg_relaxed(&per_cpu(active_vmids, cpu), 0);
> + /*
> +  * If this CPU has already been through a
> +  * rollover, but hasn't run another task in
> +  * the meantime, we must preserve its reserved
> +  * VMID, as this is the only trace we have of
> +  * the process it is still running.
> +  */
> + if (vmid == 0)
> + vmid = per_cpu(reserved_vmids, cpu);
> + __set_bit(vmid2idx(vmid), vmid_map);
> + per_cpu(reserved_vmids, cpu) = vmid;
> + }

Hmm, so here we're copying the active_vmids into the reserved_vmids on a
rollover, but I wonder if that's overly pessimistic? For the ASID
allocator, every CPU tends to have a current task so it makes sense, but
I'm not sure it's necessarily the case that every CPU tends to have a
vCPU as the current task. For example, imagine you have a nasty 128-CPU
system with 8-bit VMIDs and each CPU has at some point run a vCPU. Then,
on rollover, we'll immediately reserve half of the VMID space, even if
those vCPUs don't even exist any more.

Not sure if it's worth worrying about, but I wanted to mention it.
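
If it ever does become a problem, one possible mitigation (rough sketch,
entirely mine and untested) would be to drop a VM's reservations when it
goes away, so that dead vCPUs can't pin VMIDs across a rollover:

static void vmid_release(u64 vmid)
{
	unsigned long flags;
	int cpu;

	raw_spin_lock_irqsave(&cpu_vmid_lock, flags);
	for_each_possible_cpu(cpu) {
		/* Forget both the reservation and the active copy. */
		if (per_cpu(reserved_vmids, cpu) == vmid)
			per_cpu(reserved_vmids, cpu) = 0;
		atomic64_cmpxchg_relaxed(&per_cpu(active_vmids, cpu), vmid, 0);
	}
	raw_spin_unlock_irqrestore(&cpu_vmid_lock, flags);
}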

> +void kvm_arm_update_vmid(atomic64_t *id)
> +{

Take the kvm_vmid here? That would make:

> + /* Check that our VMID belongs to the current generation. */
> + vmid = atomic64_read(id);
> + if (!vmid_gen_match(vmid)) {
> + vmid = new_vmid(id);
> + atomic64_set(id, vmid);
> + }

A bit more readable, as you could pass the pointer directly to new_vmid
for initialisation.
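
i.e. something like this (sketch only, assuming new_vmid() is reworked to
take the struct and store the new value itself, and that struct kvm_vmid
carries the atomic64_t used here):

void kvm_arm_update_vmid(struct kvm_vmid *kvm_vmid)
{
	u64 vmid = atomic64_read(&kvm_vmid->id);

	/* Check that our VMID belongs to the current generation. */
	if (!vmid_gen_match(vmid))
		new_vmid(kvm_vmid);
}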

Will


Re: [PATCH v2 00/12] KVM: Add idempotent controls for migrating system counter state

2021-07-21 Thread Andrew Jones
On Fri, Jul 16, 2021 at 09:26:17PM +, Oliver Upton wrote:
> KVM's current means of saving/restoring system counters is plagued with
> temporal issues. At least on ARM64 and x86, we migrate the guest's
> system counter by-value through the respective guest system register
> values (cntvct_el0, ia32_tsc). Restoring system counters by-value is
> brittle as the state is not idempotent: the host system counter is still
> oscillating between the attempted save and restore. Furthermore, VMMs
> may wish to transparently live migrate guest VMs, meaning that they
> include the elapsed time due to live migration blackout in the guest
> system counter view. The VMM thread could be preempted for any number of
> reasons (scheduler, L0 hypervisor under nested) between the time that
> it calculates the desired guest counter value and when KVM actually sets
> this counter state.
> 
> Despite the value-based interface that we present to userspace, KVM
> actually has idempotent guest controls by way of system counter offsets.
> We can avoid all of the issues associated with a value-based interface
> by abstracting these offset controls in new ioctls. This series
> introduces new vCPU device attributes to provide userspace access to the
> vCPU's system counter offset.
> 
> Patch 1 adopts Paolo's suggestion, augmenting the KVM_{GET,SET}_CLOCK
> ioctls to provide userspace with a (host_tsc, realtime) instant. This is
> essential for a VMM to perform precise migration of the guest's system
> counters.
> 
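
(Aside: a rough sketch of how a VMM might drive this, mostly to check my
understanding -- my code, the attribute names are from this series but the
arithmetic is purely illustrative:)

static void restore_guest_tsc(int vcpu_fd, uint64_t guest_tsc_at_save,
			      uint64_t host_tsc_now)
{
	/* Idempotent: the offset stays valid however long the restore takes. */
	uint64_t offset = guest_tsc_at_save - host_tsc_now;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(uintptr_t)&offset,
	};

	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}
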
> Patches 2-3 add support for x86 by shoehorning the new controls into the
> pre-existing synchronization heuristics.
> 
> Patches 4-5 implement a test for the new additions to
> KVM_{GET,SET}_CLOCK.
> 
> Patches 6-7 implement at test for the tsc offset attribute introduced in
> patch 3.
> 
> Patch 8 adds a device attribute for the arm64 virtual counter-timer
> offset.
> 
> Patch 9 extends the test from patch 7 to cover the arm64 virtual
> counter-timer offset.
> 
> Patch 10 adds a device attribute for the arm64 physical counter-timer
> offset. Currently, this is implemented as a synthetic register, forcing
> the guest to trap to the host and emulating the offset in the fast exit
> path. Later down the line we will have hardware with FEAT_ECV, which
> allows the hypervisor to perform physical counter-timer offsetting in
> hardware (CNTPOFF_EL2).
> 
> Patch 11 extends the test from patch 7 to cover the arm64 physical
> counter-timer offset.
> 
> Patch 12 introduces a benchmark to measure the overhead of emulation in
> patch 10.
> 
> Physical counter benchmark
> --
> 
> The following data was collected by running 1 iterations of the
> benchmark test from Patch 6 on an Ampere Mt. Jade reference server, A 2S
> machine with 2 80-core Ampere Altra SoCs. Measurements were collected
> for both VHE and nVHE operation using the `kvm-arm.mode=` command-line
> parameter.
> 
> nVHE
> ----
> 
> +--------------------+--------+---------+
> |       Metric       | Native | Trapped |
> +--------------------+--------+---------+
> | Average            | 54ns   | 148ns   |
> | Standard Deviation | 124ns  | 122ns   |
> | 95th Percentile    | 258ns  | 348ns   |
> +--------------------+--------+---------+
> 
> VHE
> ---
> 
> +--------------------+--------+---------+
> |       Metric       | Native | Trapped |
> +--------------------+--------+---------+
> | Average            | 53ns   | 152ns   |
> | Standard Deviation | 92ns   | 94ns    |
> | 95th Percentile    | 204ns  | 307ns   |
> +--------------------+--------+---------+
> 
> This series applies cleanly to the following commit:
> 
> 1889228d80fe ("KVM: selftests: smm_test: Test SMM enter from L2")
> 
> v1 -> v2:
>   - Reimplemented as vCPU device attributes instead of a distinct ioctl.
>   - Added the (realtime, host_tsc) instant support to
> KVM_{GET,SET}_CLOCK
>   - Changed the arm64 implementation to broadcast counter offset values
> to all vCPUs in a guest. This upholds the architectural expectations
> of a consistent counter-timer across CPUs.
>   - Fixed a bug with traps in VHE mode. We now configure traps on every
> transition into a guest to handle differing VMs (trapped, emulated).
>

Oops, I see there's a v3 of this series. I'll switch to reviewing that. I
think my comments / r-b's apply to that version as well though.

Thanks,
drew 



Re: [PATCH v2 1/3] arch/arm64: Introduce a capability to tell whether 16-bit VMID is available

2021-07-21 Thread Will Deacon
On Wed, Jun 16, 2021 at 04:56:04PM +0100, Shameer Kolothum wrote:
> From: Julien Grall 
> 
> At the moment, the function kvm_get_vmid_bits() is looking up for the
> sanitized value of ID_AA64MMFR1_EL1 and extract the information
> regarding the number of VMID bits supported.
> 
> This is fine as the function is mainly used during VMID roll-over. New
> use in a follow-up patch will require the function to be called at every
> context switch so we want the function to be more efficient.
> 
> A new capability is introduced to tell whether 16-bit VMID is
> available.

I don't really buy this rationale. The VMID allocator introduced later on
caches this value in the static 'vmid_bits' variable, and that gets used
on vCPU enter via vmid_gen_match() in the kvm_arm_update_vmid() fastpath.

So I would prefer that we just expose an accessor for that rather than introduce
a static key and new cpufeature just for kvm_get_vttbr().
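
e.g. (sketch):

/* vmid.c */
unsigned int kvm_arm_vmid_bits(void)
{
	return vmid_bits;
}

and kvm_get_vttbr() could then call that directly.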

Will


Re: [PATCH v2 07/12] selftests: KVM: Introduce system counter offset test

2021-07-21 Thread Andrew Jones
On Fri, Jul 16, 2021 at 09:26:24PM +, Oliver Upton wrote:
> Introduce a KVM selftest to verify that userspace manipulation of the
> TSC (via the new vCPU attribute) results in the correct behavior within
> the guest.
> 
> Signed-off-by: Oliver Upton 
> ---
>  tools/testing/selftests/kvm/.gitignore|   1 +
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../kvm/system_counter_offset_test.c  | 133 ++
>  3 files changed, 135 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/system_counter_offset_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore 
> b/tools/testing/selftests/kvm/.gitignore
> index d0877d01e771..2752813d5090 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -50,3 +50,4 @@
>  /set_memory_region_test
>  /steal_time
>  /kvm_binary_stats_test
> +/system_counter_offset_test
> diff --git a/tools/testing/selftests/kvm/Makefile 
> b/tools/testing/selftests/kvm/Makefile
> index f7e24f334c6e..7bf2e5fb1d5a 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -83,6 +83,7 @@ TEST_GEN_PROGS_x86_64 += memslot_perf_test
>  TEST_GEN_PROGS_x86_64 += set_memory_region_test
>  TEST_GEN_PROGS_x86_64 += steal_time
>  TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
> +TEST_GEN_PROGS_x86_64 += system_counter_offset_test
>  
>  TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
>  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
> diff --git a/tools/testing/selftests/kvm/system_counter_offset_test.c 
> b/tools/testing/selftests/kvm/system_counter_offset_test.c
> new file mode 100644
> index ..7e9015770759
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/system_counter_offset_test.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021, Google LLC.
> + *
> + * Tests for adjusting the system counter from userspace
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +#define VCPU_ID 0
> +
> +#ifdef __x86_64__
> +
> +struct test_case {
> + uint64_t tsc_offset;
> +};
> +
> +static struct test_case test_cases[] = {
> + { 0 },
> + { 180 * NSEC_PER_SEC },
> + { -180 * NSEC_PER_SEC },
> +};
> +
> +static void check_preconditions(struct kvm_vm *vm)
> +{
> + if (!_vcpu_has_device_attr(vm, VCPU_ID, KVM_VCPU_TSC_CTRL, 
> KVM_VCPU_TSC_OFFSET))
> + return;
> +
> + print_skip("KVM_VCPU_TSC_OFFSET not supported; skipping test");
> + exit(KSFT_SKIP);
> +}
> +
> +static void setup_system_counter(struct kvm_vm *vm, struct test_case *test)
> +{
> + vcpu_access_device_attr(vm, VCPU_ID, KVM_VCPU_TSC_CTRL,
> + KVM_VCPU_TSC_OFFSET, &test->tsc_offset, true);
> +}
> +
> +static uint64_t guest_read_system_counter(struct test_case *test)
> +{
> + return rdtsc();
> +}
> +
> +static uint64_t host_read_guest_system_counter(struct test_case *test)
> +{
> + return rdtsc() + test->tsc_offset;
> +}
> +
> +#else /* __x86_64__ */
> +
> +#error test not implemented for this architecture!
> +
> +#endif
> +
> +#define GUEST_SYNC_CLOCK(__stage, __val) \
> + GUEST_SYNC_ARGS(__stage, __val, 0, 0, 0)
> +
> +static void guest_main(void)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> + struct test_case *test = &test_cases[i];
> +
> + GUEST_SYNC_CLOCK(i, guest_read_system_counter(test));
> + }
> +
> + GUEST_DONE();
> +}
> +
> +static void handle_sync(struct ucall *uc, uint64_t start, uint64_t end)
> +{
> + uint64_t obs = uc->args[2];
> +
> + TEST_ASSERT(start <= obs && obs <= end,
> + "unexpected system counter value: %"PRIu64" expected range: 
> [%"PRIu64", %"PRIu64"]",
> + obs, start, end);
> +
> + pr_info("system counter value: %"PRIu64" expected range [%"PRIu64", 
> %"PRIu64"]\n",
> + obs, start, end);
> +}
> +
> +static void handle_abort(struct ucall *uc)
> +{
> + TEST_FAIL("%s at %s:%ld", (const char *)uc->args[0],
> +   __FILE__, uc->args[1]);
> +}
> +
> +static void enter_guest(struct kvm_vm *vm)
> +{
> + uint64_t start, end;
> + struct ucall uc;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> + struct test_case *test = &test_cases[i];
> +
> + setup_system_counter(vm, test);
> + start = host_read_guest_system_counter(test);
> + vcpu_run(vm, VCPU_ID);
> + end = host_read_guest_system_counter(test);
> +
> + switch (get_ucall(vm, VCPU_ID, &uc)) {
> + case UCALL_SYNC:
> + handle_sync(&uc, start, end);
> + break;
> + case UCALL_ABORT:
> + handle_abort(&uc);
> + return;
> + case UCALL_DONE:

Re: [PATCH v2 06/12] selftests: KVM: Add helpers for vCPU device attributes

2021-07-21 Thread Andrew Jones
On Fri, Jul 16, 2021 at 09:26:23PM +, Oliver Upton wrote:
> vCPU file descriptors are abstracted away from test code in KVM
> selftests, meaning that tests cannot directly access a vCPU's device
> attributes. Add helpers that tests can use to get at vCPU device
> attributes.
> 
> Signed-off-by: Oliver Upton 
> ---
>  .../testing/selftests/kvm/include/kvm_util.h  |  9 +
>  tools/testing/selftests/kvm/lib/kvm_util.c| 38 +++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
> b/tools/testing/selftests/kvm/include/kvm_util.h
> index a8ac5d52e17b..1b3ef5757819 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -240,6 +240,15 @@ int _kvm_device_access(int dev_fd, uint32_t group, 
> uint64_t attr,
>  int kvm_device_access(int dev_fd, uint32_t group, uint64_t attr,
> void *val, bool write);
>  
> +int _vcpu_has_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t group,
> +   uint64_t attr);
> +int vcpu_has_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t group,
> +  uint64_t attr);
> +int _vcpu_access_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t 
> group,
> +   uint64_t attr, void *val, bool write);
> +int vcpu_access_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t 
> group,
> +  uint64_t attr, void *val, bool write);
> +
>  const char *exit_reason_str(unsigned int exit_reason);
>  
>  void virt_pgd_alloc(struct kvm_vm *vm);
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
> b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 10a8ed691c66..b595e7dc3fc5 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2040,6 +2040,44 @@ int kvm_device_access(int dev_fd, uint32_t group, 
> uint64_t attr,
>   return ret;
>  }
>  
> +int _vcpu_has_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t group,
> +   uint64_t attr)
> +{
> + struct vcpu *vcpu = vcpu_find(vm, vcpuid);
> +
> + TEST_ASSERT(vcpu, "nonexistent vcpu id: %d", vcpuid);
> +
> + return _kvm_device_check_attr(vcpu->fd, group, attr);
> +}
> +
> +int vcpu_has_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t group,
> +  uint64_t attr)
> +{
> + int ret = _vcpu_has_device_attr(vm, vcpuid, group, attr);
> +
> + TEST_ASSERT(!ret, "KVM_HAS_DEVICE_ATTR IOCTL failed, rc: %i errno: %i", 
> ret, errno);
> + return ret;
> +}
> +
> +int _vcpu_access_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t 
> group,
> +  uint64_t attr, void *val, bool write)
> +{
> + struct vcpu *vcpu = vcpu_find(vm, vcpuid);
> +
> + TEST_ASSERT(vcpu, "nonexistent vcpu id: %d", vcpuid);
> +
> + return _kvm_device_access(vcpu->fd, group, attr, val, write);
> +}
> +
> +int vcpu_access_device_attr(struct kvm_vm *vm, uint32_t vcpuid, uint32_t 
> group,
> + uint64_t attr, void *val, bool write)
> +{
> + int ret = _vcpu_access_device_attr(vm, vcpuid, group, attr, val, write);
> +
> + TEST_ASSERT(!ret, "KVM_SET|GET_DEVICE_ATTR IOCTL failed, rc: %i errno: 
> %i", ret, errno);
> + return ret;
> +}


Reviewed-by: Andrew Jones 


The 'assert !ret's are correct here. I see they are not correct in 

 kvm_device_check_attr
 kvm_create_device
 kvm_device_access

though, as they are 'assert ret >= 0', but the documentation says 0 on
success. It'd be nice to get that fixed before we build more API on top
of it.
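
i.e. have them follow the same pattern as the new helpers, something like
(sketch):

int kvm_device_check_attr(int dev_fd, uint32_t group, uint64_t attr)
{
	int ret = _kvm_device_check_attr(dev_fd, group, attr);

	TEST_ASSERT(!ret, "KVM_HAS_DEVICE_ATTR IOCTL failed, rc: %i errno: %i",
		    ret, errno);
	return ret;
}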

Thanks,
drew


> +
>  /*
>   * VM Dump
>   *
> -- 
> 2.32.0.402.g57bb445576-goog
> 
> 



Re: [PATCH v2 05/12] selftests: KVM: Add test for KVM_{GET, SET}_CLOCK

2021-07-21 Thread Andrew Jones
On Fri, Jul 16, 2021 at 09:26:22PM +, Oliver Upton wrote:
> Add a selftest for the new KVM clock UAPI that was introduced. Ensure
> that the KVM clock is consistent between userspace and the guest, and
> that the difference in realtime will only ever cause the KVM clock to
> advance forward.
> 
> Signed-off-by: Oliver Upton 
> ---
>  tools/testing/selftests/kvm/.gitignore|   1 +
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../testing/selftests/kvm/include/kvm_util.h  |   2 +
>  .../selftests/kvm/x86_64/kvm_clock_test.c | 210 ++
>  4 files changed, 214 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_clock_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore 
> b/tools/testing/selftests/kvm/.gitignore
> index 06a351b4f93b..d0877d01e771 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -11,6 +11,7 @@
>  /x86_64/emulator_error_test
>  /x86_64/get_cpuid_test
>  /x86_64/get_msr_index_features
> +/x86_64/kvm_clock_test
>  /x86_64/kvm_pv_test
>  /x86_64/hyperv_clock
>  /x86_64/hyperv_cpuid
> diff --git a/tools/testing/selftests/kvm/Makefile 
> b/tools/testing/selftests/kvm/Makefile
> index b853be2ae3c6..f7e24f334c6e 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -46,6 +46,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/get_cpuid_test
>  TEST_GEN_PROGS_x86_64 += x86_64/hyperv_clock
>  TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
>  TEST_GEN_PROGS_x86_64 += x86_64/hyperv_features
> +TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
>  TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
>  TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
>  TEST_GEN_PROGS_x86_64 += x86_64/mmu_role_test
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
> b/tools/testing/selftests/kvm/include/kvm_util.h
> index 010b59b13917..a8ac5d52e17b 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -19,6 +19,8 @@
>  #define KVM_DEV_PATH "/dev/kvm"
>  #define KVM_MAX_VCPUS 512
>  
> +#define NSEC_PER_SEC 1000000000L
> +
>  /*
>   * Callers of kvm_util only have an incomplete/opaque description of the
>   * structure kvm_util is using to maintain the state of a VM.
> diff --git a/tools/testing/selftests/kvm/x86_64/kvm_clock_test.c 
> b/tools/testing/selftests/kvm/x86_64/kvm_clock_test.c
> new file mode 100644
> index ..34c48f2dde54
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/kvm_clock_test.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021, Google LLC.
> + *
> + * Tests for adjusting the KVM clock from userspace
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +#define VCPU_ID 0
> +
> +struct test_case {
> + uint64_t kvmclock_base;
> + int64_t realtime_offset;
> +};
> +
> +static struct test_case test_cases[] = {
> + { .kvmclock_base = 0 },
> + { .kvmclock_base = 180 * NSEC_PER_SEC },
> + { .kvmclock_base = 0, .realtime_offset = -180 * NSEC_PER_SEC },
> + { .kvmclock_base = 0, .realtime_offset = 180 * NSEC_PER_SEC },
> +};
> +
> +#define GUEST_SYNC_CLOCK(__stage, __val) \
> + GUEST_SYNC_ARGS(__stage, __val, 0, 0, 0)
> +
> +static void guest_main(vm_paddr_t pvti_pa, struct pvclock_vcpu_time_info 
> *pvti)
> +{
> + int i;
> +
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> + for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
> + GUEST_SYNC_CLOCK(i, __pvclock_read_cycles(pvti, rdtsc()));
> + }
> +
> + GUEST_DONE();
> +}
> +
> +#define EXPECTED_FLAGS (KVM_CLOCK_REAL_TIME | KVM_CLOCK_HOST_TSC)
> +
> +static inline void assert_flags(struct kvm_clock_data *data)
> +{
> + TEST_ASSERT((data->flags & EXPECTED_FLAGS) == EXPECTED_FLAGS,
> + "unexpected clock data flags: %x (want set: %x)",
> + data->flags, EXPECTED_FLAGS);
> +}
> +
> +static void handle_sync(struct ucall *uc, struct kvm_clock_data *start,
> + struct kvm_clock_data *end)
> +{
> + uint64_t obs, exp_lo, exp_hi;
> +
> + obs = uc->args[2];
> + exp_lo = start->clock;
> + exp_hi = end->clock;
> +
> + assert_flags(start);
> + assert_flags(end);
> +
> + TEST_ASSERT(exp_lo <= obs && obs <= exp_hi,
> + "unexpected kvm-clock value: %"PRIu64" expected range: 
> [%"PRIu64", %"PRIu64"]",
> + obs, exp_lo, exp_hi);
> +
> + pr_info("kvm-clock value: %"PRIu64" expected range [%"PRIu64", 
> %"PRIu64"]\n",
> + obs, exp_lo, exp_hi);
> +}
> +
> +static void handle_abort(struct ucall *uc)
> +{
> + TEST_FAIL("%s at %s:%ld", (const char *)uc->args[0],
> +   __FILE__, uc->args[1]);
> +}
> +
> +static void 

Re: [PATCH 1/5] KVM: arm64: Walk userspace page tables to compute the THP mapping size

2021-07-21 Thread Will Deacon
Hey Sean,

On Tue, Jul 20, 2021 at 08:33:46PM +, Sean Christopherson wrote:
> On Tue, Jul 20, 2021, Alexandru Elisei wrote:
> > I just can't figure out why having the mmap lock is not needed to walk the
> > userspace page tables. Any hints? Or am I not seeing where it's taken?
> 
> Disclaimer: I'm not super familiar with arm64's page tables, but the relevant 
> KVM
> functionality is common across x86 and arm64.

No need for the disclaimer, there are so many moving parts here that I don't
think it's possible to be familiar with them all! Thanks for taking the time
to write it up so clearly.

> KVM arm64 (and x86) unconditionally registers a mmu_notifier for the mm_struct
> associated with the VM, and disallows calling ioctls from a different process,
> i.e. walking the page tables during KVM_RUN is guaranteed to use the mm for 
> which
> KVM registered the mmu_notifier.  As part of registration, the mmu_notifier
> does mmgrab() and doesn't do mmdrop() until it's unregistered.  That ensures 
> the
> mm_struct itself is live.
> 
> For the page tables liveliness, KVM implements mmu_notifier_ops.release, 
> which is
> invoked at the beginning of exit_mmap(), before the page tables are freed.  In
> its implementation, KVM takes mmu_lock and zaps all its shadow page tables, 
> a.k.a.
> the stage2 tables in KVM arm64.  The flow in question, 
> get_user_mapping_size(),
> also runs under mmu_lock, and so effectively blocks exit_mmap() and thus is
> guaranteed to run with live userspace tables.

Unless I missed a case, exit_mmap() only runs when mm_struct::mm_users drops
to zero, right? The vCPU tasks should hold references to that afaict, so I
don't think it should be possible for exit_mmap() to run while there are
vCPUs running with the corresponding page-table.

> Looking at the arm64 code, one thing I'm not clear on is whether arm64 
> correctly
> handles the case where exit_mmap() wins the race.  The invalidate_range hooks 
> will
> still be called, so userspace page tables aren't a problem, but
> kvm_arch_flush_shadow_all() -> kvm_free_stage2_pgd() nullifies mmu->pgt 
> without
> any additional notifications that I see.  x86 deals with this by ensuring its
> top-level TDP entry (stage2 equivalent) is valid while the page fault handler 
> is
> running.

But the fact that x86 handles this race has me worried. What am I missing?

I agree that, if the race can occur, we don't appear to handle it in the
arm64 backend.

Cheers,

Will


Re: [PATCH 13/14] KVM: arm64: Restrict hyp stage-1 manipulation in protected mode

2021-07-21 Thread Fuad Tabba
Hi Quentin,

On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret  wrote:
>
> The host kernel is currently able to change EL2 stage-1 mappings without
> restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
> world where the host is no longer part of the TCB, this clearly poses a
> problem.
>
> To fix this, introduce a new hypercall to allow the host to share a
> range of physical memory with the hypervisor, and remove the
> __pkvm_create_mappings() variant. The new hypercall implements
> ownership and permission checks before allowing the sharing operation,
> and it annotates the shared pages in the hypervisor stage-1 and host
> stage-2 page-tables.
>
> Signed-off-by: Quentin Perret 
> ---
>  arch/arm64/include/asm/kvm_asm.h  |   2 +-
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |   1 +
>  arch/arm64/kvm/hyp/include/nvhe/mm.h  |   2 -
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c|  12 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c | 105 ++
>  arch/arm64/kvm/hyp/nvhe/mm.c  |   4 +-
>  arch/arm64/kvm/mmu.c  |  14 ++-
>  7 files changed, 124 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h 
> b/arch/arm64/include/asm/kvm_asm.h
> index 9f0bf2109be7..78db818ae2c9 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -59,7 +59,7 @@
>  #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs  13
>  #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs   14
>  #define __KVM_HOST_SMCCC_FUNC___pkvm_init  15
> -#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings   16
> +#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp16
>  #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping17
>  #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector18
>  #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h 
> b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index b39047463075..f37e4d3b831b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -22,6 +22,7 @@ extern struct host_kvm host_kvm;
>
>  int __pkvm_prot_finalize(void);
>  int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
> +int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end);
>
>  int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
>  int kvm_host_prepare_stage2(void *pgt_pool_base);
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h 
> b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> index c76d7136ed9b..c9a8f535212e 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -24,8 +24,6 @@ int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, 
> phys_addr_t back);
>  int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
>  int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
>  int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot 
> prot);
> -int __pkvm_create_mappings(unsigned long start, unsigned long size,
> -  unsigned long phys, enum kvm_pgtable_prot prot);
>  unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> enum kvm_pgtable_prot prot);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c 
> b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 1632f001f4ed..f05ecbd382d0 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -140,14 +140,12 @@ static void handle___pkvm_cpu_set_vector(struct 
> kvm_cpu_context *host_ctxt)
> cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
>  }
>
> -static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> +static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
>  {
> -   DECLARE_REG(unsigned long, start, host_ctxt, 1);
> -   DECLARE_REG(unsigned long, size, host_ctxt, 2);
> -   DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> -   DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
> +   DECLARE_REG(phys_addr_t, start, host_ctxt, 1);
> +   DECLARE_REG(phys_addr_t, end, host_ctxt, 2);
>
> -   cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, 
> prot);
> +   cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(start, end);
>  }
>
>  static void handle___pkvm_create_private_mapping(struct kvm_cpu_context 
> *host_ctxt)
> @@ -193,7 +191,7 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__vgic_v3_restore_aprs),
> HANDLE_FUNC(__pkvm_init),
> HANDLE_FUNC(__pkvm_cpu_set_vector),
> -   HANDLE_FUNC(__pkvm_create_mappings),
> +   HANDLE_FUNC(__pkvm_host_share_hyp),
> HANDLE_FUNC(__pkvm_create_private_mapping),
> HANDLE_FUNC(__pkvm_prot_finalize),
>   

Re: [PATCH v3 14/15] KVM: arm64: Handle protected guests at 32 bits

2021-07-21 Thread Fuad Tabba
Hi Oliver,

On Mon, Jul 19, 2021 at 8:43 PM Oliver Upton  wrote:
>
> On Mon, Jul 19, 2021 at 9:04 AM Fuad Tabba  wrote:
> >
> > Protected KVM does not support protected AArch32 guests. However,
> > it is possible for the guest to force run AArch32, potentially
> > causing problems. Add an extra check so that if the hypervisor
> > catches the guest doing that, it can prevent the guest from
> > running again by resetting vcpu->arch.target and returning
> > ARM_EXCEPTION_IL.
> >
> > Adapted from commit 22f553842b14 ("KVM: arm64: Handle Asymmetric
> > AArch32 systems")
> >
> > Signed-off-by: Fuad Tabba 
>
> Would it make sense to document how we handle misbehaved guests, in
> case a particular VMM wants to clean up the mess afterwards?

I agree, especially since with this patch this could happen in more
than one place.

Thanks,
/fuad

> --
> Thanks,
> Oliver
>
> > ---
> >  arch/arm64/kvm/hyp/include/hyp/switch.h | 24 
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h 
> > b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > index 8431f1514280..f09343e15a80 100644
> > --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> > +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > @@ -23,6 +23,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -477,6 +478,29 @@ static inline bool fixup_guest_exit(struct kvm_vcpu 
> > *vcpu, u64 *exit_code)
> > write_sysreg_el2(read_sysreg_el2(SYS_ELR) - 4, 
> > SYS_ELR);
> > }
> >
> > +   /*
> > +* Protected VMs might not be allowed to run in AArch32. The check 
> > below
> > +* is based on the one in kvm_arch_vcpu_ioctl_run().
> > +* The ARMv8 architecture doesn't give the hypervisor a mechanism to
> > +* prevent a guest from dropping to AArch32 EL0 if implemented by 
> > the
> > +* CPU. If the hypervisor spots a guest in such a state ensure it is
> > +* handled, and don't trust the host to spot or fix it.
> > +*/
> > +   if (unlikely(is_nvhe_hyp_code() &&
> > +kvm_vm_is_protected(kern_hyp_va(vcpu->kvm)) &&
> > +FIELD_GET(FEATURE(ID_AA64PFR0_EL0),
> > +  PVM_ID_AA64PFR0_ALLOW) <
> > +ID_AA64PFR0_ELx_32BIT_64BIT &&
> > +vcpu_mode_is_32bit(vcpu))) {
> > +   /*
> > +* As we have caught the guest red-handed, decide that it 
> > isn't
> > +* fit for purpose anymore by making the vcpu invalid.
> > +*/
> > +   vcpu->arch.target = -1;
> > +   *exit_code = ARM_EXCEPTION_IL;
> > +   goto exit;
> > +   }
> > +
> > /*
> >  * We're using the raw exception code in order to only process
> >  * the trap if no SError is pending. We will come back to the
> > --
> > 2.32.0.402.g57bb445576-goog
> >


Re: [PATCH 12/14] KVM: arm64: Refactor pkvm_pgtable locking

2021-07-21 Thread Fuad Tabba
Hi Quentin,

On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret  wrote:
>
> Refactor the hypervisor stage-1 locking in nVHE protected mode to expose
> a new pkvm_create_mappings_locked() function. This will be used in later
> patches to allow walking and changing the hypervisor stage-1 without
> releasing the lock.
>
> Signed-off-by: Quentin Perret 
> ---
>  arch/arm64/kvm/hyp/include/nvhe/mm.h |  1 +
>  arch/arm64/kvm/hyp/nvhe/mm.c | 16 ++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h 
> b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> index 8ec3a5a7744b..c76d7136ed9b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -23,6 +23,7 @@ int hyp_map_vectors(void);
>  int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
>  int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
>  int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot 
> prot);
>  int __pkvm_create_mappings(unsigned long start, unsigned long size,
>unsigned long phys, enum kvm_pgtable_prot prot);
>  unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,

The locking logic seems to be consistent, with pkvm_create_mappings()
holding the lock for the whole duration of the operation rather than
per-iteration.

It would be nice though to document which lock should be held for the
_locked versions.
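
Something like this above the declaration, for instance (sketch, wording
is mine):

/*
 * pkvm_create_mappings_locked() - map a range in the hypervisor stage-1.
 *
 * Must be called with the hyp stage-1 lock (pkvm_pgd_lock) held by the
 * caller; pkvm_create_mappings() is the wrapper that takes and releases
 * the lock itself.
 */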

Thanks,
/fuad




> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> index a8efdf0f9003..dde22e2a322a 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mm.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -67,7 +67,7 @@ unsigned long __pkvm_create_private_mapping(phys_addr_t 
> phys, size_t size,
> return addr;
>  }
>
> -int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> +int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot 
> prot)
>  {
> unsigned long start = (unsigned long)from;
> unsigned long end = (unsigned long)to;
> @@ -81,7 +81,8 @@ int pkvm_create_mappings(void *from, void *to, enum 
> kvm_pgtable_prot prot)
> int err;
>
> phys = hyp_virt_to_phys((void *)virt_addr);
> -   err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, 
> prot);
> +   err = kvm_pgtable_hyp_map(&pkvm_pgtable, virt_addr, PAGE_SIZE,
> + phys, prot);
> if (err)
> return err;
> }
> @@ -89,6 +90,17 @@ int pkvm_create_mappings(void *from, void *to, enum 
> kvm_pgtable_prot prot)
> return 0;
>  }
>
> +int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> +{
> +   int ret;
> +
> +   hyp_spin_lock(&pkvm_pgd_lock);
> +   ret = pkvm_create_mappings_locked(from, to, prot);
> +   hyp_spin_unlock(&pkvm_pgd_lock);
> +
> +   return ret;
> +}
> +
>  int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
>  {
> unsigned long start, end;
> --
> 2.32.0.402.g57bb445576-goog
>


Re: [PATCH 11/14] KVM: arm64: Expose kvm_pte_valid() helper

2021-07-21 Thread Fuad Tabba
Hi Quentin,


On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret  wrote:
>
> The KVM pgtable API exposes the kvm_pgtable_walk() function to allow
> the definition of walkers outside of pgtable.c. However, it is not easy
> to implement any of those walkers without some of the low-level helpers,
> such as kvm_pte_valid(). Make it static inline, and move it to the
> header file to allow its re-use in other places.
>
> Signed-off-by: Quentin Perret 
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 7 +++
>  arch/arm64/kvm/hyp/pgtable.c | 6 --
>  2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
> b/arch/arm64/include/asm/kvm_pgtable.h
> index 1aa49d6aabb7..8240c881ae1e 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_get_parange(u64 mmfr0)
>
>  typedef u64 kvm_pte_t;
>
> +#define KVM_PTE_VALID  BIT(0)
> +

I don't know if there's a better solution for this, but having the
KVM_PTE_VALID by itself here, with the rest remaining in pgtable.c
might be confusing. I see that you probably don't want to move them
all here because they are internal to pgtable.c.

Thanks,
/fuad

> +static inline bool kvm_pte_valid(kvm_pte_t pte)
> +{
> +   return pte & KVM_PTE_VALID;
> +}
> +
>  /**
>   * struct kvm_pgtable_mm_ops - Memory management callbacks.
>   * @zalloc_page:   Allocate a single zeroed memory page.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c7120797404a..e0ae57dca827 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -11,7 +11,6 @@
>  #include 
>  #include 
>
> -#define KVM_PTE_VALID  BIT(0)
>
>  #define KVM_PTE_TYPE   BIT(1)
>  #define KVM_PTE_TYPE_BLOCK 0
> @@ -135,11 +134,6 @@ static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
> return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
>  }
>
> -static bool kvm_pte_valid(kvm_pte_t pte)
> -{
> -   return pte & KVM_PTE_VALID;
> -}
> -
>  static bool kvm_pte_table(kvm_pte_t pte, u32 level)
>  {
> if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> --
> 2.32.0.402.g57bb445576-goog
>


Re: [PATCH V8 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-07-21 Thread Like Xu

On 16/7/2021 4:53 pm, Zhu Lingshan wrote:

+   } else if (xenpmu_data->pmu.r.regs.cpl & 3)


Lingshan, are you serious about this version?

arch/x86/xen/pmu.c:438:9: error: expected identifier or ‘(’ before ‘return’
  438 | return state;
  | ^~
arch/x86/xen/pmu.c:439:1: error: expected identifier or ‘(’ before ‘}’ token
  439 | }
  | ^
arch/x86/xen/pmu.c: In function ‘xen_guest_state’:
arch/x86/xen/pmu.c:436:9: error: control reaches end of non-void 
function [-Werror=return-type]

  436 | }
  | ^
cc1: some warnings being treated as errors


+   state |= PERF_GUEST_USER;
}
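
The errors read like a brace imbalance around this hunk: xen_guest_state()
gets closed early, so the trailing "return state;" ends up at file scope.
Presumably the intent was something along these lines (just guessing, I
haven't checked the actual patch):

	} else if (xenpmu_data->pmu.r.regs.cpl & 3) {
		state |= PERF_GUEST_USER;
	}

	return state;
}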



Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Sergey Senozhatsky
On (21/07/21 09:40), Marc Zyngier wrote:
> > 
> > Can that be cured by just checking vcpu->preempted before calling
> > kvm_update_vcpu_preempted() ?
> 
> It isn't obvious to me that this is the right thing to do.
> vcpu->preempted is always updated on sched-out from the preempt
> notifier if the vcpu was on the run-queue, so my guess is that it will
> always be set when switching to another task.
> 
> What you probably want is to check whether the vcpu is blocked by
> introspecting the wait-queue with:
> 
>   rcuwait_active(kvm_arch_vcpu_get_wait(vcpu))
> 
> which will tell you whether you are blocking or not. We are already
> using a similar construct for arming a background timer in this case.

Can we examine if vcpu->run->exit_reason == WFE/WFI and avoid setting
preempted state if so?


Re: [PATCHv2 3/4] arm64: do not use dummy vcpu_is_preempted()

2021-07-21 Thread Sergey Senozhatsky
On (21/07/12 16:47), Marc Zyngier wrote:
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /* See include/linux/spinlock.h */
> >  #define smp_mb__after_spinlock()   smp_mb()
> >  
> > -/*
> > - * Changing this will break osq_lock() thanks to the call inside
> > - * smp_cond_load_relaxed().
> > - *
> > - * See:
> > - * 
> > https://lore.kernel.org/lkml/20200110100612.gc2...@hirez.programming.kicks-ass.net
> > - */
> 
> Why are you deleting this? Please explain your reasoning in the commit
> message. It seems to me that it still makes complete sense when
> CONFIG_PARAVIRT is not defined.

You are right. I'll move it to !PARAVIRT #else-branch.


Re: [PATCHv2 2/4] arm64: add guest pvstate support

2021-07-21 Thread Sergey Senozhatsky
On (21/07/21 09:22), Marc Zyngier wrote:
> On Wed, 21 Jul 2021 03:05:25 +0100,
> Sergey Senozhatsky  wrote:
> > 
> > On (21/07/12 16:42), Marc Zyngier wrote:
> > > > 
> > > > PV-vcpu-state is a per-CPU struct, which, for the time being,
> > > > holds boolean `preempted' vCPU state. During the startup,
> > > > given that host supports PV-state, each guest vCPU sends
> > > > a pointer to its per-CPU variable to the host as a payload
> > > 
> > > What is the expected memory type for this memory region? What is its
> > > life cycle? Where is it allocated from?
> > 
> > Guest per-CPU area, whose physical address is shared with the
> > host.
> 
> Again: what are the memory types you expect this to be used with?

I heard your questions, I'm trying to figure out the answers now.

As for memory type - I presume you are talking about coherent vs
non-coherent memory. Can guest per-CPU memory be non-coherent? Guest
never writes anything to the region of memory it shares with the host,
it only reads what the host writes to it. All reads and writes are
done from CPU (no devices DMA access, etc).

Do we need any cache flushes/syncs in this case?

> When will the hypervisor ever stop accessing this?

KVM always access it for the vcpus that are getting scheduled out or
scheduled in on the host side.

> How does it work across reset?

I need to figure out what happens during reset/migration in the first
place.

> I'm sorry to be that pressing, but these are the gory details that
> actually matter if you want this thing to be maintainable.

Sure, no problem.


Re: [PATCHv2 2/4] arm64: add guest pvstate support

2021-07-21 Thread Sergey Senozhatsky
On (21/07/12 16:42), Marc Zyngier wrote:
> > 
> > PV-vcpu-state is a per-CPU struct, which, for the time being,
> > holds boolean `preempted' vCPU state. During the startup,
> > given that host supports PV-state, each guest vCPU sends
> > a pointer to its per-CPU variable to the host as a payload
> 
> What is the expected memory type for this memory region? What is its
> life cycle? Where is it allocated from?

Guest per-CPU area, whose physical address is shared with the host.

> > with the SMCCC HV call, so that host can update vCPU state
> > when it puts or loads vCPU.
> > 
> > This has impact on the guest's scheduler:
> > 
> > [..]
> >   wake_up_process()
> >try_to_wake_up()
> > select_task_rq_fair()
> >  available_idle_cpu()
> >   vcpu_is_preempted()
> > 
> > Some sched benchmarks data is available on the github page [0].
> > 
> > [0] https://github.com/sergey-senozhatsky/arm64-vcpu_is_preempted
> 
> Please include these results in the cover letter. I tend to reply to
> email while offline, and I can't comment on GH.

ACK.

> > +struct vcpu_state {
> 
> If this is KVM specific (which it most likely is), please name-space
> it correctly, and move it to a KVM-specific location.

ACK.

> > +   boolpreempted;
> > +   u8  reserved[63];
> 
> Why 63? Do you attach any particular meaning to a 64byte structure
> (and before you say "cache line size", please look at some of the
> cache line sizes we have to deal with...).

We do have some future plans to share some bits of the guest's context
with the host.

> This should also be versioned from day-1, one way or another.

Makes sense.

> > +};
> > +
> >  #ifdef CONFIG_PARAVIRT
> >  #include 
> >  
> > @@ -20,8 +25,22 @@ static inline u64 paravirt_steal_clock(int cpu)
> >  
> >  int __init pv_time_init(void);
> >  
> > +bool dummy_vcpu_is_preempted(unsigned int cpu);
> > +
> > +extern struct static_key pv_vcpu_is_preempted_enabled;
> > +DECLARE_STATIC_CALL(pv_vcpu_is_preempted, dummy_vcpu_is_preempted);
> > +
> > +static inline bool paravirt_vcpu_is_preempted(unsigned int cpu)
> > +{
> > +   return static_call(pv_vcpu_is_preempted)(cpu);
> > +}
> > +
> > +int __init pv_vcpu_state_init(void);
> > +
> >  #else
> >  
> > +#define pv_vcpu_state_init() do {} while (0)
> > +
> >  #define pv_time_init() do {} while (0)
> >  
> >  #endif // CONFIG_PARAVIRT
> > diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
> > index 75fed4460407..d8fc46795d94 100644
> > --- a/arch/arm64/kernel/paravirt.c
> > +++ b/arch/arm64/kernel/paravirt.c
> > @@ -40,6 +40,11 @@ struct pv_time_stolen_time_region {
> >  
> >  static DEFINE_PER_CPU(struct pv_time_stolen_time_region, 
> > stolen_time_region);
> >  
> > +static DEFINE_PER_CPU(struct vcpu_state, vcpus_states);
> 
> nit: there is only one 'state' structure per CPU, so I'd prefer the
> singular form.

ACK.

> > +struct static_key pv_vcpu_is_preempted_enabled;
> > +
> > +DEFINE_STATIC_CALL(pv_vcpu_is_preempted, dummy_vcpu_is_preempted);
> > +
> >  static bool steal_acc = true;
> >  static int __init parse_no_stealacc(char *arg)
> >  {
> > @@ -165,3 +170,92 @@ int __init pv_time_init(void)
> >  
> > return 0;
> >  }
> > +
> > +bool dummy_vcpu_is_preempted(unsigned int cpu)
> 
> Why does this have to be global?

I think this can be moved away from the header, so then we don't need
to DECLARE_STATIC_CALL() with a dummy function.

> > +static bool has_pv_vcpu_state(void)
> > +{
> > +   struct arm_smccc_res res;
> > +
> > +   /* To detect the presence of PV time support we require SMCCC 1.1+ */
> > +   if (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)
> > +   return false;
> > +
> > +   arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
> > +ARM_SMCCC_HV_PV_VCPU_STATE_FEATURES,
> > +&res);
> > +
> > +   if (res.a0 != SMCCC_RET_SUCCESS)
> > +   return false;
> > +   return true;
> 
> Please move all this over the the KVM-specific discovery mechanism.

Will take a look.

> > +static int __pv_vcpu_state_hook(unsigned int cpu, int event)
> > +{
> > +   struct arm_smccc_res res;
> > +   struct vcpu_state *st;
> > +
> > +   st = &per_cpu(vcpus_states, cpu);
> > +   arm_smccc_1_1_invoke(event, virt_to_phys(st), &res);
> > +   if (res.a0 != SMCCC_RET_SUCCESS)
> > +   return -EINVAL;
> > +   return 0;
> > +}
> > +
> > +static int vcpu_state_init(unsigned int cpu)
> > +{
> > +   int ret = __pv_vcpu_state_hook(cpu, ARM_SMCCC_HV_PV_VCPU_STATE_INIT);
> > +
> > +   if (ret)
> > +   pr_warn("Unable to ARM_SMCCC_HV_PV_STATE_INIT\n");
> 
> pr_warn_once(), please.

ACK.


Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Sergey Senozhatsky
On (21/07/12 17:24), Marc Zyngier wrote:
> >  
> >  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  {
> > +   kvm_update_vcpu_preempted(vcpu, true);
> 
> This doesn't look right. With this, you are now telling the guest that
> a vcpu that is blocked on WFI is preempted. This really isn't the
> case, as it has voluntarily entered a low-power mode while waiting for
> an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
> be running either.

I suppose you are talking about kvm_vcpu_block(). Well, it checks
kvm_vcpu_check_block() but then it simply schedules the vcpu task out
via schedule(), which does look like "the vcpu is preempted". Once we
sched_in() that vcpu task again we mark it as non-preempted, even
though it is still in the KVM WFx handler. Why isn't that right?

Another call path is iret:


__schedule()
 context_switch()
  prepare_task_switch()
   fire_sched_out_preempt_notifiers()
    kvm_sched_out()
     kvm_arch_vcpu_put()


Re: [PATCH v3 06/15] KVM: arm64: Restore mdcr_el2 from vcpu

2021-07-21 Thread Fuad Tabba
Hi Drew,

On Tue, Jul 20, 2021 at 3:53 PM Andrew Jones  wrote:
>
> On Mon, Jul 19, 2021 at 05:03:37PM +0100, Fuad Tabba wrote:
> > On deactivating traps, restore the value of mdcr_el2 from the
> > newly created and preserved host value vcpu context, rather than
> > directly reading the hardware register.
> >
> > Up until and including this patch the two values are the same,
> > i.e., the hardware register and the vcpu one. A future patch will
> > be changing the value of mdcr_el2 on activating traps, and this
> > ensures that its value will be restored.
> >
> > No functional change intended.
>
> I'm probably missing something, but I can't convince myself that the host
> will end up with the same mdcr_el2 value after deactivating traps after
> this patch as before. We clearly now restore whatever we had when
> activating traps (presumably whatever we configured at init_el2_state
> time), but is that equivalent to what we had before with the masking and
> ORing that this patch drops?

You're right. I thought that these were actually being initialized to
the same values, but having a closer look at the code the mdcr values
are not the same as pre-patch. I will fix this.

Thanks!
/fuad

> Thanks,
> drew
>
> >
> > Signed-off-by: Fuad Tabba 
> > ---
> >  arch/arm64/include/asm/kvm_host.h   |  5 -
> >  arch/arm64/include/asm/kvm_hyp.h|  2 +-
> >  arch/arm64/kvm/hyp/include/hyp/switch.h |  6 +-
> >  arch/arm64/kvm/hyp/nvhe/switch.c| 11 ++-
> >  arch/arm64/kvm/hyp/vhe/switch.c | 12 ++--
> >  arch/arm64/kvm/hyp/vhe/sysreg-sr.c  |  2 +-
> >  6 files changed, 15 insertions(+), 23 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 4d2d974c1522..76462c6a91ee 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -287,10 +287,13 @@ struct kvm_vcpu_arch {
> >   /* Stage 2 paging state used by the hardware on next switch */
> >   struct kvm_s2_mmu *hw_mmu;
> >
> > - /* HYP configuration */
> > + /* Values of trap registers for the guest. */
> >   u64 hcr_el2;
> >   u64 mdcr_el2;
> >
> > + /* Values of trap registers for the host before guest entry. */
> > + u64 mdcr_el2_host;
> > +
> >   /* Exception Information */
> >   struct kvm_vcpu_fault_info fault;
> >
> > diff --git a/arch/arm64/include/asm/kvm_hyp.h 
> > b/arch/arm64/include/asm/kvm_hyp.h
> > index 9d60b3006efc..657d0c94cf82 100644
> > --- a/arch/arm64/include/asm/kvm_hyp.h
> > +++ b/arch/arm64/include/asm/kvm_hyp.h
> > @@ -95,7 +95,7 @@ void __sve_restore_state(void *sve_pffr, u32 *fpsr);
> >
> >  #ifndef __KVM_NVHE_HYPERVISOR__
> >  void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
> > -void deactivate_traps_vhe_put(void);
> > +void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
> >  #endif
> >
> >  u64 __guest_enter(struct kvm_vcpu *vcpu);
> > diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h 
> > b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > index e4a2f295a394..a0e78a6027be 100644
> > --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> > +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > @@ -92,11 +92,15 @@ static inline void __activate_traps_common(struct 
> > kvm_vcpu *vcpu)
> >   write_sysreg(0, pmselr_el0);
> >   write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0);
> >   }
> > +
> > + vcpu->arch.mdcr_el2_host = read_sysreg(mdcr_el2);
> >   write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
> >  }
> >
> > -static inline void __deactivate_traps_common(void)
> > +static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu)
> >  {
> > + write_sysreg(vcpu->arch.mdcr_el2_host, mdcr_el2);
> > +
> >   write_sysreg(0, hstr_el2);
> >   if (kvm_arm_support_pmu_v3())
> >   write_sysreg(0, pmuserenr_el0);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c 
> > b/arch/arm64/kvm/hyp/nvhe/switch.c
> > index f7af9688c1f7..1778593a08a9 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/switch.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/switch.c
> > @@ -69,12 +69,10 @@ static void __activate_traps(struct kvm_vcpu *vcpu)
> >  static void __deactivate_traps(struct kvm_vcpu *vcpu)
> >  {
> >   extern char __kvm_hyp_host_vector[];
> > - u64 mdcr_el2, cptr;
> > + u64 cptr;
> >
> >   ___deactivate_traps(vcpu);
> >
> > - mdcr_el2 = read_sysreg(mdcr_el2);
> > -
> >   if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
> >   u64 val;
> >
> > @@ -92,13 +90,8 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
> >   isb();
> >   }
> >
> > - __deactivate_traps_common();
> > -
> > - mdcr_el2 &= MDCR_EL2_HPMN_MASK;
> > - mdcr_el2 |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;
> > - mdcr_el2 |= MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT;
> > + __deactivate_traps_common(vcpu);
> >
> > - write_sysreg(mdcr_el2, mdcr_el2);
> >   

Re: [PATCH 08/14] KVM: arm64: Add support for tagging shared pages in page-table

2021-07-21 Thread Fuad Tabba
Hi Quentin,


On Tue, Jul 20, 2021 at 3:06 PM Quentin Perret  wrote:
>
> On Tuesday 20 Jul 2021 at 14:48:09 (+0100), Fuad Tabba wrote:
> > This might tie in to Marc's comments for using enums, but
> > consolidating the translation between prot and ignored/software bits
> > in one place would be good: thinking about patch 10 as well, where you
> > get the prot from the ignored bits. Even though you have documented
> > it, I'm finding the part where a field can be borrowed and shared as
> > opposed to being only shared not very intuitive, and I need to reread
> > the comment here to remember the difference while going through the
> > code.
> >
> > You also mention lending as potentially reserved for the future, but I
> > think that lending is the other side of borrowing (depends on who's
> > doing the giving/taking). I wonder if in this case it would be clearer
> > to describe it in terms of whether it's exclusively owned vs owned but
> > shared (for the owner), and just shared for the sharer...
>
> Argh so I actually found the encoding pretty neat :/
> The idea is the following:
>
>   - an entity that has a page mapped as SHARED in its PT means it
> doesn't have exclusive access to the page;
>
>   - an entity that has a page mapped as BORROWED in its PT means it has
> access to a page it doesn't own;
>
> From that we can build the states we need:
>
>   - when an entity shares a page with another, the original owner gets a
> SHARED mapping, and the recipient a SHARED+BORROWED mapping.
>
>   - and in the future when/if we implement lending (which means an
> entity gives exclusive access to a page to another entity, but
> retains ownership) we can map the page in the recipient as
> 'BORROWED' only, but not 'SHARED'. And the original owner will have
> an invalid mapping with a new state 'LENT', which is encoded with
> both SW bits set.
>
> How does that sound? Did you have something else in mind?

The encoding is very neat by the way :D

I see where you're going with the lent state now, and I understand the
states as well as the possible transitions now that you've explained
it.

It's the terminology that confused me a bit (especially when you
mention lending, which seemed to imply it is something distinct from
borrowing, as opposed to just the other side of it). What would help
me is documenting this, and the possible combinations/legal states.
kvm_pgtable.h describes the prots a bit, but maybe you could expand
that, similarly to what you've done in this email:

@KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity: has
access to the page but no ownership

Not sure if defining aliases for all legal combinations would also
help or add to the confusion; thinking out loud, something along the
lines of a cache-state taxonomy (e.g., Sweazey and Smith, fig. 3 [1]). You
already have the borrowed (as opposed to owned) and shared (as
opposed to exclusive) dimensions. So, aliases to build on these:

#define KVM_PGTABLE_STATE_BORROWED_SHARED \
	(KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED)
#define KVM_PGTABLE_STATE_BORROWED_EXCLUSIVE (KVM_PGTABLE_STATE_BORROWED)
#define KVM_PGTABLE_STATE_OWNED_SHARED (KVM_PGTABLE_STATE_SHARED)
#define KVM_PGTABLE_STATE_OWNED_EXCLUSIVE (0ULL)

You have thought about this way more than I have. But I think that
having clear documentation, ideally in the code itself via
helpers/enums/aliases, could help people like me who come to the code
later avoid shooting themselves in the foot.
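
For example, a trivial helper on top of those aliases (sketch, names are
mine) would already make any tracing or debug output self-describing:

static const char *kvm_pgtable_state_str(u64 state)
{
	switch (state) {
	case KVM_PGTABLE_STATE_OWNED_EXCLUSIVE:
		return "owned, exclusive";
	case KVM_PGTABLE_STATE_OWNED_SHARED:
		return "owned, shared";
	case KVM_PGTABLE_STATE_BORROWED_SHARED:
		return "borrowed, shared";
	case KVM_PGTABLE_STATE_BORROWED_EXCLUSIVE:
		return "borrowed, exclusive";
	default:
		return "invalid";
	}
}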

Thanks!
/fuad

[1] 
https://www.cs.auckland.ac.nz/compsci703s1c/archive/2008/resources/Sweazey.pdf

> Thanks,
> Quentin


Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Marc Zyngier
On Wed, 21 Jul 2021 11:38:40 +0100,
Sergey Senozhatsky  wrote:
> 
> On (21/07/21 09:40), Marc Zyngier wrote:
> > > 
> > > Can that be cured by just checking vcpu->preempted before calling
> > > kvm_update_vcpu_preempted() ?
> > 
> > It isn't obvious to me that this is the right thing to do.
> > vcpu->preempted is always updated on sched-out from the preempt
> > notifier if the vcpu was on the run-queue, so my guess is that it will
> > always be set when switching to another task.
> > 
> > What you probably want is to check whether the vcpu is blocked by
> > introspecting the wait-queue with:
> > 
> > rcuwait_active(kvm_arch_vcpu_get_wait(vcpu))
> > 
> > which will tell you whether you are blocking or not. We are already
> > using a similar construct for arming a background timer in this case.
> 
> Can we examine if vcpu->run->exit_reason == WFE/WFI and avoid setting
> preempted state if so?

We never go back to userspace for WFI/WFE, so no reason to populate
the run structure.

Checking for the blocked state is the right thing to do, and we
already have the primitive for this. Just use it.
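
For illustration, the check could look something like the sketch
below; kvm_update_vcpu_preempted() is the helper introduced by this
series, and where exactly this is called from kvm_arch_vcpu_put() is
only indicative:

/*
 * Sketch only: report the vcpu as preempted only if it was actually
 * runnable. A vcpu blocked on WFI/WFE is idle, not preempted.
 */
static void kvm_vcpu_report_preempted(struct kvm_vcpu *vcpu)
{
        /* Blocked in kvm_vcpu_block() (WFI/WFE): leave the state alone. */
        if (rcuwait_active(kvm_arch_vcpu_get_wait(vcpu)))
                return;

        kvm_update_vcpu_preempted(vcpu, true); /* helper from this series */
}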

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCHv2 2/4] arm64: add guest pvstate support

2021-07-21 Thread Marc Zyngier
On Wed, 21 Jul 2021 09:47:52 +0100,
Sergey Senozhatsky  wrote:
> 
> On (21/07/21 09:22), Marc Zyngier wrote:
> > On Wed, 21 Jul 2021 03:05:25 +0100,
> > Sergey Senozhatsky  wrote:
> > > 
> > > On (21/07/12 16:42), Marc Zyngier wrote:
> > > > > 
> > > > > PV-vcpu-state is a per-CPU struct which, for the time being,
> > > > > holds the boolean `preempted' vCPU state. During startup,
> > > > > provided that the host supports PV-state, each guest vCPU sends
> > > > > a pointer to its per-CPU variable to the host as a payload
> > > > 
> > > > What is the expected memory type for this memory region? What is its
> > > > life cycle? Where is it allocated from?
> > > 
> Guest per-CPU area, whose physical address is shared with the
> host.
> > 
> > Again: what are the memory types you expect this to be used with?
> 
> I heard your questions, I'm trying to figure out the answers now.
> 
> As for the memory type - I presume you are talking about coherent vs
> non-coherent memory.

No. I'm talking about cacheable vs non-cacheable. The ARM architecture
is always coherent for memory that is inner-shareable, which applies
to any system running Linux. On the other hand, there is no
architected cache snooping when using non-cacheable accesses.

> Can guest per-CPU memory be non-coherent? The guest never writes
> anything to the region of memory it shares with the host; it only
> reads what the host writes to it. All reads and writes are done from
> the CPU (no device DMA access, etc.).
> 
> Do we need any cache flushes/syncs in this case?

If you expect the guest to have non-cacheable mappings (or to run with
its MMU off at any point, which amounts to the same thing) *and* still
be able to access the shared page, then *someone* will have to perform
CMOs to make these writes visible to the PoC (unless you have FWB).

Needless to say, this would kill any sort of performance gain this
feature could hypothetically bring. Defining the scope for the access
would help mitigating this, even if that's just a sentence saying "the
shared page *must* be accessed from a cacheable mapping".
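
As an illustration of that constraint, the guest side would simply
keep the structure in normal, cacheable memory (its linear map) and
read it with plain loads. The names below are assumptions based on
this series, not mainline code:

/* Sketch only: guest per-CPU state shared with the host. */
struct pv_vcpu_state {
        u8 preempted;           /* written by the host, read by the guest */
        u8 reserved[63];
} __aligned(64);

static DEFINE_PER_CPU(struct pv_vcpu_state, pv_vcpu_state);

static bool pv_vcpu_is_preempted(int cpu)
{
        /*
         * Normal cacheable, inner-shareable memory: a plain READ_ONCE()
         * is enough, no cache maintenance required.
         */
        return READ_ONCE(per_cpu(pv_vcpu_state, cpu).preempted);
}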

> 
> > When will the hypervisor ever stop accessing this?
> 
> KVM always accesses it for the vcpus that are being scheduled out or
> scheduled in on the host side.

I was more hinting at whether there was a way to disable this at
runtime. Think of a guest using kexec, for example, where you really
don't want the hypervisor to start messing with memory that has been
since reallocated by the guest.

> > How does it work across reset?
> 
> I need to figure out what happens during reset/migration in the first
> place.

Yup.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH] doc/arm: take care restore order of GICR_* in ITS restore

2021-07-21 Thread Marc Zyngier
On Wed, 21 Jul 2021 10:20:19 +0100,
Jianyong Wu  wrote:
> 
> When restoring the GIC/ITS, GICR_CTLR must be restored after
> GICR_PROPBASER and GICR_PENDBASER. This is important, as both
> GICR_PROPBASER and GICR_PENDBASER will fail to be loaded if LPIs are
> already enabled in GICR_CTLR. Keeping the restore order above avoids
> that issue. Calling it out in the documentation is very helpful and
> may avoid lots of debugging work.

But that's something that is already mandated by the architecture,
isn't it? See "5.1 LPIs" in the architecture spec:



If GICR_PROPBASER is updated when GICR_CTLR.EnableLPIs == 1, the
effects are UNPREDICTABLE.

[...]

If GICR_PENDBASER is updated when GICR_CTLR.EnableLPIs == 1, the
effects are UNPREDICTABLE.



The point of this documentation is to make it explicit what is *not*
covered by the architecture. Anything that is in the architecture
still applies, and shouldn't be overlooked.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v2 4/4] KVM: arm64: Remove PMSWINC_EL0 shadow register

2021-07-21 Thread Marc Zyngier
On Tue, 20 Jul 2021 17:44:32 +0100,
Alexandru Elisei  wrote:
> 
> Hi Marc,
> 
> On 7/19/21 5:56 PM, Marc Zyngier wrote:
> > Hi Alex,
> >
> > On 2021-07-19 17:35, Alexandru Elisei wrote:
> >> Hi Marc,
> >>
> >> On 7/19/21 1:39 PM, Marc Zyngier wrote:
> >>> We keep an entry for the PMSWINC_EL0 register in the vcpu structure,
> >>> while *never* writing anything there outside of reset.
> >>>
> >>> Given that the register is defined as write-only, that we always
> >>> trap when this register is accessed, there is little point in saving
> >>> anything anyway.
> >>>
> >>> Get rid of the entry, and save a mighty 8 bytes per vcpu structure.
> >>>
> >>> We still need to keep it exposed to userspace in order to preserve
> >>> backward compatibility with previously saved VMs. Since userspace
> >>> cannot expect any effect of writing to PMSWINC_EL0, treat the
> >>> register as RAZ/WI for the purpose of userspace access.
> >>>
> >>> Signed-off-by: Marc Zyngier 
> >>> ---
> >>>  arch/arm64/include/asm/kvm_host.h |  1 -
> >>>  arch/arm64/kvm/sys_regs.c | 21 -
> >>>  2 files changed, 20 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >>> index 41911585ae0c..afc169630884 100644
> >>> --- a/arch/arm64/include/asm/kvm_host.h
> >>> +++ b/arch/arm64/include/asm/kvm_host.h
> >>> @@ -185,7 +185,6 @@ enum vcpu_sysreg {
> >>>  PMCNTENSET_EL0,    /* Count Enable Set Register */
> >>>  PMINTENSET_EL1,    /* Interrupt Enable Set Register */
> >>>  PMOVSSET_EL0,    /* Overflow Flag Status Set Register */
> >>> -    PMSWINC_EL0,    /* Software Increment Register */
> >>>  PMUSERENR_EL0,    /* User Enable Register */
> >>>
> >>>  /* Pointer Authentication Registers in a strict increasing order. */
> >>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> >>> index f22139658e48..a1f5101f49a3 100644
> >>> --- a/arch/arm64/kvm/sys_regs.c
> >>> +++ b/arch/arm64/kvm/sys_regs.c
> >>> @@ -1286,6 +1286,20 @@ static int set_raz_id_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
> >>>  return __set_id_reg(vcpu, rd, uaddr, true);
> >>>  }
> >>>
> >>> +static int set_wi_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
> >>> +  const struct kvm_one_reg *reg, void __user *uaddr)
> >>> +{
> >>> +    int err;
> >>> +    u64 val;
> >>> +
> >>> +    /* Perform the access even if we are going to ignore the value */
> >>> +    err = reg_from_user(&val, uaddr, sys_reg_to_index(rd));
> >>
> >> I don't understand why the read still happens if the value is ignored.
> >> Just so KVM
> >> preserves the previous behaviour and tells userspace there was an error?
> >
> > If userspace has given us a duff pointer, it needs to know about it.
> 
> Makes sense, thanks.
> 
> >
> >>> +    if (err)
> >>> +    return err;
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>>  static bool access_ctr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
> >>>     const struct sys_reg_desc *r)
> >>>  {
> >>> @@ -1629,8 +1643,13 @@ static const struct sys_reg_desc sys_reg_descs[] = {
> >>>    .access = access_pmcnten, .reg = PMCNTENSET_EL0 },
> >>>  { PMU_SYS_REG(SYS_PMOVSCLR_EL0),
> >>>    .access = access_pmovs, .reg = PMOVSSET_EL0 },
> >>> +    /*
> >>> + * PM_SWINC_EL0 is exposed to userspace as RAZ/WI, as it was
> >>> + * previously (and pointlessly) advertised in the past...
> >>> + */
> >>>  { PMU_SYS_REG(SYS_PMSWINC_EL0),
> >>> -  .access = access_pmswinc, .reg = PMSWINC_EL0 },
> >>> +  .get_user = get_raz_id_reg, .set_user = set_wi_reg,
> >>
> >> In my opinion, the call chain to return 0 looks pretty confusing, as
> >> the functions seem to be made for ID register accesses, and the leaf
> >> function, read_id_reg(), tries to match this register with a list of
> >> ID registers. Since we have already added a new function just for
> >> PMSWINC_EL0, I was wondering if adding another function, something
> >> like get_raz_reg(), would make more sense.
> >
> > In that case, I'd rather just kill get_raz_id_reg() and replace it with
> > this get_raz_reg(). If we treat something as RAZ, who cares whether it is
> > an idreg or not?
> 
> I agree: the Arm ARM doesn't make a distinction between ID registers
> and other system registers in the definition of RAZ, so I don't think
> KVM should either. And the way read_id_reg() is written allows
> returning a value different from 0 even if raz is true, which in my
> opinion could only happen because of a bug in KVM.
> 
> I can have a go at writing the patch(es) on top of this series, if
> you want. At the moment I'm rewriting the KVM SPE series, so it will
> be a few weeks until I get around to doing it though.

We can do that at any time; it is just a cleanup without any guest- or
userspace-visible effect. If I get a spare hour, I'll have a look.
Otherwise, this can wait a bit.
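
For reference, such a generic accessor could be as small as the sketch
below (assuming the reg_to_user() helper already used elsewhere in
sys_regs.c; this is only an illustration, not the actual patch):

/* Sketch only: report zero for any userspace read, ID register or not. */
static int get_raz_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
                       const struct kvm_one_reg *reg, void __user *uaddr)
{
        const u64 val = 0;

        return reg_to_user(uaddr, &val, sys_reg_to_index(rd));
}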

[PATCH] doc/arm: take care restore order of GICR_* in ITS restore

2021-07-21 Thread Jianyong Wu
When restoring the GIC/ITS, GICR_CTLR must be restored after
GICR_PROPBASER and GICR_PENDBASER. This is important, as both
GICR_PROPBASER and GICR_PENDBASER will fail to be loaded if LPIs are
already enabled in GICR_CTLR. Keeping the restore order above avoids
that issue. Calling it out in the documentation is very helpful and may
avoid lots of debugging work.

Signed-off-by: Jianyong Wu 
---
 Documentation/virt/kvm/devices/arm-vgic-its.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.rst 
b/Documentation/virt/kvm/devices/arm-vgic-its.rst
index d257eddbae29..6b36de6937f8 100644
--- a/Documentation/virt/kvm/devices/arm-vgic-its.rst
+++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst
@@ -126,7 +126,8 @@ ITS Restore Sequence:
 The following ordering must be followed when restoring the GIC and the ITS:
 
 a) restore all guest memory and create vcpus
-b) restore all redistributors
+b) restore all redistributors:
+   make sure GICR_CTLR is restored after GICR_PROPBASER and GICR_PENDBASER
 c) provide the ITS base address
(KVM_DEV_ARM_VGIC_GRP_ADDR)
 d) restore the ITS in the following order:
-- 
2.17.1
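
For a VMM, the ordering described above boils down to something like
the sketch below; set_redist_reg() stands in for the VMM's accessor
for the KVM_DEV_ARM_VGIC_GRP_REDIST_REGS device group (it is a
hypothetical helper, not a real API), and the offsets are the
architectural GICR_* register offsets:

/* Sketch only: restore order for one redistributor. */
#include <stdint.h>

#define GICR_CTLR       0x0000
#define GICR_PROPBASER  0x0070
#define GICR_PENDBASER  0x0078

struct redist_state {
        uint64_t propbaser;
        uint64_t pendbaser;
        uint32_t ctlr;
};

/* Hypothetical VMM helper wrapping KVM_SET_DEVICE_ATTR on the vGICv3 fd. */
void set_redist_reg(int vgic_fd, uint64_t mpidr, uint32_t offset, uint64_t val);

static void restore_redistributor(int vgic_fd, uint64_t mpidr,
                                  const struct redist_state *st)
{
        /* Base registers first, while GICR_CTLR.EnableLPIs is still 0... */
        set_redist_reg(vgic_fd, mpidr, GICR_PROPBASER, st->propbaser);
        set_redist_reg(vgic_fd, mpidr, GICR_PENDBASER, st->pendbaser);

        /* ...and only then GICR_CTLR, which may set EnableLPIs. */
        set_redist_reg(vgic_fd, mpidr, GICR_CTLR, st->ctlr);
}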



Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Marc Zyngier
On Wed, 21 Jul 2021 02:15:47 +0100,
Sergey Senozhatsky  wrote:
> 
> On (21/07/12 17:24), Marc Zyngier wrote:
> > >  
> > >  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> > >  {
> > > + kvm_update_vcpu_preempted(vcpu, true);
> > 
> > This doesn't look right. With this, you are now telling the guest that
> > a vcpu that is blocked on WFI is preempted. This really isn't the
> > case, as it has voluntarily entered a low-power mode while waiting for
> > an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
> > be running either.
> 
> I suppose you are talking about kvm_vcpu_block().

kvm_vcpu_block() is how things are implemented. WFI is the instruction
I'm concerned about.

> Well, it checks kvm_vcpu_check_block(), but then it simply schedule()s
> the vcpu process out, which does look like "the vcpu is
> preempted". Once we sched_in() that vcpu process again, we mark it as
> non-preempted, even though it remains in the KVM WFx handler. Why isn't
> that right?

Because the vcpu hasn't been "preempted". It has *voluntarily* gone
into a low-power mode, and how KVM implements this "low-power mode" is
none of the guest's business. This is exactly the same behaviour that
you will have on bare metal. From a Linux guest perspective, the vcpu
is *idle*, not doing anything, and only waiting for an interrupt to
start executing again.

This is a fundamentally different concept from preempting a vcpu
because its time-slice is up. In this second case, you can indeed
mitigate things by exposing steal time and preemption status as you
break the illusion of a machine that is completely controlled by the
guest.

If the "reched on interrupt delivery while blocked on WFI" is too slow
for you, then *that* is the thing that needs addressing. Feeding extra
state to the guest doesn't help.

> Another call path is iret:
> 
> 
> __schedule()
>  context_switch()
>   prepare_task_switch()
>    fire_sched_out_preempt_notifiers()
>     kvm_sched_out()
>      kvm_arch_vcpu_put()

I'm not sure how an x86 concept is relevant here.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


RE: [RFC PATCH 4/5] iommu/arm-smmu-v3: Use pinned VMID for NESTED stage with BTM

2021-07-21 Thread Shameerali Kolothum Thodi
Hi Jean,

> -Original Message-
> From: Jean-Philippe Brucker [mailto:jean-phili...@linaro.org]
> Sent: 04 March 2021 17:11
> To: Shameerali Kolothum Thodi 
> Cc: linux-arm-ker...@lists.infradead.org; io...@lists.linux-foundation.org;
> kvmarm@lists.cs.columbia.edu; m...@kernel.org;
> alex.william...@redhat.com; eric.au...@redhat.com;
> zhangfei@linaro.org; Jonathan Cameron
> ; Zengtao (B) ;
> linux...@openeuler.org
> Subject: Re: [RFC PATCH 4/5] iommu/arm-smmu-v3: Use pinned VMID for
> NESTED stage with BTM

[...]

> >
> > kfree(smmu_domain);
> > @@ -3199,6 +3230,17 @@ static int arm_smmu_attach_pasid_table(struct
> iommu_domain *domain,
> > !(smmu->features & ARM_SMMU_FEAT_2_LVL_CDTAB))
> > goto out;
> >
> > +   if (smmu->features & ARM_SMMU_FEAT_BTM) {
> > +   ret = arm_smmu_pinned_vmid_get(smmu_domain);
> > +   if (ret < 0)
> > +   goto out;
> > +
> > +   if (smmu_domain->s2_cfg.vmid)
> > +   arm_smmu_bitmap_free(smmu->vmid_map,
> smmu_domain->s2_cfg.vmid);
> > +
> > +   smmu_domain->s2_cfg.vmid = (u16)ret;
> 
> That will require a TLB invalidation on the old VMID, once the STE is
> rewritten.
> 
> More generally I think this pinned VMID set conflicts with that of
> stage-2-only domains (which is the default state until a guest attaches a
> PASID table). Say you have one guest using DOMAIN_NESTED without PASID
> table, just DMA to IPA using VMID 0x8000. Now another guest attaches a
> PASID table and obtains the same VMID from KVM. The stage-2 translation
> might use TLB entries from the other guest, no?  They'll both create
> stage-2 TLB entries with {StreamWorld=NS-EL1, VMID=0x8000}

Now that we are trying to align the KVM VMID allocation algorithm with
that of the ASID allocator [1], I attempted to use it for the SMMU
pinned VMID allocation. But the issue you have mentioned above is still
valid.

As a solution, what I have tried now is to follow what pinned ASIDs do
in SVA:
 - Use an xarray for private VMIDs
 - Get the pinned VMID from KVM for DOMAIN_NESTED with a PASID table
 - If the new pinned VMID is in use by a private domain, update that
   private VMID (a VMID update to a live STE).

This seems to work, but I still need to run more tests with it.
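
Roughly, the clash handling looks like the sketch below;
arm_smmu_update_s2_vmid(), arm_smmu_alloc_private_vmid() and the
private-VMID xarray are illustrative names, not actual driver code:

/*
 * Sketch only: attach a KVM-pinned VMID to a nested domain, moving a
 * private (stage-2 only) domain that happens to use the same VMID.
 */
static int arm_smmu_attach_pinned_vmid(struct arm_smmu_domain *smmu_domain,
                                       u16 pinned_vmid)
{
        struct arm_smmu_device *smmu = smmu_domain->smmu;
        struct arm_smmu_domain *clash;
        int new_vmid;

        clash = xa_load(&smmu->private_vmids, pinned_vmid);
        if (clash) {
                new_vmid = arm_smmu_alloc_private_vmid(smmu);
                if (new_vmid < 0)
                        return new_vmid;
                /* Live STE update, then TLB invalidation of the old VMID */
                arm_smmu_update_s2_vmid(clash, new_vmid);
                xa_erase(&smmu->private_vmids, pinned_vmid);
        }

        smmu_domain->s2_cfg.vmid = pinned_vmid;
        return 0;
}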

> It's tempting to allocate all VMIDs through KVM instead, but that will
> force a dependency on KVM to use VFIO_TYPE1_NESTING_IOMMU and might
> break
> existing users of that extension (though I'm not sure there are any).
> Instead we might need to restrict the SMMU VMID bitmap to match the
> private VMID set in KVM.

Another solution I have in mind is to make the new KVM VMID allocator
common between SMMUv3 and KVM. This would avoid splitting VMIDs into
private and shared sets, and would remove the need for live updates of
the STE VMID. One possible drawback is a smaller pool of available KVM
VMIDs, but with a 16-bit VMID space I am not sure how much of a concern
that is.

Please let me know your thoughts.

Thanks,
Shameer

[1]. https://lore.kernel.org/kvmarm/20210616155606.2806-1-shameerali.kolothum.th...@huawei.com/



Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

2021-07-21 Thread Marc Zyngier
On Tue, 20 Jul 2021 19:44:53 +0100,
Joel Fernandes  wrote:
> 
> On Mon, Jul 12, 2021 at 12:24 PM Marc Zyngier  wrote:
> >
> [...]
> > >  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> > >  {
> > > + kvm_update_vcpu_preempted(vcpu, true);
> >
> > This doesn't look right. With this, you are now telling the guest that
> > a vcpu that is blocked on WFI is preempted. This really isn't the
> > case, as it has voluntarily entered a low-power mode while waiting for
> > an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
> > be running either.
> 
> Can that be cured by just checking vcpu->preempted before calling
> kvm_update_vcpu_preempted() ?

It isn't obvious to me that this is the right thing to do.
vcpu->preempted is always updated on sched-out from the preempt
notifier if the vcpu was on the run-queue, so my guess is that it will
always be set when switching to another task.

What you probably want is to check whether the vcpu is blocked by
introspecting the wait-queue with:

rcuwait_active(kvm_arch_vcpu_get_wait(vcpu))

which will tell you whether you are blocking or not. We are already
using a similar construct for arming a background timer in this case.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCHv2 2/4] arm64: add guest pvstate support

2021-07-21 Thread Marc Zyngier
On Wed, 21 Jul 2021 03:05:25 +0100,
Sergey Senozhatsky  wrote:
> 
> On (21/07/12 16:42), Marc Zyngier wrote:
> > > 
> > > PV-vcpu-state is a per-CPU struct which, for the time being,
> > > holds the boolean `preempted' vCPU state. During startup,
> > > provided that the host supports PV-state, each guest vCPU sends
> > > a pointer to its per-CPU variable to the host as a payload
> > 
> > What is the expected memory type for this memory region? What is its
> > life cycle? Where is it allocated from?
> 
> Guest per-CPU area, whose physical address is shared with the
> host.

Again: what are the memory types you expect this to be used with? When
will the hypervisor ever stop accessing this? How does it work across
reset?

I'm sorry to be that pressing, but these are the gory details that
actually matter if you want this thing to be maintainable.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.