[PATCH kernel v3] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

2022-09-20 Thread Alexey Kardashevskiy
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.

This stops advertising the capability on MIPS and RISC-V as these
do not select HAVE_KVM_IRQFD and do not implement IRQFD resampling
anyway.
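
For reference, this is roughly how a VMM consumes the capability now that
the answer is per-VM rather than global (an illustrative userspace sketch,
not part of the patch; vm_fd, irq_fd, resample_fd and gsi are assumed to
have been set up elsewhere, error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int setup_resample_irqfd(int vm_fd, int irq_fd, int resample_fd,
                                unsigned int gsi)
{
        struct kvm_irqfd irqfd = {
                .fd = irq_fd,
                .gsi = gsi,
                .flags = KVM_IRQFD_FLAG_RESAMPLE,
                .resamplefd = resample_fd,
        };

        /* After this patch the VM ioctl returns 0 on XIVE (POWER9+). */
        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IRQFD_RESAMPLE) <= 0)
                return -1;      /* fall back to a non-resampling setup */

        return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}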

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Nicholas Piggin 
---
Changes:
v3:
* removed all ifdeffery
* removed the capability for MIPS and RISCV
* adjusted the commit log about MIPS and RISCV

v2:
* removed ifdef for ARM64.
---
 arch/arm64/kvm/arm.c   | 1 +
 arch/powerpc/kvm/powerpc.c | 6 ++
 arch/s390/kvm/kvm-s390.c   | 1 +
 arch/x86/kvm/x86.c | 1 +
 virt/kvm/kvm_main.c| 1 -
 5 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2ff0ef62abad..d2daa4d375b5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -218,6 +218,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
+   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index fb1490761c87..908ce8bd91c9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -593,6 +593,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
break;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+   r = !xive_enabled();
+   break;
+#endif
+
case KVM_CAP_PPC_ALLOC_HTAB:
r = hv_enabled;
break;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index edfd4bbd0cba..7521adadb81b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -577,6 +577,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_S390_DIAG318:
case KVM_CAP_S390_MEM_OP_EXTENSION:
+   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43a6a7efc6ec..2d6c5a8fdf14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4395,6 +4395,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VAPIC:
case KVM_CAP_ENABLE_CAP:
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..05cf94013f02 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4447,7 +4447,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
case KVM_CAP_IRQFD:
-   case KVM_CAP_IRQFD_RESAMPLE:
 #endif
case KVM_CAP_IOEVENTFD_ANY_LENGTH:
case KVM_CAP_CHECK_EXTENSION_VM:
-- 
2.37.3



Re: [PATCH v3 3/7] mm: Add PG_arch_3 page flag

2022-09-20 Thread Peter Collingbourne
On Tue, Sep 20, 2022 at 9:58 AM Catalin Marinas  wrote:
>
> On Tue, Sep 20, 2022 at 05:33:42PM +0100, Marc Zyngier wrote:
> > On Tue, 20 Sep 2022 16:39:47 +0100,
> > Catalin Marinas  wrote:
> > > On Mon, Sep 19, 2022 at 07:12:53PM +0100, Marc Zyngier wrote:
> > > > On Mon, 05 Sep 2022 18:01:55 +0100,
> > > > Catalin Marinas  wrote:
> > > > > Peter, please let me know if you want to pick this series up together
> > > > > with your other KVM patches. Otherwise I can post it separately, it's
> > > > > > worth merging it on its own as it clarifies the page flag vs tag setting
> > > > > ordering.
> > > >
> > > > I'm looking at queuing this, but I'm confused by this comment. Do I
> > > > need to pick this as part of the series? Or is this an independent
> > > > thing (my hunch is that it is actually required not to break other
> > > > architectures...).
> > >
> > > This series (at least the first patches) won't apply cleanly on
> > > top of 6.0-rc1 and, of course, we shouldn't break other architectures. I
> > > can repost the whole series but I don't have the setup to test the
> > > MAP_SHARED KVM option (unless Peter plans to post it soon).
> >
> > I don't feel brave enough to take a series affecting all architectures
>
> It shouldn't affect the others, the only change is that PG_arch_2 is now
> only defined for arm64 but no other architecture is using it. The
> problem with loongarch is that it doesn't have enough spare bits in
> page->flags and even without any patches I think it's broken with the
> right value for NR_CPUS.
>
> > so late in the game, and the whole thing had very little arm64
> > exposure. The latest QEMU doesn't seem to work anymore, so I don't
> > have any MTE-capable emulation (and using the FVP remotely is a pain
> > in the proverbial neck).
> >
> > I'll come back to this after the merge window, should Peter decide to
> > respin the series.
>
> It makes sense.

Apologies for the delay, I've now sent out v4 of this series which
includes the patches on your branch.

Peter


[PATCH v4 7/8] KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled

2022-09-20 Thread Peter Collingbourne
Certain VMMs such as crosvm have features (e.g. sandboxing) that depend
on being able to map guest memory as MAP_SHARED. The current restriction
on sharing MAP_SHARED pages with the guest is preventing the use of
those features with MTE. Now that the races between tasks concurrently
clearing tags on the same page have been fixed, remove this restriction.

Note that this is a relaxation of the ABI.
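
As a rough illustration of the VMM flow this enables (a sketch, not part of
the patch: map_shared_guest_ram() and its arguments are made-up names, error
handling is omitted, and KVM_CAP_ARM_MTE is assumed to already be enabled on
the VM; a memfd/tmpfs backing is what makes the VMA VM_MTE_ALLOWED, see the
documentation patch later in this series):

#define _GNU_SOURCE
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_shared_guest_ram(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_userspace_memory_region region;
        int memfd = memfd_create("guest-ram", 0);
        void *hva;

        ftruncate(memfd, size);
        /* MAP_SHARED guest RAM, previously rejected for MTE guests */
        hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

        region = (struct kvm_userspace_memory_region) {
                .slot            = 0,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                .userspace_addr  = (unsigned long)hva,
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
        return hva;
}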

Signed-off-by: Peter Collingbourne 
Reviewed-by: Catalin Marinas 
Reviewed-by: Steven Price 
---
 arch/arm64/kvm/mmu.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e34fbabd8b93..996ea11fb0e5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1075,14 +1075,6 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 
 static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 {
-   /*
-* VM_SHARED mappings are not allowed with MTE to avoid races
-* when updating the PG_mte_tagged page flag, see
-* sanitise_mte_tags for more details.
-*/
-   if (vma->vm_flags & VM_SHARED)
-   return false;
-
return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 5/8] arm64: mte: Lock a page for MTE tag initialisation

2022-09-20 Thread Peter Collingbourne
From: Catalin Marinas 

Initialising the tags and setting PG_mte_tagged flag for a page can race
between multiple set_pte_at() on shared pages or setting the stage 2 pte
via user_mem_abort(). Introduce a new PG_mte_lock flag as PG_arch_3 and
set it before attempting page initialisation. Given that PG_mte_tagged
is never cleared for a page, consider setting this flag to mean page
unlocked and wait on this bit with acquire semantics if the page is
locked:

- try_page_mte_tagging() - lock the page for tagging, return true if it
  can be tagged, false if already tagged. No acquire semantics if it
  returns true (PG_mte_tagged not set) as there is no serialisation with
  a previous set_page_mte_tagged().

- set_page_mte_tagged() - set PG_mte_tagged with release semantics.

The two-bit locking is based on Peter Collingbourne's idea.
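
A sketch of the intended calling pattern (mirroring what the fault and
copy-on-write hunks in this series do; 'page' is whichever page is about to
have its tags initialised):

        if (try_page_mte_tagging(page)) {
                mte_clear_page_tags(page_address(page));
                set_page_mte_tagged(page);
        }
        /* from here on the page's tags are guaranteed to be initialised */

Only the thread that wins the PG_mte_lock bit does the clearing; everyone
else waits in try_page_mte_tagging() until PG_mte_tagged is observed with
acquire semantics.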

Signed-off-by: Catalin Marinas 
Signed-off-by: Peter Collingbourne 
Reviewed-by: Steven Price 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Peter Collingbourne 
---
 arch/arm64/include/asm/mte.h | 35 +++-
 arch/arm64/include/asm/pgtable.h |  4 ++--
 arch/arm64/kernel/cpufeature.c   |  2 +-
 arch/arm64/kernel/mte.c  | 12 +--
 arch/arm64/kvm/guest.c   | 16 +--
 arch/arm64/kvm/mmu.c |  2 +-
 arch/arm64/mm/copypage.c |  2 ++
 arch/arm64/mm/fault.c|  2 ++
 arch/arm64/mm/mteswap.c  | 11 +-
 9 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 46618c575eac..be6560e1ff2b 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
 unsigned long mte_copy_tags_to_user(void __user *to, void *from,
unsigned long n);
 int mte_save_tags(struct page *page);
 void mte_save_page_tags(const void *page_addr, void *tag_storage);
-bool mte_restore_tags(swp_entry_t entry, struct page *page);
+void mte_restore_tags(swp_entry_t entry, struct page *page);
 void mte_restore_page_tags(void *page_addr, const void *tag_storage);
 void mte_invalidate_tags(int type, pgoff_t offset);
 void mte_invalidate_tags_area(int type);
@@ -36,6 +36,8 @@ void mte_free_tag_storage(char *storage);
 
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
+/* simple lock to avoid multiple threads tagging the same page */
+#define PG_mte_lockPG_arch_3
 
 static inline void set_page_mte_tagged(struct page *page)
 {
@@ -60,6 +62,33 @@ static inline bool page_mte_tagged(struct page *page)
return ret;
 }
 
+/*
+ * Lock the page for tagging and return 'true' if the page can be tagged,
+ * 'false' if already tagged. PG_mte_tagged is never cleared and therefore the
+ * locking only happens once for page initialisation.
+ *
+ * The page MTE lock state:
+ *
+ *   Locked:   PG_mte_lock && !PG_mte_tagged
+ *   Unlocked: !PG_mte_lock || PG_mte_tagged
+ *
+ * Acquire semantics only if the page is tagged (returning 'false').
+ */
+static inline bool try_page_mte_tagging(struct page *page)
+{
+   if (!test_and_set_bit(PG_mte_lock, &page->flags))
+   return true;
+
+   /*
+* The tags are either being initialised or may have been initialised
+* already. Check if the PG_mte_tagged flag has been set or wait
+* otherwise.
+*/
+   smp_cond_load_acquire(&page->flags, VAL & (1UL << PG_mte_tagged));
+
+   return false;
+}
+
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
@@ -84,6 +113,10 @@ static inline bool page_mte_tagged(struct page *page)
 {
return false;
 }
+static inline bool try_page_mte_tagging(struct page *page)
+{
+   return false;
+}
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 98b638441521..8735ac1a1e32 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1049,8 +1049,8 @@ static inline void arch_swap_invalidate_area(int type)
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
-   if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
-   set_page_mte_tagged(&folio->page);
+   if (system_supports_mte())
+   mte_restore_tags(entry, &folio->page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index ab3312788d60..e2c0a707a941 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2049,7 +2049,7 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
 * Clear the tags in the zero page. This needs to be done via the
 * linear map which has the Tagged attribute.
 */
-   if (!page_mte_tagged(ZERO_PAGE(0))) {
+   if 

[PATCH v4 4/8] mm: Add PG_arch_3 page flag

2022-09-20 Thread Peter Collingbourne
As with PG_arch_2, this flag is only allowed on 64-bit architectures due
to the shortage of bits available. It will be used by the arm64 MTE code
in subsequent patches.

Signed-off-by: Peter Collingbourne 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Steven Price 
[catalin.mari...@arm.com: added flag preserving in __split_huge_page_tail()]
Signed-off-by: Catalin Marinas 
---
 fs/proc/page.c| 1 +
 include/linux/kernel-page-flags.h | 1 +
 include/linux/page-flags.h| 1 +
 include/trace/events/mmflags.h| 1 +
 mm/huge_memory.c  | 1 +
 5 files changed, 5 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6f4b4bcb9b0d..43d371e6b366 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -220,6 +220,7 @@ u64 stable_page_flags(struct page *page)
u |= kpf_copy_bit(k, KPF_ARCH,  PG_arch_1);
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
u |= kpf_copy_bit(k, KPF_ARCH_2,PG_arch_2);
+   u |= kpf_copy_bit(k, KPF_ARCH_3,PG_arch_3);
 #endif
 
return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index eee1877a354e..859f4b0c1b2b 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -18,5 +18,6 @@
 #define KPF_UNCACHED   39
 #define KPF_SOFTDIRTY  40
 #define KPF_ARCH_2 41
+#define KPF_ARCH_3 42
 
 #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5dc7977edf9d..c50ce2812f17 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -134,6 +134,7 @@ enum pageflags {
 #endif
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
PG_arch_2,
+   PG_arch_3,
 #endif
 #ifdef CONFIG_KASAN_HW_TAGS
PG_skip_kasan_poison,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 4673e58a7626..9db52bc4ce19 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -130,6 +130,7 @@ IF_HAVE_PG_HWPOISON(PG_hwpoison,"hwpoison"  )   \
 IF_HAVE_PG_IDLE(PG_young,  "young" )   \
 IF_HAVE_PG_IDLE(PG_idle,   "idle"  )   \
 IF_HAVE_PG_ARCH_X(PG_arch_2,   "arch_2")   \
+IF_HAVE_PG_ARCH_X(PG_arch_3,   "arch_3")   \
 IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 
 #define show_page_flags(flags) \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 24974a4ce28f..c7c5f9fb226d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2446,6 +2446,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 (1L << PG_unevictable) |
 #ifdef CONFIG_ARCH_USES_PG_ARCH_X
 (1L << PG_arch_2) |
+(1L << PG_arch_3) |
 #endif
 (1L << PG_dirty) |
 LRU_GEN_MASK | LRU_REFS_MASK));
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 8/8] Documentation: document the ABI changes for KVM_CAP_ARM_MTE

2022-09-20 Thread Peter Collingbourne
Document both the restriction on VM_MTE_ALLOWED mappings and
the relaxation for shared mappings.

Signed-off-by: Peter Collingbourne 
Acked-by: Catalin Marinas 
---
 Documentation/virt/kvm/api.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..7afe603567fd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7486,8 +7486,9 @@ hibernation of the host; however the VMM needs to manually save/restore the
 tags as appropriate if the VM is migrated.
 
 When this capability is enabled all memory in memslots must be mapped as
-not-shareable (no MAP_SHARED), attempts to create a memslot with a
-MAP_SHARED mmap will result in an -EINVAL return.
+``MAP_ANONYMOUS`` or with a RAM-based file mapping (``tmpfs``, ``memfd``),
+attempts to create a memslot with an invalid mmap will result in an
+-EINVAL return.
 
 When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
 perform a bulk copy of tags to/from the guest.
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 6/8] KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled

2022-09-20 Thread Peter Collingbourne
Previously we allowed creating a memslot containing a private mapping that
was not VM_MTE_ALLOWED, but would later reject KVM_RUN with -EFAULT. Now
we reject the memory region at memslot creation time.

Since this is a minor tweak to the ABI (a VMM that created one of
these memslots would fail later anyway), no VMM to my knowledge has
MTE support yet, and the hardware with the necessary features is not
generally available, we can probably make this ABI change at this point.

Signed-off-by: Peter Collingbourne 
Reviewed-by: Catalin Marinas 
Reviewed-by: Steven Price 
---
 arch/arm64/kvm/mmu.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index bebfd1e0bbf0..e34fbabd8b93 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1073,6 +1073,19 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
}
 }
 
+static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
+{
+   /*
+* VM_SHARED mappings are not allowed with MTE to avoid races
+* when updating the PG_mte_tagged page flag, see
+* sanitise_mte_tags for more details.
+*/
+   if (vma->vm_flags & VM_SHARED)
+   return false;
+
+   return vma->vm_flags & VM_MTE_ALLOWED;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -1249,9 +1262,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
 
if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
-   /* Check the VMM hasn't introduced a new VM_SHARED VMA */
-   if ((vma->vm_flags & VM_MTE_ALLOWED) &&
-   !(vma->vm_flags & VM_SHARED)) {
+   /* Check the VMM hasn't introduced a new disallowed VMA */
+   if (kvm_vma_mte_allowed(vma)) {
sanitise_mte_tags(kvm, pfn, vma_pagesize);
} else {
ret = -EFAULT;
@@ -1695,12 +1707,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
if (!vma)
break;
 
-   /*
-* VM_SHARED mappings are not allowed with MTE to avoid races
-* when updating the PG_mte_tagged page flag, see
-* sanitise_mte_tags for more details.
-*/
-   if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) {
+   if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) {
ret = -EINVAL;
break;
}
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 3/8] KVM: arm64: Simplify the sanitise_mte_tags() logic

2022-09-20 Thread Peter Collingbourne
From: Catalin Marinas 

Currently sanitise_mte_tags() checks if it's an online page before
attempting to sanitise the tags. Such detection should be done in the
caller via the VM_MTE_ALLOWED vma flag. Since kvm_set_spte_gfn() does
not have the vma, leave the page unmapped if not already tagged. Tag
initialisation will be done on a subsequent access fault in
user_mem_abort().

Signed-off-by: Catalin Marinas 
[p...@google.com: fix the page initializer]
Signed-off-by: Peter Collingbourne 
Reviewed-by: Steven Price 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Peter Collingbourne 
---
 arch/arm64/kvm/mmu.c | 40 +++-
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 012ed1bc0762..5a131f009cf9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1056,23 +1056,14 @@ static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
  * - mmap_lock protects between a VM faulting a page in and the VMM performing
  *   an mprotect() to add VM_MTE
  */
-static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
-unsigned long size)
+static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+ unsigned long size)
 {
unsigned long i, nr_pages = size >> PAGE_SHIFT;
-   struct page *page;
+   struct page *page = pfn_to_page(pfn);
 
if (!kvm_has_mte(kvm))
-   return 0;
-
-   /*
-* pfn_to_online_page() is used to reject ZONE_DEVICE pages
-* that may not support tags.
-*/
-   page = pfn_to_online_page(pfn);
-
-   if (!page)
-   return -EFAULT;
+   return;
 
for (i = 0; i < nr_pages; i++, page++) {
if (!page_mte_tagged(page)) {
@@ -1080,8 +1071,6 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
set_page_mte_tagged(page);
}
}
-
-   return 0;
 }
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
@@ -1092,7 +1081,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
bool write_fault, writable, force_pte = false;
bool exec_fault;
bool device = false;
-   bool shared;
unsigned long mmu_seq;
struct kvm *kvm = vcpu->kvm;
	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
@@ -1142,8 +1130,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vma_shift = get_vma_page_shift(vma, hva);
}
 
-   shared = (vma->vm_flags & VM_SHARED);
-
switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
case PUD_SHIFT:
@@ -1264,12 +1250,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
/* Check the VMM hasn't introduced a new VM_SHARED VMA */
-   if (!shared)
-   ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
-   else
+   if ((vma->vm_flags & VM_MTE_ALLOWED) &&
+   !(vma->vm_flags & VM_SHARED)) {
+   sanitise_mte_tags(kvm, pfn, vma_pagesize);
+   } else {
ret = -EFAULT;
-   if (ret)
goto out_unlock;
+   }
}
 
if (writable)
@@ -1491,15 +1478,18 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
kvm_pfn_t pfn = pte_pfn(range->pte);
-   int ret;
 
if (!kvm->arch.mmu.pgt)
return false;
 
WARN_ON(range->end - range->start != 1);
 
-   ret = sanitise_mte_tags(kvm, pfn, PAGE_SIZE);
-   if (ret)
+   /*
+* If the page isn't tagged, defer to user_mem_abort() for sanitising
+* the MTE tags. The S2 pte should have been unmapped by
+* mmu_notifier_invalidate_range_end().
+*/
+   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
return false;
 
/*
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 2/8] arm64: mte: Fix/clarify the PG_mte_tagged semantics

2022-09-20 Thread Peter Collingbourne
From: Catalin Marinas 

Currently the PG_mte_tagged page flag mostly means the page contains
valid tags and it should be set after the tags have been cleared or
restored. However, in mte_sync_tags() it is set before setting the tags
to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for
shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on
write in another thread can cause the new page to have stale tags.
Similarly, tag reading via ptrace() can read stale tags if the
PG_mte_tagged flag is set before actually clearing/restoring the tags.

Fix the PG_mte_tagged semantics so that it is only set after the tags
have been cleared or restored. This is safe for swap restoring into a
MAP_SHARED or CoW page since the core code takes the page lock. Add two
functions to test and set the PG_mte_tagged flag with acquire and
release semantics. The downside is that concurrent mprotect(PROT_MTE) on
a MAP_SHARED page may cause tag loss. This is already the case for KVM
guests if a VMM changes the page protection while the guest triggers a
user_mem_abort().
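
With this change the writer/reader pairing becomes (a sketch; tag_storage
stands for a buffer obtained via mte_allocate_tag_storage() or similar):

        /* writer: initialise the tags first, then publish the flag */
        mte_clear_page_tags(page_address(page));
        set_page_mte_tagged(page);      /* smp_wmb() before set_bit() */

        /* reader: only touch the tags once the flag is observed */
        if (page_mte_tagged(page))      /* test_bit() + smp_rmb() if set */
                mte_save_page_tags(page_address(page), tag_storage);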

Signed-off-by: Catalin Marinas 
[p...@google.com: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne 
Reviewed-by: Cornelia Huck 
Reviewed-by: Steven Price 
Cc: Will Deacon 
Cc: Marc Zyngier 
Cc: Peter Collingbourne 
---
 arch/arm64/include/asm/mte.h | 30 ++
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/cpufeature.c   |  4 +++-
 arch/arm64/kernel/elfcore.c  |  2 +-
 arch/arm64/kernel/hibernate.c|  2 +-
 arch/arm64/kernel/mte.c  | 12 +++-
 arch/arm64/kvm/guest.c   |  4 ++--
 arch/arm64/kvm/mmu.c |  4 ++--
 arch/arm64/mm/copypage.c |  5 +++--
 arch/arm64/mm/fault.c|  2 +-
 arch/arm64/mm/mteswap.c  |  2 +-
 11 files changed, 52 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index aa523591a44e..46618c575eac 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,6 +37,29 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
+static inline void set_page_mte_tagged(struct page *page)
+{
+   /*
+* Ensure that the tags written prior to this function are visible
+* before the page flags update.
+*/
+   smp_wmb();
+   set_bit(PG_mte_tagged, &page->flags);
+}
+
+static inline bool page_mte_tagged(struct page *page)
+{
+   bool ret = test_bit(PG_mte_tagged, &page->flags);
+
+   /*
+* If the page is tagged, ensure ordering with a likely subsequent
+* read of the tags.
+*/
+   if (ret)
+   smp_rmb();
+   return ret;
+}
+
 void mte_zero_clear_page_tags(void *addr);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
@@ -54,6 +77,13 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size);
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
+static inline void set_page_mte_tagged(struct page *page)
+{
+}
+static inline bool page_mte_tagged(struct page *page)
+{
+   return false;
+}
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 71a1af42f0e8..98b638441521 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1050,7 +1050,7 @@ static inline void arch_swap_invalidate_area(int type)
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
 if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
-   set_bit(PG_mte_tagged, &folio->flags);
+   set_page_mte_tagged(&folio->page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 5d0527ba0804..ab3312788d60 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2049,8 +2049,10 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
 * Clear the tags in the zero page. This needs to be done via the
 * linear map which has the Tagged attribute.
 */
-   if (!test_and_set_bit(PG_mte_tagged, &ZERO_PAGE(0)->flags))
+   if (!page_mte_tagged(ZERO_PAGE(0))) {
mte_clear_page_tags(lm_alias(empty_zero_page));
+   set_page_mte_tagged(ZERO_PAGE(0));
+   }
 
kasan_init_hw_tags_cpu();
 }
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 27ef7ad3ffd2..353009d7f307 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -47,7 +47,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
 * Pages mapped in user space as !pte_access_permitted() (e.g.
 * PROT_EXEC only) may not have the PG_mte_tagged flag set.
  

[PATCH v4 1/8] mm: Do not enable PG_arch_2 for all 64-bit architectures

2022-09-20 Thread Peter Collingbourne
From: Catalin Marinas 

Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new
page flag for all 64-bit architectures. However, even if an architecture
is 64-bit, it may still have limited spare bits in the 'flags' member of
'struct page'. This may happen if an architecture enables SPARSEMEM
without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch.
This architecture port needs 19 more bits for the sparsemem section
information and, while it is currently fine with PG_arch_2, adding any
more PG_arch_* flags will trigger build-time warnings.

Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by
architectures that need more PG_arch_* flags beyond PG_arch_1. Select it
on arm64.

Signed-off-by: Catalin Marinas 
Signed-off-by: Peter Collingbourne 
Reported-by: kernel test robot 
Cc: Andrew Morton 
Cc: Steven Price 
---
 arch/arm64/Kconfig | 1 +
 fs/proc/page.c | 2 +-
 include/linux/page-flags.h | 2 +-
 include/trace/events/mmflags.h | 8 
 mm/Kconfig | 8 
 mm/huge_memory.c   | 2 +-
 6 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f6737d2f37b2..f2435b62e0ba 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1948,6 +1948,7 @@ config ARM64_MTE
depends on ARM64_PAN
select ARCH_HAS_SUBPAGE_FAULTS
select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_USES_PG_ARCH_X
help
  Memory Tagging (part of the ARMv8.5 Extensions) provides
  architectural support for run-time, always-on detection of
diff --git a/fs/proc/page.c b/fs/proc/page.c
index a2873a617ae8..6f4b4bcb9b0d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -218,7 +218,7 @@ u64 stable_page_flags(struct page *page)
u |= kpf_copy_bit(k, KPF_PRIVATE_2, PG_private_2);
u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE, PG_owner_priv_1);
u |= kpf_copy_bit(k, KPF_ARCH,  PG_arch_1);
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
u |= kpf_copy_bit(k, KPF_ARCH_2,PG_arch_2);
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..5dc7977edf9d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -132,7 +132,7 @@ enum pageflags {
PG_young,
PG_idle,
 #endif
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
PG_arch_2,
 #endif
 #ifdef CONFIG_KASAN_HW_TAGS
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 11524cda4a95..4673e58a7626 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -90,10 +90,10 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
-#ifdef CONFIG_64BIT
-#define IF_HAVE_PG_ARCH_2(flag,string) ,{1UL << flag, string}
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
+#define IF_HAVE_PG_ARCH_X(flag,string) ,{1UL << flag, string}
 #else
-#define IF_HAVE_PG_ARCH_2(flag,string)
+#define IF_HAVE_PG_ARCH_X(flag,string)
 #endif
 
 #ifdef CONFIG_KASAN_HW_TAGS
@@ -129,7 +129,7 @@ IF_HAVE_PG_UNCACHED(PG_uncached,"uncached"  )   \
 IF_HAVE_PG_HWPOISON(PG_hwpoison,   "hwpoison"  )   \
 IF_HAVE_PG_IDLE(PG_young,  "young" )   \
 IF_HAVE_PG_IDLE(PG_idle,   "idle"  )   \
-IF_HAVE_PG_ARCH_2(PG_arch_2,   "arch_2")   \
+IF_HAVE_PG_ARCH_X(PG_arch_2,   "arch_2")   \
 IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 
 #define show_page_flags(flags) \
diff --git a/mm/Kconfig b/mm/Kconfig
index ceec438c0741..a976cbb07bd6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -999,6 +999,14 @@ config ARCH_USES_HIGH_VMA_FLAGS
 config ARCH_HAS_PKEYS
bool
 
+config ARCH_USES_PG_ARCH_X
+   bool
+   help
+ Enable the definition of PG_arch_x page flags with x > 1. Only
+ suitable for 64-bit architectures with CONFIG_FLATMEM or
+ CONFIG_SPARSEMEM_VMEMMAP enabled, otherwise there may not be
+ enough room for additional bits in page->flags.
+
 config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EXPERT
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..24974a4ce28f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2444,7 +2444,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 (1L << PG_workingset) |
 (1L << PG_locked) |
 (1L << PG_unevictable) |
-#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_USES_PG_ARCH_X
 (1L << PG_arch_2) |
 #endif
 (1L << PG_dirty) |
-- 
2.37.3.968.ga6b4b080e4-goog


[PATCH v4 0/8] KVM: arm64: permit MAP_SHARED mappings with MTE enabled

2022-09-20 Thread Peter Collingbourne
Hi,

This patch series allows VMMs to use shared mappings in MTE enabled
guests. The first five patches were taken from Catalin's tree [1] which
addressed some review feedback from when they were previously sent out
as v3 of this series. The first patch from Catalin's tree makes room
for an additional PG_arch_3 flag by making the newer PG_arch_* flags
arch-dependent. The next four patches are based on a series that
Catalin sent out prior to v3, whose cover letter [2] I quote from below:

> This series aims to fix the races between initialising the tags on a
> page and setting the PG_mte_tagged flag. Currently the flag is set
> either before or after that tag initialisation and this can lead to CoW
> copying stale tags. The first patch moves the flag setting after the
> tags have been initialised, solving the CoW issue. However, concurrent
> mprotect() on a shared mapping may (very rarely) lead to valid tags
> being zeroed.
>
> The second skips the sanitise_mte_tags() call in kvm_set_spte_gfn(),
> deferring it to user_mem_abort(). The outcome is that now
> sanitise_mte_tags() can be simplified to skip the pfn_to_online_page()
> check and only rely on VM_MTE_ALLOWED vma flag that can be checked in
> user_mem_abort().
>
> The third and fourth patches use PG_arch_3 as a lock for page tagging,
> based on Peter Collingbourne's idea of a two-bit lock.
>
> I think the first patch can be queued but the rest needs some in-depth
> review and testing. With this series (if correct) we could allow MAP_SHARED
> on KVM guest memory but this is to be discussed separately as there are
> some KVM ABI implications.

I rebased Catalin's tree onto -next and added the proposed userspace
enablement patches after the series. I've tested it on QEMU as well as
on MTE-capable hardware by booting a Linux kernel and userspace under
a crosvm with MTE support [3].

[1] git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/mte-pg-flags
[2] 
https://lore.kernel.org/all/20220705142619.4135905-1-catalin.mari...@arm.com/
[3] https://chromium-review.googlesource.com/c/crosvm/crosvm/+/3892141

Catalin Marinas (4):
  mm: Do not enable PG_arch_2 for all 64-bit architectures
  arm64: mte: Fix/clarify the PG_mte_tagged semantics
  KVM: arm64: Simplify the sanitise_mte_tags() logic
  arm64: mte: Lock a page for MTE tag initialisation

Peter Collingbourne (4):
  mm: Add PG_arch_3 page flag
  KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled
  KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled
  Documentation: document the ABI changes for KVM_CAP_ARM_MTE

 Documentation/virt/kvm/api.rst|  5 ++-
 arch/arm64/Kconfig|  1 +
 arch/arm64/include/asm/mte.h  | 65 ++-
 arch/arm64/include/asm/pgtable.h  |  4 +-
 arch/arm64/kernel/cpufeature.c|  4 +-
 arch/arm64/kernel/elfcore.c   |  2 +-
 arch/arm64/kernel/hibernate.c |  2 +-
 arch/arm64/kernel/mte.c   | 16 
 arch/arm64/kvm/guest.c| 18 +
 arch/arm64/kvm/mmu.c  | 55 +++---
 arch/arm64/mm/copypage.c  |  7 +++-
 arch/arm64/mm/fault.c |  4 +-
 arch/arm64/mm/mteswap.c   | 13 ---
 fs/proc/page.c|  3 +-
 include/linux/kernel-page-flags.h |  1 +
 include/linux/page-flags.h|  3 +-
 include/trace/events/mmflags.h|  9 +++--
 mm/Kconfig|  8 
 mm/huge_memory.c  |  3 +-
 19 files changed, 152 insertions(+), 71 deletions(-)

-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 12/12] KVM: remove KVM_REQ_UNHALT

2022-09-20 Thread Sean Christopherson
From: Paolo Bonzini 

KVM_REQ_UNHALT is now unnecessary because it is replaced by the return
value of kvm_vcpu_block/kvm_vcpu_halt.  Remove it.

No functional change intended.
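
The replacement pattern on the arch side is simply (a sketch, mirroring the
mips hunk in patch 11 of this series):

        kvm_vcpu_halt(vcpu);

        if (kvm_arch_vcpu_runnable(vcpu))
                vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;

i.e. instead of testing KVM_REQ_UNHALT after halting, the architecture asks
directly whether the vCPU became runnable.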

Signed-off-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/vcpu-requests.rst | 28 +---
 arch/arm64/kvm/arm.c |  1 -
 arch/mips/kvm/emulate.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c |  1 -
 arch/powerpc/kvm/book3s_pr_papr.c|  1 -
 arch/powerpc/kvm/booke.c |  1 -
 arch/powerpc/kvm/powerpc.c   |  1 -
 arch/riscv/kvm/vcpu_insn.c   |  1 -
 arch/s390/kvm/kvm-s390.c |  2 --
 arch/x86/kvm/x86.c   |  3 ---
 arch/x86/kvm/xen.c   |  1 -
 include/linux/kvm_host.h |  3 +--
 virt/kvm/kvm_main.c  |  4 +---
 13 files changed, 3 insertions(+), 45 deletions(-)

diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index 31f62b64e07b..87f04c1fa53d 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -97,7 +97,7 @@ VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
 This means general bitops, like those documented in [atomic-ops]_ could
 also be used, e.g. ::
 
-  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
+  clear_bit(KVM_REQ_UNBLOCK & KVM_REQUEST_MASK, &vcpu->requests);
 
 However, VCPU request users should refrain from doing so, as it would
 break the abstraction.  The first 8 bits are reserved for architecture
@@ -126,17 +126,6 @@ KVM_REQ_UNBLOCK
   or in order to update the interrupt routing and ensure that assigned
   devices will wake up the vCPU.
 
-KVM_REQ_UNHALT
-
-  This request may be made from the KVM common function kvm_vcpu_block(),
-  which is used to emulate an instruction that causes a CPU to halt until
-  one of an architectural specific set of events and/or interrupts is
-  received (determined by checking kvm_arch_vcpu_runnable()).  When that
-  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
-  in contrast to when kvm_vcpu_block() returns due to any other reason,
-  such as a pending signal, which does not indicate the VCPU's halt
-  emulation should stop, and therefore does not make the request.
-
 KVM_REQ_OUTSIDE_GUEST_MODE
 
   This "request" ensures the target vCPU has exited guest mode prior to the
@@ -297,21 +286,6 @@ architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
 to check if it should awaken.  One reason to do so is to provide
 architectures a function where requests may be checked if necessary.
 
-Clearing Requests
------------------
-
-Generally it only makes sense for the receiving VCPU thread to clear a
-request.  However, in some circumstances, such as when the requesting
-thread and the receiving VCPU thread are executed serially, such as when
-they are the same thread, or when they are using some form of concurrency
-control to temporarily execute synchronously, then it's possible to know
-that the request may be cleared immediately, rather than waiting for the
-receiving VCPU thread to handle the request in VCPU RUN.  The only current
-examples of this are kvm_vcpu_block() calls made by VCPUs to block
-themselves.  A possible side-effect of that call is to make the
-KVM_REQ_UNHALT request, which may then be cleared immediately when the
-VCPU returns from the call.
-
 References
 ==========
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2ff0ef62abad..4f949b64fdc9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -666,7 +666,6 @@ void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
 
kvm_vcpu_halt(vcpu);
vcpu_clear_flag(vcpu, IN_WFIT);
-   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 
preempt_disable();
vgic_v4_load(vcpu);
diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index 1d7c56defe93..edaec93a1a1f 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -958,7 +958,6 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
 * We are runnable, then definitely go off to user space to
 * check if any I/O interrupts are pending.
 */
-   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
if (kvm_arch_vcpu_runnable(vcpu))
vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
}
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index d6abed6e51e6..9fc4dd8f66eb 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -499,7 +499,6 @@ static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
if (msr & MSR_POW) {
if (!vcpu->arch.pending_exceptions) {
kvm_vcpu_halt(vcpu);
-   kvm_clear_request(KVM_REQ_UNHALT, 

[PATCH v4 09/12] KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events

2022-09-20 Thread Sean Christopherson
Don't snapshot pending INIT/SIPI events prior to checking nested events;
architecturally there's nothing wrong with KVM processing (dropping) a
SIPI that is received immediately after synthesizing a VM-Exit.  Taking
and consuming the snapshot makes the flow way more subtle than it needs
to be, e.g. nVMX consumes/clears events that trigger VM-Exit (INIT/SIPI),
and so at first glance it appears that KVM is double-dipping on pending
INITs and SIPIs.  But that's not the case because INIT is blocked
unconditionally in VMX root mode and the CPU cannot be in wait-for-SIPI after
VM-Exit, i.e. the paths that truly consume the snapshot are unreachable
if apic->pending_events is modified by kvm_check_nested_events().

nSVM is a similar story as GIF is cleared by the CPU on VM-Exit; INIT is
blocked regardless of whether or not it was pending prior to VM-Exit.

Drop the snapshot logic so that a future fix doesn't create weirdness
when kvm_vcpu_running()'s call to kvm_check_nested_events() is moved to
vcpu_block().  In that case, kvm_check_nested_events() will be called
immediately before kvm_apic_accept_events(), which raises the obvious
question of why that change doesn't break the snapshot logic.

Note, there is a subtle functional change.  Previously, KVM would clear
pending SIPIs if and only if SIPI was pending prior to VM-Exit, whereas now
KVM clears pending SIPI unconditionally if INIT+SIPI are blocked.  The
latter is architecturally allowed, as SIPI is ignored if the CPU is not
in wait-for-SIPI mode (arguably, KVM should be even more aggressive in
dropping SIPIs).  It is software's responsibility to ensure the SIPI is
delivered, i.e. software shouldn't be firing INIT-SIPI at a CPU until
it knows with 100% certainty that the target CPU isn't in VMX root mode.

Furthermore, the existing code is extra weird as SIPIs that arrive after
VM-Exit _are_ dropped if there also happened to be a pending SIPI before
VM-Exit.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/lapic.c | 36 ++--
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2bd90effc653..d7639d126e6c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3025,17 +3025,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
struct kvm_lapic *apic = vcpu->arch.apic;
u8 sipi_vector;
int r;
-   unsigned long pe;
 
-   if (!lapic_in_kernel(vcpu))
-   return 0;
-
-   /*
-* Read pending events before calling the check_events
-* callback.
-*/
-   pe = smp_load_acquire(&apic->pending_events);
-   if (!pe)
+   if (!kvm_apic_has_pending_init_or_sipi(vcpu))
return 0;
 
if (is_guest_mode(vcpu)) {
@@ -3043,38 +3034,31 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
if (r < 0)
return r == -EBUSY ? 0 : r;
/*
-* If an event has happened and caused a vmexit,
-* we know INITs are latched and therefore
-* we will not incorrectly deliver an APIC
-* event instead of a vmexit.
+* Continue processing INIT/SIPI even if a nested VM-Exit
+* occurred, e.g. pending SIPIs should be dropped if INIT+SIPI
+* are blocked as a result of transitioning to VMX root mode.
 */
}
 
/*
-* INITs are blocked while CPU is in specific states
-* (SMM, VMX root mode, SVM with GIF=0).
-* Because a CPU cannot be in these states immediately
-* after it has processed an INIT signal (and thus in
-* KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
-* and leave the INIT pending.
+* INITs are blocked while CPU is in specific states (SMM, VMX root
+* mode, SVM with GIF=0), while SIPIs are dropped if the CPU isn't in
+* wait-for-SIPI (WFS).
 */
if (!kvm_apic_init_sipi_allowed(vcpu)) {
WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
-   if (test_bit(KVM_APIC_SIPI, &pe))
-   clear_bit(KVM_APIC_SIPI, &apic->pending_events);
+   clear_bit(KVM_APIC_SIPI, &apic->pending_events);
return 0;
}
 
-   if (test_bit(KVM_APIC_INIT, &pe)) {
-   clear_bit(KVM_APIC_INIT, &apic->pending_events);
+   if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
kvm_vcpu_reset(vcpu, true);
if (kvm_vcpu_is_bsp(apic->vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
else
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
}
-   if (test_bit(KVM_APIC_SIPI, &pe)) {
-   clear_bit(KVM_APIC_SIPI, &apic->pending_events);
+   if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
if (vcpu->arch.mp_state == 

[PATCH v4 11/12] KVM: mips, x86: do not rely on KVM_REQ_UNHALT

2022-09-20 Thread Sean Christopherson
From: Paolo Bonzini 

KVM_REQ_UNHALT is a weird request that simply reports the value of
kvm_arch_vcpu_runnable() on exit from kvm_vcpu_halt().  Only
MIPS and x86 are looking at it, the others just clear it.  Check
the state of the vCPU directly so that the request is handled
as a nop on all architectures.

No functional change intended, except for corner cases where an
event arrives immediately after a signal becomes pending or after
another similar host-side event.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/mips/kvm/emulate.c | 7 +++
 arch/x86/kvm/x86.c  | 9 -
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index b494d8d39290..1d7c56defe93 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -955,13 +955,12 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
kvm_vcpu_halt(vcpu);
 
/*
-* We we are runnable, then definitely go off to user space to
+* We are runnable, then definitely go off to user space to
 * check if any I/O interrupts are pending.
 */
-   if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
-   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+   if (kvm_arch_vcpu_runnable(vcpu))
vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
-   }
}
 
return EMULATE_DONE;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8aeacbc2bff9..8b1ff7e91ecb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10811,7 +10811,14 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
if (hv_timer)
kvm_lapic_switch_to_hv_timer(vcpu);
 
-   if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
+   kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+
+   /*
+* If the vCPU is not runnable, a signal or another host event
+* of some kind is pending; service it without changing the
+* vCPU's activity state.
+*/
+   if (!kvm_arch_vcpu_runnable(vcpu))
return 1;
}
 
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block()

2022-09-20 Thread Sean Christopherson
From: Paolo Bonzini 

kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
it cannot sleep.  Writing to guest memory is therefore forbidden, but it
can happen on AMD processors if kvm_check_nested_events() causes a vmexit.

Fortunately, all events that are caught by kvm_check_nested_events() are
also recognized by kvm_vcpu_has_events() through vendor callbacks such as
kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
remove the call and postpone the actual processing to vcpu_block().

Opportunistically honor the return of kvm_check_nested_events().  KVM
punted on the check in kvm_vcpu_running() because the only error path is
if vmx_complete_nested_posted_interrupt() fails, in which case KVM exits
to userspace with "internal error" i.e. the VM is likely dead anyways so
it wasn't worth overloading the return of kvm_vcpu_running().

Add the check mostly so that KVM is consistent with itself; the return of
the call via kvm_apic_accept_events()=>kvm_check_nested_events() that
immediately follows  _is_ checked.

Reported-by: Maxim Levitsky 
Signed-off-by: Paolo Bonzini 
[sean: check and handle return of kvm_check_nested_events()]
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcc675d4e44b..8aeacbc2bff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10815,6 +10815,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
return 1;
}
 
+   /*
+* Evaluate nested events before exiting the halted state.  This allows
+* the halt state to be recorded properly in the VMCS12's activity
+* state field (AMD does not have a similar field and a VM-Exit always
+* causes a spurious wakeup from HLT).
+*/
+   if (is_guest_mode(vcpu)) {
+   if (kvm_check_nested_events(vcpu) < 0)
+   return 0;
+   }
+
if (kvm_apic_accept_events(vcpu) < 0)
return 0;
switch(vcpu->arch.mp_state) {
@@ -10837,9 +10848,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 
 static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
-   if (is_guest_mode(vcpu))
-   kvm_check_nested_events(vcpu);
-
return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted);
 }
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 08/12] KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending

2022-09-20 Thread Sean Christopherson
Explicitly check for a pending INIT/SIPI event when emulating VMXOFF
instead of blindly making an event request.  There's obviously no need
to evaluate events if none are pending.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/nested.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 5922531f6c52..8f67a9c4a287 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5193,8 +5193,8 @@ static int handle_vmxoff(struct kvm_vcpu *vcpu)
 
free_nested(vcpu);
 
-   /* Process a latched INIT during time CPU was in VMX operation */
-   kvm_make_request(KVM_REQ_EVENT, vcpu);
+   if (kvm_apic_has_pending_init_or_sipi(vcpu))
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
 
return nested_vmx_succeed(vcpu);
 }
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 07/12] KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter

2022-09-20 Thread Sean Christopherson
Evaluate interrupts, i.e. set KVM_REQ_EVENT, if INIT or SIPI is pending
when emulating nested VM-Enter.  INIT is blocked while the CPU is in VMX
root mode, but not in VMX non-root, i.e. becomes unblocked on VM-Enter.
This bug has been masked by KVM calling ->check_nested_events() in the
core run loop, but that hack will be fixed in the near future.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/nested.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3a080051a4ec..5922531f6c52 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3377,6 +3377,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
(CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
+   if (!evaluate_pending_interrupts)
+   evaluate_pending_interrupts |= kvm_apic_has_pending_init_or_sipi(vcpu);
 
if (!vmx->nested.nested_run_pending ||
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
@@ -3457,18 +3459,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
}
 
/*
-* If L1 had a pending IRQ/NMI until it executed
-* VMLAUNCH/VMRESUME which wasn't delivered because it was
-* disallowed (e.g. interrupts disabled), L0 needs to
-* evaluate if this pending event should cause an exit from L2
-* to L1 or delivered directly to L2 (e.g. In case L1 don't
-* intercept EXTERNAL_INTERRUPT).
-*
-* Usually this would be handled by the processor noticing an
-* IRQ/NMI window request, or checking RVI during evaluation of
-* pending virtual interrupts.  However, this setting was done
-* on VMCS01 and now VMCS02 is active instead. Thus, we force L0
-* to perform pending event evaluation by requesting a KVM_REQ_EVENT.
+* Re-evaluate pending events if L1 had a pending IRQ/NMI/INIT/SIPI
+* when it executed VMLAUNCH/VMRESUME, as entering non-root mode can
+* effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
+* unconditionally.
 */
if (unlikely(evaluate_pending_interrupts))
kvm_make_request(KVM_REQ_EVENT, vcpu);
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 06/12] KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set

2022-09-20 Thread Sean Christopherson
Set KVM_REQ_EVENT if INIT or SIPI is pending when the guest enables GIF.
INIT in particular is blocked when GIF=0 and needs to be processed when
GIF is toggled to '1'.  This bug has been masked by (a) KVM calling
->check_nested_events() in the core run loop and (b) hypervisors toggling
GIF from 0=>1 only when entering guest mode (L1 entering L2).

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd599afc85f5..58f0077d9357 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2339,7 +2339,8 @@ void svm_set_gif(struct vcpu_svm *svm, bool value)
enable_gif(svm);
if (svm->vcpu.arch.smi_pending ||
svm->vcpu.arch.nmi_pending ||
-   kvm_cpu_has_injectable_intr(&svm->vcpu))
+   kvm_cpu_has_injectable_intr(&svm->vcpu) ||
+   kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
} else {
disable_gif(svm);
-- 
2.37.3.968.ga6b4b080e4-goog



[PATCH v4 03/12] KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed

2022-09-20 Thread Sean Christopherson
Rename and invert kvm_vcpu_latch_init() to kvm_apic_init_sipi_allowed()
so as to match the behavior of {interrupt,nmi,smi}_allowed(), and expose
the helper so that it can be used by kvm_vcpu_has_events() to determine
whether or not an INIT or SIPI is pending _and_ can be taken immediately.

Opportunistically replace usage of the "latch" terminology with "blocked"
and/or "allowed", again to align with KVM's terminology used for all other
event types.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/lapic.c | 4 ++--
 arch/x86/kvm/lapic.h | 7 +++
 arch/x86/kvm/x86.c   | 9 +
 arch/x86/kvm/x86.h   | 5 -
 4 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9dda989a1cf0..2bd90effc653 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3051,14 +3051,14 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
}
 
/*
-* INITs are latched while CPU is in specific states
+* INITs are blocked while CPU is in specific states
 * (SMM, VMX root mode, SVM with GIF=0).
 * Because a CPU cannot be in these states immediately
 * after it has processed an INIT signal (and thus in
 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
 * and leave the INIT pending.
 */
-   if (kvm_vcpu_latch_init(vcpu)) {
+   if (!kvm_apic_init_sipi_allowed(vcpu)) {
WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
	if (test_bit(KVM_APIC_SIPI, &pe))
	clear_bit(KVM_APIC_SIPI, &apic->pending_events);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 117a46df5cc1..c3ce6b0b1ea3 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -7,6 +7,7 @@
 #include <linux/kvm_host.h>
 
 #include "hyperv.h"
+#include "kvm_cache_regs.h"
 
 #define KVM_APIC_INIT  0
 #define KVM_APIC_SIPI  1
@@ -228,6 +229,12 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
return lapic_in_kernel(vcpu) && vcpu->arch.apic->pending_events;
 }
 
+static inline bool kvm_apic_init_sipi_allowed(struct kvm_vcpu *vcpu)
+{
+   return !is_smm(vcpu) &&
+  !static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
+}
+
 static inline bool kvm_lowest_prio_delivery(struct kvm_lapic_irq *irq)
 {
return (irq->delivery_mode == APIC_DM_LOWEST ||
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1a25e46dbf7..59be7b16b92f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11293,11 +11293,12 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
goto out;
 
/*
-* KVM_MP_STATE_INIT_RECEIVED means the processor is in
-* INIT state; latched init should be reported using
-* KVM_SET_VCPU_EVENTS, so reject it here.
+* Pending INITs are reported using KVM_SET_VCPU_EVENTS, disallow
+* forcing the guest into INIT/SIPI if those events are supposed to be
+* blocked.  KVM prioritizes SMI over INIT, so reject INIT/SIPI state
+* if an SMI is pending as well.
 */
-   if ((kvm_vcpu_latch_init(vcpu) || vcpu->arch.smi_pending) &&
+   if ((!kvm_apic_init_sipi_allowed(vcpu) || vcpu->arch.smi_pending) &&
(mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
 mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
goto out;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a784ff90740b..829d3134c1eb 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -275,11 +275,6 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
return !(kvm->arch.disabled_quirks & quirk);
 }
 
-static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu)
-{
-   return is_smm(vcpu) || 
static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
-}
-
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int 
inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
-- 
2.37.3.968.ga6b4b080e4-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v4 05/12] KVM: x86: lapic does not have to process INIT if it is blocked

2022-09-20 Thread Sean Christopherson
From: Paolo Bonzini 

Do not return true from kvm_vcpu_has_events() if the vCPU isn't going to
immediately process a pending INIT/SIPI.  INIT/SIPI shouldn't be treated
as wake events if they are blocked.

Signed-off-by: Paolo Bonzini 
[sean: rebase onto refactored INIT/SIPI helpers, massage changelog]
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16a24dd28f26..dcc675d4e44b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12765,7 +12765,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
if (!list_empty_careful(&vcpu->async_pf.done))
return true;
 
-   if (kvm_apic_has_pending_init_or_sipi(vcpu))
+   if (kvm_apic_has_pending_init_or_sipi(vcpu) &&
+   kvm_apic_init_sipi_allowed(vcpu))
return true;
 
if (vcpu->arch.pv.pv_unhalted)
-- 
2.37.3.968.ga6b4b080e4-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v4 04/12] KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific

2022-09-20 Thread Sean Christopherson
Rename kvm_apic_has_events() to kvm_apic_has_pending_init_or_sipi() so
that it's more obvious that "events" really just means "INIT or SIPI".

Opportunistically clean up a weirdly worded comment that referenced
kvm_apic_has_events() instead of kvm_apic_accept_events().

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/lapic.h | 2 +-
 arch/x86/kvm/x86.c   | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c3ce6b0b1ea3..a5ac4a5a5179 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -224,7 +224,7 @@ static inline bool kvm_vcpu_apicv_active(struct kvm_vcpu 
*vcpu)
return lapic_in_kernel(vcpu) && vcpu->arch.apic->apicv_active;
 }
 
-static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
+static inline bool kvm_apic_has_pending_init_or_sipi(struct kvm_vcpu *vcpu)
 {
return lapic_in_kernel(vcpu) && vcpu->arch.apic->pending_events;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59be7b16b92f..16a24dd28f26 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11920,8 +11920,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
 
/*
-* To avoid have the INIT path from kvm_apic_has_events() that 
be
-* called with loaded FPU and does not let userspace fix the 
state.
+* All paths that lead to INIT are required to load the guest's
+* FPU state (because most paths are buried in KVM_RUN).
 */
if (init_event)
kvm_put_guest_fpu(vcpu);
@@ -12765,7 +12765,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
if (!list_empty_careful(&vcpu->async_pf.done))
return true;
 
-   if (kvm_apic_has_events(vcpu))
+   if (kvm_apic_has_pending_init_or_sipi(vcpu))
return true;
 
if (vcpu->arch.pv.pv_unhalted)
-- 
2.37.3.968.ga6b4b080e4-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v4 02/12] KVM: nVMX: Make an event request when pending an MTF nested VM-Exit

2022-09-20 Thread Sean Christopherson
Set KVM_REQ_EVENT when MTF becomes pending to ensure that KVM will run
through inject_pending_event() and thus vmx_check_nested_events() prior
to re-entering the guest.

MTF currently works by virtue of KVM's hack that calls
kvm_check_nested_events() from kvm_vcpu_running(), but that hack will
be removed in the near future.  Until that call is removed, the patch
introduces no real functional change.

Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction 
emulation")
Cc: sta...@vger.kernel.org
Reviewed-by: Maxim Levitsky 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/nested.c | 3 +++
 arch/x86/kvm/vmx/vmx.c| 6 --
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 85318d803f4f..3a080051a4ec 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6632,6 +6632,9 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
if (ret)
goto error_guest_mode;
 
+   if (vmx->nested.mtf_pending)
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+
return 0;
 
 error_guest_mode:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 94c314dc2393..9dba04b6b019 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1665,10 +1665,12 @@ static void vmx_update_emulated_instruction(struct 
kvm_vcpu *vcpu)
(!vcpu->arch.exception.pending ||
 vcpu->arch.exception.vector == DB_VECTOR) &&
(!vcpu->arch.exception_vmexit.pending ||
-vcpu->arch.exception_vmexit.vector == DB_VECTOR))
+vcpu->arch.exception_vmexit.vector == DB_VECTOR)) {
vmx->nested.mtf_pending = true;
-   else
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+   } else {
vmx->nested.mtf_pending = false;
+   }
 }
 
 static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
-- 
2.37.3.968.ga6b4b080e4-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v4 00/12] KVM: x86: never write to memory from kvm_vcpu_check_block

2022-09-20 Thread Sean Christopherson
Non-x86 folks, there's nothing interesting to see here, y'all got pulled
in because removing KVM_REQ_UNHALT requires deleting kvm_clear_request()
from arch code.

Note, this is based on:

https://github.com/sean-jc/linux.git tags/kvm-x86-6.1-1

to pre-resolve conflicts with the event/exception cleanups in there.


In Paolo's words...

The following backtrace:

[ 1355.807187]  kvm_vcpu_map+0x159/0x190 [kvm]
[ 1355.807628]  nested_svm_vmexit+0x4c/0x7f0 [kvm_amd]
[ 1355.808036]  ? kvm_vcpu_block+0x54/0xa0 [kvm]
[ 1355.808450]  svm_check_nested_events+0x97/0x390 [kvm_amd]
[ 1355.808920]  kvm_check_nested_events+0x1c/0x40 [kvm] 
[ 1355.809396]  kvm_arch_vcpu_runnable+0x4e/0x190 [kvm]
[ 1355.809892]  kvm_vcpu_check_block+0x4f/0x100 [kvm]
[ 1355.811259]  kvm_vcpu_block+0x6b/0xa0 [kvm] 

can occur due to kmap being called in non-sleepable (!TASK_RUNNING) context.
The fix is to extend kvm_x86_ops->nested_ops.hv_timer_pending() to cover
all events not already checked in kvm_arch_vcpu_is_runnable(), and then
get rid of the annoying (and wrong) call to kvm_check_nested_events()
from kvm_vcpu_check_block().

Beware, this is not a complete fix, because kvm_guest_apic_has_interrupt()
might still _read_ memory from non-sleepable context.  The fix here is
probably to make kvm_arch_vcpu_is_runnable() return -EAGAIN, and in that
case do a round of kvm_vcpu_check_block() polling in sleepable context.
Nevertheless, it is a good start as it pushes the vmexit into vcpu_block().

The series also does a small cleanup pass on kvm_vcpu_check_block(),
removing KVM_REQ_UNHALT in favor of simply calling kvm_arch_vcpu_runnable()
again.  Now that kvm_check_nested_events() is not called anymore by
kvm_arch_vcpu_runnable(), it is much easier to see that KVM will never
consume the event that caused kvm_vcpu_has_events() to return true,
and therefore it is safe to evaluate it again.

The alternative of propagating the return value of
kvm_arch_vcpu_runnable() up to kvm_vcpu_{block,halt}() is inferior
because it does not quite get right the edge cases where the vCPU becomes
runnable right before schedule() or right after kvm_vcpu_check_block().
While these edge cases are unlikely to truly matter in practice, it is
also pointless to get them "wrong".

v4:
  - Make event request if INIT/SIPI is pending when GIF=>1 (SVM) and
on nested VM-Enter (VMX).
  - Make an event request at VMXOFF iff it's necessary.
  - Keep the INIT/SIPI pending vs. blocked checks separate (for the
above nSVM/nVMX fixes).
  - Check the result of kvm_check_nested_events() in vcpu_block().
  - Rename INIT/SIPI helpers (hopefully we'll eventually rename all of
the related collateral, e.g. "pending_events" is so misleading).
  - Drop pending INIT/SIPI snapshot to avoid creating weird, conflicting
code when kvm_check_nested_events() is called by vcpu_block().

v3:
  - https://lore.kernel.org/all/20220822170659.2527086-1-pbonz...@redhat.com
  - do not propagate the return value of  kvm_arch_vcpu_runnable() up to
kvm_vcpu_{block,halt}()
  - move and reformat the comment in vcpu_block()

  - move KVM_REQ_UNHALT removal last

Paolo Bonzini (5):
  KVM: x86: make vendor code check for all nested events
  KVM: x86: lapic does not have to process INIT if it is blocked
  KVM: x86: never write to memory from kvm_vcpu_check_block()
  KVM: mips, x86: do not rely on KVM_REQ_UNHALT
  KVM: remove KVM_REQ_UNHALT

Sean Christopherson (7):
  KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
  KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
  KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
  KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is
set
  KVM: nVMX: Make an event request if INIT or SIPI is pending on
VM-Enter
  KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending
  KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested
events

 Documentation/virt/kvm/vcpu-requests.rst | 28 +--
 arch/arm64/kvm/arm.c |  1 -
 arch/mips/kvm/emulate.c  |  6 ++--
 arch/powerpc/kvm/book3s_pr.c |  1 -
 arch/powerpc/kvm/book3s_pr_papr.c|  1 -
 arch/powerpc/kvm/booke.c |  1 -
 arch/powerpc/kvm/powerpc.c   |  1 -
 arch/riscv/kvm/vcpu_insn.c   |  1 -
 arch/s390/kvm/kvm-s390.c |  2 --
 arch/x86/include/asm/kvm_host.h  |  2 +-
 arch/x86/kvm/lapic.c | 38 ++--
 arch/x86/kvm/lapic.h |  9 -
 arch/x86/kvm/svm/svm.c   |  3 +-
 arch/x86/kvm/vmx/nested.c| 33 +
 arch/x86/kvm/vmx/vmx.c   |  6 ++--
 arch/x86/kvm/x86.c   | 46 +++-
 arch/x86/kvm/x86.h   |  5 ---
 arch/x86/kvm/xen.c   |  1 -
 include/linux/kvm_host.h |  3 +-
 virt/kvm/kvm_main.c 

[PATCH v4 01/12] KVM: x86: make vendor code check for all nested events

2022-09-20 Thread Sean Christopherson
From: Paolo Bonzini 

Interrupts, NMIs etc. sent while in guest mode are already handled
properly by the *_interrupt_allowed callbacks, but other events can
cause a vCPU to be runnable that are specific to guest mode.

In the case of VMX there are two, the preemption timer and the
monitor trap.  The VMX preemption timer is already special cased via
the hv_timer_pending callback, but the purpose of the callback can be
easily extended to MTF or in fact any other event that can occur only
in guest mode.

Rename the callback and add an MTF check; kvm_arch_vcpu_runnable()
now can return true if an MTF is pending, without relying on
kvm_vcpu_running()'s call to kvm_check_nested_events().  Until that call
is removed, however, the patch introduces no functional change.

Reviewed-by: Maxim Levitsky 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/vmx/nested.c   | 8 +++-
 arch/x86/kvm/x86.c  | 8 
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ee940c4c0f64..c03590d1c5e1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1649,7 +1649,7 @@ struct kvm_x86_nested_ops {
bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
u32 error_code);
int (*check_events)(struct kvm_vcpu *vcpu);
-   bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
+   bool (*has_events)(struct kvm_vcpu *vcpu);
void (*triple_fault)(struct kvm_vcpu *vcpu);
int (*get_state)(struct kvm_vcpu *vcpu,
 struct kvm_nested_state __user *user_kvm_nested_state,
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4da0558943ce..85318d803f4f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3929,6 +3929,12 @@ static bool nested_vmx_preemption_timer_pending(struct 
kvm_vcpu *vcpu)
   to_vmx(vcpu)->nested.preemption_timer_expired;
 }
 
+static bool vmx_has_nested_events(struct kvm_vcpu *vcpu)
+{
+   return nested_vmx_preemption_timer_pending(vcpu) ||
+  to_vmx(vcpu)->nested.mtf_pending;
+}
+
 /*
  * Per the Intel SDM's table "Priority Among Concurrent Events", with minor
  * edits to fill in missing examples, e.g. #DB due to split-lock accesses,
@@ -6971,7 +6977,7 @@ struct kvm_x86_nested_ops vmx_nested_ops = {
.leave_nested = vmx_leave_nested,
.is_exception_vmexit = nested_vmx_is_exception_vmexit,
.check_events = vmx_check_nested_events,
-   .hv_timer_pending = nested_vmx_preemption_timer_pending,
+   .has_events = vmx_has_nested_events,
.triple_fault = nested_vmx_triple_fault,
.get_state = vmx_get_nested_state,
.set_state = vmx_set_nested_state,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5b8328cb6c14..e1a25e46dbf7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9966,8 +9966,8 @@ static int kvm_check_and_inject_events(struct kvm_vcpu 
*vcpu,
}
 
if (is_guest_mode(vcpu) &&
-   kvm_x86_ops.nested_ops->hv_timer_pending &&
-   kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
+   kvm_x86_ops.nested_ops->has_events &&
+   kvm_x86_ops.nested_ops->has_events(vcpu))
*req_immediate_exit = true;
 
WARN_ON(kvm_is_exception_pending(vcpu));
@@ -12792,8 +12792,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu 
*vcpu)
return true;
 
if (is_guest_mode(vcpu) &&
-   kvm_x86_ops.nested_ops->hv_timer_pending &&
-   kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
+   kvm_x86_ops.nested_ops->has_events &&
+   kvm_x86_ops.nested_ops->has_events(vcpu))
return true;
 
if (kvm_xen_has_pending_events(vcpu))
-- 
2.37.3.968.ga6b4b080e4-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH kernel v2] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

2022-09-20 Thread Alexey Kardashevskiy




On 21/09/2022 02:08, Marc Zyngier wrote:

On Tue, 20 Sep 2022 13:51:43 +0100,
Alexey Kardashevskiy  wrote:


When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.

This moved the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Nicholas Piggin 
[For KVM RISC-V]
Acked-by: Anup Patel 
---
Changes:
v2:
* removed ifdef for ARM64.


The same argument applies to both x86 and s390, which do select
HAVE_KVM_IRQFD from the KVM config option. Only power allows this
option to be selected depending on the underlying interrupt controller
emulation.

As for riscv and mips, they don't select HAVE_KVM_IRQFD, and this
isn't a user-selectable option. So why do they get patched at all?


Before the patch, the capability was advertised on those; with your
proposal it will stop. That is a change in behavior which some may say
requires a separate patch (one per platform). I am definitely
overthinking it, though, and you are right.



My conclusion is that:

- only power needs the #ifdefery in the arch-specific code
- arm64, s390 and x86 can use KVM_CAP_IRQFD_RESAMPLE without a #ifdef
- mips and riscv should be left alone


Fair enough, thanks for the review! I'll post v3 shortly.



Thanks,

M.



--
Alexey
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 1/7] KVM: arm64: Discard any SVE state when entering KVM guests

2022-09-20 Thread Mark Brown
On Tue, Sep 20, 2022 at 05:44:01PM +0100, Marc Zyngier wrote:
> Mark Brown  wrote:

> >  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
> >  {
> > BUG_ON(!current->mm);
> > -   BUG_ON(test_thread_flag(TIF_SVE));
> > +
> > +   fpsimd_kvm_prepare();
> 
> Why is this *before* the check against system_supports_fpsimd()? I
> don't think the architecture allows SVE without FP, for obvious
> reasons...

Good point, though now that I think about it I can't think of a
requirement for FP when implementing SME (there's certainly not
one for SVE).  There's no use for that hook now though.


signature.asc
Description: PGP signature
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to 1GB

2022-09-20 Thread Oliver Upton
Hey Ricardo,

On Tue, Sep 20, 2022 at 12:02:08PM -0700, Ricardo Koller wrote:
> On Tue, Sep 20, 2022 at 06:36:29PM +, Oliver Upton wrote:
> > Presently stage2_apply_range() works on a batch of memory addressed by a
> > stage 2 root table entry for the VM. Depending on the IPA limit of the
> > VM and PAGE_SIZE of the host, this could address a massive range of
> > memory. Some examples:
> > 
> >   4 level, 4K paging -> 512 GB batch size
> > 
> >   3 level, 64K paging -> 4TB batch size
> > 
> > Unsurprisingly, working on such a large range of memory can lead to soft
> > lockups. When running dirty_log_perf_test:
> > 
> >   ./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48
> > 
> >   watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703]
> >   Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd 
> > sha3_generic gq(O)
> >   CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G   O   
> > 6.0.0-smp-DEV #1
> >   pstate: 8049 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >   pc : dcache_clean_inval_poc+0x24/0x38
> >   lr : clean_dcache_guest_page+0x28/0x4c
> >   sp : 800021763990
> >   pmr_save: 00e0
> >   x29: 800021763990 x28: 0005 x27: 0de0
> >   x26: 0001 x25: 00400830b13bc77f x24: ad4f91ead9c0
> >   x23:  x22: 882ad9c8 x21: fffafa7bc000
> >   x20: ad4f9066ce50 x19: 0003 x18: ad4f92402000
> >   x17: 011b x16: 011b x15: 0124
> >   x14: 07ff8301d280 x13:  x12: 
> >   x11: 00010001 x10: fc00 x9 : ad4f9069e580
> >   x8 : 000c x7 :  x6 : 003f
> >   x5 : 07ffa2076980 x4 : 0001 x3 : 003f
> >   x2 : 0040 x1 : 0830313bd000 x0 : 0830313bcc40
> >   Call trace:
> >dcache_clean_inval_poc+0x24/0x38
> >stage2_unmap_walker+0x138/0x1ec
> >__kvm_pgtable_walk+0x130/0x1d4
> >__kvm_pgtable_walk+0x170/0x1d4
> >__kvm_pgtable_walk+0x170/0x1d4
> >__kvm_pgtable_walk+0x170/0x1d4
> >kvm_pgtable_stage2_unmap+0xc4/0xf8
> >kvm_arch_flush_shadow_memslot+0xa4/0x10c
> >kvm_set_memslot+0xb8/0x454
> >__kvm_set_memory_region+0x194/0x244
> >kvm_vm_ioctl_set_memory_region+0x58/0x7c
> >kvm_vm_ioctl+0x49c/0x560
> >__arm64_sys_ioctl+0x9c/0xd4
> >invoke_syscall+0x4c/0x124
> >el0_svc_common+0xc8/0x194
> >do_el0_svc+0x38/0xc0
> >el0_svc+0x2c/0xa4
> >el0t_64_sync_handler+0x84/0xf0
> >el0t_64_sync+0x1a0/0x1a4
> > 
> > Given the various paging configurations used by KVM at stage 2 there
> > isn't a sensible page table level to use as the batch size. Use 1GB as
> > the batch size instead, as it is evenly divisible by all supported
> > hugepage sizes across 4K, 16K, and 64K paging.
> > 
> > Signed-off-by: Oliver Upton 
> > ---
> > 
> > Applies to 6.0-rc3. Tested with 4K and 64K pages with the above
> > dirty_log_perf_test command and noticed no more soft lockups. I don't
> > have a 16K system to test with.
> > 
> > Marc, we spoke about this a while ago and agreed to go for some page
> > table level based batching scheme. However, I decided against that
> > because it doesn't really solve the problem for non-4K kernels.
> > 
> >  arch/arm64/include/asm/stage2_pgtable.h | 20 
> >  arch/arm64/kvm/mmu.c|  8 +++-
> >  2 files changed, 7 insertions(+), 21 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/stage2_pgtable.h 
> > b/arch/arm64/include/asm/stage2_pgtable.h
> > index fe341a6578c3..c8dca8ae359c 100644
> > --- a/arch/arm64/include/asm/stage2_pgtable.h
> > +++ b/arch/arm64/include/asm/stage2_pgtable.h
> > @@ -10,13 +10,6 @@
> >  
> >  #include 
> >  
> > -/*
> > - * PGDIR_SHIFT determines the size a top-level page table entry can map
> > - * and depends on the number of levels in the page table. Compute the
> > - * PGDIR_SHIFT for a given number of levels.
> > - */
> > -#define pt_levels_pgdir_shift(lvls)ARM64_HW_PGTABLE_LEVEL_SHIFT(4 
> > - (lvls))
> > -
> >  /*
> >   * The hardware supports concatenation of up to 16 tables at stage2 entry
> >   * level and we use the feature whenever possible, which means we resolve 4
> > @@ -30,11 +23,6 @@
> >  #define stage2_pgtable_levels(ipa) ARM64_HW_PGTABLE_LEVELS((ipa) - 4)
> >  #define kvm_stage2_levels(kvm) VTCR_EL2_LVLS(kvm->arch.vtcr)
> >  
> > -/* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for 
> > the VM */
> > -#define stage2_pgdir_shift(kvm)
> > pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> > -#define stage2_pgdir_size(kvm) (1ULL << 
> > stage2_pgdir_shift(kvm))
> > -#define stage2_pgdir_mask(kvm) ~(stage2_pgdir_size(kvm) - 1)
> > -
> >  /*
> >   * kvm_mmmu_cache_min_pages() is the number of pages required to install
> >   * a stage-2 translation. We 

[PATCH v5] KVM: arm64: Ignore kvm-arm.mode if !is_hyp_mode_available()

2022-09-20 Thread Elliot Berman
Ignore kvm-arm.mode if !is_hyp_mode_available(). Specifically, we want
to avoid switching kvm_mode to KVM_MODE_PROTECTED if hypervisor mode is
not available. This prevents "Protected KVM" cpu capability being
reported when Linux is booting in EL1 and would not have KVM enabled.
Reasonably though, we should warn if the command line is requesting a
KVM mode at all if KVM isn't actually available. Allow
"kvm-arm.mode=none" to skip the warning since this would disable KVM
anyway.

Signed-off-by: Elliot Berman 
---
 arch/arm64/kvm/arm.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 8fe73ee5fa84..34cd0570e433 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2271,6 +2271,16 @@ static int __init early_kvm_mode_cfg(char *arg)
if (!arg)
return -EINVAL;
 
+   if (strcmp(arg, "none") == 0) {
+   kvm_mode = KVM_MODE_NONE;
+   return 0;
+   }
+
+   if (!is_hyp_mode_available()) {
+   pr_warn_once("KVM is not available. Ignoring kvm-arm.mode\n");
+   return 0;
+   }
+
if (strcmp(arg, "protected") == 0) {
if (!is_kernel_in_hyp_mode())
kvm_mode = KVM_MODE_PROTECTED;
@@ -2285,11 +2295,6 @@ static int __init early_kvm_mode_cfg(char *arg)
return 0;
}
 
-   if (strcmp(arg, "none") == 0) {
-   kvm_mode = KVM_MODE_NONE;
-   return 0;
-   }
-
return -EINVAL;
 }
 early_param("kvm-arm.mode", early_kvm_mode_cfg);

base-commit: 0982c8d859f8f7022b9fd44d421c7ec721bb41f9
-- 
2.25.1

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 5/7] arm64/fpsimd: Load FP state based on recorded data type

2022-09-20 Thread Mark Brown
On Tue, Sep 20, 2022 at 07:19:57PM +0100, Marc Zyngier wrote:
> Mark Brown  wrote:

> > Now that we are recording the type of floating point register state we
> > are saving when we save it we can use that information when we load to
> > decide which register state is required and bring the TIF_SVE state into
> > sync with the loaded register state.

> Really, this sentence makes zero sense to me. Please at least add some
> punctuation, because the only words that spring to mind here are "DOES
> NOT COMPUTE".

I'll try to come up with something...

> > +   default:
> > +   /*
> > +* This should never happen, we should always
> > +* record what we saved when we save. We
> > +* always at least have the memory allocated
> > +* for FPSIMD registers so try that and hope
> > +* for the best.
> > +*/
> > +   WARN_ON_ONCE(1);
> > +   clear_thread_flag(TIF_SVE);
> > +   break;

> What makes it impossible for FP_STATE_TASK to reach this point? If
> that's indeed an impossible case, please document it.

That's what the "we should always record what we saved when we
saved" is doing, and the comment in the header about it not being
valid to record _TASK as a saved state.  When we write the
register state to memory we must always write either FPSIMD or
SVE register values depending on which registers we saved state
for.  _TASK is not a meaningful state for stored register values.


signature.asc
Description: PGP signature
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] KVM: arm64: Limit stage2_apply_range() batch size to 1GB

2022-09-20 Thread Ricardo Koller
On Tue, Sep 20, 2022 at 06:36:29PM +, Oliver Upton wrote:
> Presently stage2_apply_range() works on a batch of memory addressed by a
> stage 2 root table entry for the VM. Depending on the IPA limit of the
> VM and PAGE_SIZE of the host, this could address a massive range of
> memory. Some examples:
> 
>   4 level, 4K paging -> 512 GB batch size
> 
>   3 level, 64K paging -> 4TB batch size
> 
> Unsurprisingly, working on such a large range of memory can lead to soft
> lockups. When running dirty_log_perf_test:
> 
>   ./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48
> 
>   watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703]
>   Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd 
> sha3_generic gq(O)
>   CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G   O   
> 6.0.0-smp-DEV #1
>   pstate: 8049 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>   pc : dcache_clean_inval_poc+0x24/0x38
>   lr : clean_dcache_guest_page+0x28/0x4c
>   sp : 800021763990
>   pmr_save: 00e0
>   x29: 800021763990 x28: 0005 x27: 0de0
>   x26: 0001 x25: 00400830b13bc77f x24: ad4f91ead9c0
>   x23:  x22: 882ad9c8 x21: fffafa7bc000
>   x20: ad4f9066ce50 x19: 0003 x18: ad4f92402000
>   x17: 011b x16: 011b x15: 0124
>   x14: 07ff8301d280 x13:  x12: 
>   x11: 00010001 x10: fc00 x9 : ad4f9069e580
>   x8 : 000c x7 :  x6 : 003f
>   x5 : 07ffa2076980 x4 : 0001 x3 : 003f
>   x2 : 0040 x1 : 0830313bd000 x0 : 0830313bcc40
>   Call trace:
>dcache_clean_inval_poc+0x24/0x38
>stage2_unmap_walker+0x138/0x1ec
>__kvm_pgtable_walk+0x130/0x1d4
>__kvm_pgtable_walk+0x170/0x1d4
>__kvm_pgtable_walk+0x170/0x1d4
>__kvm_pgtable_walk+0x170/0x1d4
>kvm_pgtable_stage2_unmap+0xc4/0xf8
>kvm_arch_flush_shadow_memslot+0xa4/0x10c
>kvm_set_memslot+0xb8/0x454
>__kvm_set_memory_region+0x194/0x244
>kvm_vm_ioctl_set_memory_region+0x58/0x7c
>kvm_vm_ioctl+0x49c/0x560
>__arm64_sys_ioctl+0x9c/0xd4
>invoke_syscall+0x4c/0x124
>el0_svc_common+0xc8/0x194
>do_el0_svc+0x38/0xc0
>el0_svc+0x2c/0xa4
>el0t_64_sync_handler+0x84/0xf0
>el0t_64_sync+0x1a0/0x1a4
> 
> Given the various paging configurations used by KVM at stage 2 there
> isn't a sensible page table level to use as the batch size. Use 1GB as
> the batch size instead, as it is evenly divisible by all supported
> hugepage sizes across 4K, 16K, and 64K paging.
> 
> Signed-off-by: Oliver Upton 
> ---
> 
> Applies to 6.0-rc3. Tested with 4K and 64K pages with the above
> dirty_log_perf_test command and noticed no more soft lockups. I don't
> have a 16K system to test with.
> 
> Marc, we spoke about this a while ago and agreed to go for some page
> table level based batching scheme. However, I decided against that
> because it doesn't really solve the problem for non-4K kernels.
> 
>  arch/arm64/include/asm/stage2_pgtable.h | 20 
>  arch/arm64/kvm/mmu.c|  8 +++-
>  2 files changed, 7 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h 
> b/arch/arm64/include/asm/stage2_pgtable.h
> index fe341a6578c3..c8dca8ae359c 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -10,13 +10,6 @@
>  
>  #include 
>  
> -/*
> - * PGDIR_SHIFT determines the size a top-level page table entry can map
> - * and depends on the number of levels in the page table. Compute the
> - * PGDIR_SHIFT for a given number of levels.
> - */
> -#define pt_levels_pgdir_shift(lvls)  ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (lvls))
> -
>  /*
>   * The hardware supports concatenation of up to 16 tables at stage2 entry
>   * level and we use the feature whenever possible, which means we resolve 4
> @@ -30,11 +23,6 @@
>  #define stage2_pgtable_levels(ipa)   ARM64_HW_PGTABLE_LEVELS((ipa) - 4)
>  #define kvm_stage2_levels(kvm)   VTCR_EL2_LVLS(kvm->arch.vtcr)
>  
> -/* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the 
> VM */
> -#define stage2_pgdir_shift(kvm)  
> pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> -#define stage2_pgdir_size(kvm)   (1ULL << 
> stage2_pgdir_shift(kvm))
> -#define stage2_pgdir_mask(kvm)   ~(stage2_pgdir_size(kvm) - 1)
> -
>  /*
>   * kvm_mmmu_cache_min_pages() is the number of pages required to install
>   * a stage-2 translation. We pre-allocate the entry level page table at
> @@ -42,12 +30,4 @@
>   */
>  #define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
>  
> -static inline phys_addr_t
> -stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> -{
> - phys_addr_t boundary = (addr + 

Re: [PATCH v7 08/13] KVM: selftests: Use the right memslot for code, page-tables, and data allocations

2022-09-20 Thread Ricardo Koller
On Tue, Sep 20, 2022 at 06:40:03PM +, Sean Christopherson wrote:
> On Tue, Sep 20, 2022, Ricardo Koller wrote:
> > On Tue, Sep 20, 2022 at 06:07:13PM +, Sean Christopherson wrote:
> > > On Tue, Sep 20, 2022, Ricardo Koller wrote:
> > > > The previous commit added support for callers of vm_create() to 
> > > > specify
> > > 
> > > Changelog is stale, vm_create() no longer takes the struct.
> > > 
> > > Side topic, it's usually a good idea to use "strong" terminology when 
> > > referencing
> > > past/future changes, e.g. if patches get shuffled around for whatever 
> > > reason,
> > > then "previous commit" may become stale/misleading.
> > > 
> > > It's fairly easy to convey the same info ("something happened recently" or
> > > "something is going to happen soon") without being so explicit, e.g.
> > > 
> > >   Wire up common code to use the appropriate code, page table, and data
> > >   memslots that were recently added instead of hardcoding '0' for the
> > >   memslot.
> > > 
> > > or
> > > 
> > >   Now that kvm_vm allows specifying different memslots for code, page
> > >   tables, and data, use the appropriate memslot when making allocations
> > >   in common/library code.
> > > 
> > > > what memslots to use for code, page-tables, and data allocations. Change
> > > > them accordingly:
> > > > 
> > > > - stacks, code, and exception tables use the code memslot
> > > 
> > > Huh?  Stacks and exceptions are very much data, not code.
> > >
> > 
> > I would *really* like to have the data region only store test data. It
> > makes things easier for the test implementation, like owning the whole
> > region.
> 
> That's fine, but allocating stack as "code" is super confusing.
> 
> > At the same time I wanted to have a single region for all the "core pages" like
> > code, descriptors, exception tables, stacks, etc. Not sure what to call it
> > though.
> 
> Why?  Code is very different than all those other things.  E.g. the main 
> reason
> KVM doesn't provide "not-executable" or "execute-only" memslots is because 
> there's
> never been a compelling use case, not because it's difficult to implement.  
> If KVM
> were to ever add such functionality, then we'd want/need selftests to have a
> dedicated code memslot.
> 
> > So, what about one of these 2 options:
> > 
> > Option A: 3 regions, where we call the "code" region something else, like
> > "core".
> > Option B: 4 regions: code, page-tables, core-data (stacks, exception 
> > tables, etc),
> > test-data.
> 
> I like (B), though I'd just call 'em "DATA" and "TEST_DATA".  IIUC, TEST_DATA 
> is
> the one you want to be special, i.e. it's ok if something that's not "core" 
> allocates
> in DATA, but it's not ok if "core" allocates in TEST_DATA.  That yields an 
> easy
> to understand "never use TEST_DATA" rule for library/common/core 
> functionality,
> with the code vs. page tables vs. data decision (hopefully) being fairly 
> obvious.
> 
> Defining CORE_DATA will force developers to make judgement calls and probably
> lead to bikeshedding over whether something is considered "core" code.

Sounds good, Option B then (with code, pt, data, test-data).
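
For illustration, the four region types agreed here might end up looking
something like this in the library (a sketch only; MEM_REGION_CODE and
MEM_REGION_TEST_DATA are assumed names based on this thread, not
necessarily the final code):

  enum kvm_mem_region_type {
          MEM_REGION_CODE,        /* guest code */
          MEM_REGION_PT,          /* page tables and the pgd */
          MEM_REGION_DATA,        /* "core" data: stacks, exception tables, descriptors */
          MEM_REGION_TEST_DATA,   /* data owned entirely by the test itself */
          NR_MEM_REGIONS,
  };

Allocators would then pick the backing memslot as vm->memslots[type], with
every entry defaulting to memslot #0.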

Thanks,
Ricardo
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 4/7] arm64/fpsimd: Stop using TIF_SVE to manage register saving in KVM

2022-09-20 Thread Mark Brown
On Tue, Sep 20, 2022 at 07:04:24PM +0100, Marc Zyngier wrote:
> Mark Brown  wrote:

> > -   switch (last->to_save) {
> > -   case FP_STATE_TASK:
> > -   break;
> > -   case FP_STATE_FPSIMD:
> > -   WARN_ON_ONCE(save_sve_regs);
> > -   break;
> > -   case FP_STATE_SVE:
> > -   WARN_ON_ONCE(!save_sve_regs);
> > -   break;
> > -   }

> Given how short-lived this code is, consider dropping it altogether.
> Actually, the previous patch would make a lot more sense if it was
> merged with this one.

My thinking here is to introduce the state tracking and the
behaviour change separately to make it easier to unpick things if
anything goes wrong; it means that the behaviour change is in
clearly isolated patches, separate from the more widespread
changes.  The early patches make it more explicit what we
are currently doing, the later ones do new things.


signature.asc
Description: PGP signature
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 08/13] KVM: selftests: Use the right memslot for code, page-tables, and data allocations

2022-09-20 Thread Sean Christopherson
On Tue, Sep 20, 2022, Ricardo Koller wrote:
> On Tue, Sep 20, 2022 at 06:07:13PM +, Sean Christopherson wrote:
> > On Tue, Sep 20, 2022, Ricardo Koller wrote:
> > > The previous commit added support for callers of vm_create() to 
> > > specify
> > 
> > Changelog is stale, vm_create() no longer takes the struct.
> > 
> > Side topic, it's usually a good idea to use "strong" terminology when 
> > referencing
> > past/future changes, e.g. if patches get shuffled around for whatever 
> > reason,
> > then "previous commit" may become stale/misleading.
> > 
> > It's fairly easy to convey the same info ("something happened recently" or
> > "something is going to happen soon") without being so explicit, e.g.
> > 
> >   Wire up common code to use the appropriate code, page table, and data
> >   memslots that were recently added instead of hardcoding '0' for the
> >   memslot.
> > 
> > or
> > 
> >   Now that kvm_vm allows specifying different memslots for code, page
> >   tables, and data, use the appropriate memslot when making allocations
> >   in common/library code.
> > 
> > > what memslots to use for code, page-tables, and data allocations. Change
> > > them accordingly:
> > > 
> > > - stacks, code, and exception tables use the code memslot
> > 
> > Huh?  Stacks and exceptions are very much data, not code.
> >
> 
> I would *really* like to have the data region only store test data. It
> makes things easier for the test implementation, like owning the whole
> region.

That's fine, but allocating stack as "code" is super confusing.

> At the same time I wanted to have a single region for all the "core pages" like
> code, descriptors, exception tables, stacks, etc. Not sure what to call it
> though.

Why?  Code is very different than all those other things.  E.g. the main reason
KVM doesn't provide "not-executable" or "execute-only" memslots is because 
there's
never been a compelling use case, not because it's difficult to implement.  If 
KVM
were to ever add such functionality, then we'd want/need selftests to have a
dedicated code memslot.

> So, what about one of these 2 options:
> 
> Option A: 3 regions, where we call the "code" region something else, like
> "core".
> Option B: 4 regions: code, page-tables, core-data (stacks, exception tables, 
> etc),
> test-data.

I like (B), though I'd just call 'em "DATA" and "TEST_DATA".  IIUC, TEST_DATA is
the one you want to be special, i.e. it's ok if something that's not "core" 
allocates
in DATA, but it's not ok if "core" allocates in TEST_DATA.  That yields an easy
to understand "never use TEST_DATA" rule for library/common/core functionality,
with the code vs. page tables vs. data decision (hopefully) being fairly 
obvious.

Defining CORE_DATA will force developers to make judgement calls and probably
lead to bikeshedding over whether something is considered "core" code.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH] KVM: arm64: Limit stage2_apply_range() batch size to 1GB

2022-09-20 Thread Oliver Upton
Presently stage2_apply_range() works on a batch of memory addressed by a
stage 2 root table entry for the VM. Depending on the IPA limit of the
VM and PAGE_SIZE of the host, this could address a massive range of
memory. Some examples:

  4 level, 4K paging -> 512 GB batch size

  3 level, 64K paging -> 4TB batch size

Unsurprisingly, working on such a large range of memory can lead to soft
lockups. When running dirty_log_perf_test:

  ./dirty_log_perf_test -m -2 -s anonymous_thp -b 4G -v 48

  watchdog: BUG: soft lockup - CPU#0 stuck for 45s! [dirty_log_perf_:16703]
  Modules linked in: vfat fat cdc_ether usbnet mii xhci_pci xhci_hcd 
sha3_generic gq(O)
  CPU: 0 PID: 16703 Comm: dirty_log_perf_ Tainted: G   O   
6.0.0-smp-DEV #1
  pstate: 8049 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : dcache_clean_inval_poc+0x24/0x38
  lr : clean_dcache_guest_page+0x28/0x4c
  sp : 800021763990
  pmr_save: 00e0
  x29: 800021763990 x28: 0005 x27: 0de0
  x26: 0001 x25: 00400830b13bc77f x24: ad4f91ead9c0
  x23:  x22: 882ad9c8 x21: fffafa7bc000
  x20: ad4f9066ce50 x19: 0003 x18: ad4f92402000
  x17: 011b x16: 011b x15: 0124
  x14: 07ff8301d280 x13:  x12: 
  x11: 00010001 x10: fc00 x9 : ad4f9069e580
  x8 : 000c x7 :  x6 : 003f
  x5 : 07ffa2076980 x4 : 0001 x3 : 003f
  x2 : 0040 x1 : 0830313bd000 x0 : 0830313bcc40
  Call trace:
   dcache_clean_inval_poc+0x24/0x38
   stage2_unmap_walker+0x138/0x1ec
   __kvm_pgtable_walk+0x130/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   __kvm_pgtable_walk+0x170/0x1d4
   kvm_pgtable_stage2_unmap+0xc4/0xf8
   kvm_arch_flush_shadow_memslot+0xa4/0x10c
   kvm_set_memslot+0xb8/0x454
   __kvm_set_memory_region+0x194/0x244
   kvm_vm_ioctl_set_memory_region+0x58/0x7c
   kvm_vm_ioctl+0x49c/0x560
   __arm64_sys_ioctl+0x9c/0xd4
   invoke_syscall+0x4c/0x124
   el0_svc_common+0xc8/0x194
   do_el0_svc+0x38/0xc0
   el0_svc+0x2c/0xa4
   el0t_64_sync_handler+0x84/0xf0
   el0t_64_sync+0x1a0/0x1a4

Given the various paging configurations used by KVM at stage 2 there
isn't a sensible page table level to use as the batch size. Use 1GB as
the batch size instead, as it is evenly divisible by all supported
hugepage sizes across 4K, 16K, and 64K paging.

Signed-off-by: Oliver Upton 
---

Applies to 6.0-rc3. Tested with 4K and 64K pages with the above
dirty_log_perf_test command and noticed no more soft lockups. I don't
have a 16K system to test with.

Marc, we spoke about this a while ago and agreed to go for some page
table level based batching scheme. However, I decided against that
because it doesn't really solve the problem for non-4K kernels.
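
For reference, a minimal sketch of how the walk can be clamped to a fixed
1GB stride (the helper name is an assumption, not necessarily what the
final mmu.c hunk uses):

  /* Clamp a stage-2 walk to the next 1GB boundary so a batch stays bounded. */
  static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
  {
          phys_addr_t boundary = ALIGN_DOWN(addr + SZ_1G, SZ_1G);

          return (boundary - 1 < end - 1) ? boundary : end;
  }

stage2_apply_range() would call this instead of stage2_pgd_addr_end() when
carving up [addr, end).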

 arch/arm64/include/asm/stage2_pgtable.h | 20 
 arch/arm64/kvm/mmu.c|  8 +++-
 2 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/stage2_pgtable.h 
b/arch/arm64/include/asm/stage2_pgtable.h
index fe341a6578c3..c8dca8ae359c 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -10,13 +10,6 @@
 
 #include 
 
-/*
- * PGDIR_SHIFT determines the size a top-level page table entry can map
- * and depends on the number of levels in the page table. Compute the
- * PGDIR_SHIFT for a given number of levels.
- */
-#define pt_levels_pgdir_shift(lvls)ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (lvls))
-
 /*
  * The hardware supports concatenation of up to 16 tables at stage2 entry
  * level and we use the feature whenever possible, which means we resolve 4
@@ -30,11 +23,6 @@
 #define stage2_pgtable_levels(ipa) ARM64_HW_PGTABLE_LEVELS((ipa) - 4)
 #define kvm_stage2_levels(kvm) VTCR_EL2_LVLS(kvm->arch.vtcr)
 
-/* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the 
VM */
-#define stage2_pgdir_shift(kvm)
pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
-#define stage2_pgdir_size(kvm) (1ULL << stage2_pgdir_shift(kvm))
-#define stage2_pgdir_mask(kvm) ~(stage2_pgdir_size(kvm) - 1)
-
 /*
  * kvm_mmmu_cache_min_pages() is the number of pages required to install
  * a stage-2 translation. We pre-allocate the entry level page table at
@@ -42,12 +30,4 @@
  */
 #define kvm_mmu_cache_min_pages(kvm)   (kvm_stage2_levels(kvm) - 1)
 
-static inline phys_addr_t
-stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
-{
-   phys_addr_t boundary = (addr + stage2_pgdir_size(kvm)) & 
stage2_pgdir_mask(kvm);
-
-   return (boundary - 1 < end - 1) ? boundary : end;
-}
-
 #endif /* __ARM64_S2_PGTABLE_H_ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9a13e487187..d64032b9fbb6 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,6 +31,12 @@ 

Re: [PATCH v3 3/7] arm64/fpsimd: Have KVM explicitly say which FP registers to save

2022-09-20 Thread Mark Brown
On Tue, Sep 20, 2022 at 06:52:59PM +0100, Marc Zyngier wrote:
> On Mon, 15 Aug 2022 23:55:25 +0100,
> Mark Brown  wrote:

> >  enum fp_state {
> > +   FP_STATE_TASK,  /* Save based on current, invalid as fp_type */

> How is that related to the FP_TYPE_TASK in the commit message? What

TYPE in the commit message should be STATE.

> does this 'invalid as fp_type' mean?

It means that using FP_STATE_TASK as a value for the fp_type
member of the task struct recording what type of state is
currently stored for the task is not valid; one of the other two
values representing what was actually saved must be chosen.
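
Spelled out as a sketch (the comment wording here is illustrative, not
the posted code):

  enum fp_state {
          /*
           * Only a request: save whatever format the task's state is
           * currently in.  Never valid as a record (fp_type) of what
           * was actually saved.
           */
          FP_STATE_TASK,
          /* The FPSIMD V registers were saved. */
          FP_STATE_FPSIMD,
          /* The full SVE Z/P register state was saved. */
          FP_STATE_SVE,
  };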

> > +   /*
> > +* For now we're just validating that the requested state is
> > +* consistent with what we'd otherwise work out.

> Nit: work out? or worked out? the "we'd" doesn't help disambiguate it
> for a non-native speaker.

we'd == we would, so "work out" to match the tense.

> >  void fpsimd_bind_state_to_cpu(struct user_fpsimd_state *st, void 
> > *sve_state,
> >   unsigned int sve_vl, void *za_state,
> >   unsigned int sme_vl, u64 *svcr,
> > - enum fp_state *type)
> > + enum fp_state *type, enum fp_state to_save)

> OK, how many discrete arguments are we going to pass to this function,
> most of which are part of the vcpu structure? It really feels like
> what you want is a getter for the per-cpu structure, and let the KVM
> code do the actual business. If this function was supposed to provide
> some level of abstraction, well, it's a fail.

I agree that this is not an ideal interface; I am merely
following the previously chosen idiom since I haven't been able
to figure out why we were doing it in the first place, and with a
lot of these things it turns out that there's some actual reason.

It's not even like fpsimd_bind_task_to_cpu() has ever been
written in terms of this function, there's two parallel
implementations.  My best guess was that it was some combination
of not peering at KVM internals and keeping struct
fpsimd_last_state_struct internal to fpsimd.c (since we're
effectively just passing one of those in in a more verbose form)
but never anything solid enough to be sure.

> >  void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu)
> >  {
> > +   enum fp_state fp_type;
> > +
> > WARN_ON_ONCE(!irqs_disabled());
> >  
> > if (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED) {
> > +   if (vcpu_has_sve(vcpu))
> > +   fp_type = FP_STATE_SVE;

> Eventually, I'd like to relax this, and start tracking the actual use
> of the guest rather than assuming that SVE guest use SVE at all times
> (odds are they won't).

> I hope this series still leaves us with this option.

Yes, it probably makes it more tractable: with KVM being able to
just say what type of state it wants to save there's less syncing
with the host task to take care of, so the code is a lot more
direct - it will just be a case of setting the desired fp_type
whenever a decision is made about what state type to save.


signature.asc
Description: PGP signature
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 2/7] arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE

2022-09-20 Thread Marc Zyngier
On Tue, 20 Sep 2022 19:09:15 +0100,
Mark Brown  wrote:
> 
> [1  ]
> On Tue, Sep 20, 2022 at 06:14:13PM +0100, Marc Zyngier wrote:
> > Mark Brown  wrote:
> 
> > > When we save the state for the floating point registers this can be done
> > > in the form visible through either the FPSIMD V registers or the SVE Z and
> > > P registers. At present we track which format is currently used based on
> > > TIF_SVE and the SME streaming mode state but particularly in the SVE case
> > > this limits our options for optimising things, especially around syscalls.
> > > Introduce a new enum in thread_struct which explicitly states which format
> > > is active and keep it up to date when we change it.
> 
> > > At present we do not use this state except to verify that it has the
> > > expected value when loading the state, future patches will introduce
> > > functional changes.
> 
> > > + enum fp_state fp_type;
> 
> > Is it a state or a type? Some consistency would help. Also, what does
> 
> We can bikeshed this either way - the state currently stored is
> of a particular type.  I'll probably go for type.

Then please do it consistently. At the moment, this is a bizarre mix
of the two, and this is already hard enough to reason about that
we don't need extra complexity!

> 
> > this represent? Your commit message keeps talking about the FP/SVE
> > state for the host, but this is obviously a guest-related structure.
> > How do the two relate?
> 
> The commit message talks about saving the floating point state in
> general which is something we do for both the host and the guest.
> The optimisation cases I am focusing on right now are more on
> host usage but the complexity with tracking that currently blocks
> them crosses both host and guest, indeed the biggest improvement
> overall is probably that tracking the guest state stops requiring
> us to fiddle with the host task's state which to me at least
> makes things clearer.

At least for the KVM part, I want a clear comment explaining what this
tracks and how this is used, because at the moment, I'm only guessing.
And I've had enough guessing with this code...

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v7 08/13] KVM: selftests: Use the right memslot for code, page-tables, and data allocations

2022-09-20 Thread Ricardo Koller
On Tue, Sep 20, 2022 at 06:07:13PM +, Sean Christopherson wrote:
> On Tue, Sep 20, 2022, Ricardo Koller wrote:
> > The previous commit added support for callers of vm_create() to specify
> 
> Changelog is stale, vm_create() no longer takes the struct.
> 
> Side topic, it's usually a good idea to use "strong" terminology when 
> referencing
> past/future changes, e.g. if patches get shuffled around for whatever reason,
> then "previous commit" may become stale/misleading.
> 
> It's fairly easy to convey the same info ("something happened recently" or
> "something is going to happen soon") without being so explicit, e.g.
> 
>   Wire up common code to use the appropriate code, page table, and data
>   memslots that were recently added instead of hardcoding '0' for the
>   memslot.
> 
> or
> 
>   Now that kvm_vm allows specifying different memslots for code, page
>   tables, and data, use the appropriate memslot when making allocations
>   in common/library code.
> 
> > what memslots to use for code, page-tables, and data allocations. Change
> > them accordingly:
> > 
> > - stacks, code, and exception tables use the code memslot
> 
> Huh?  Stacks and exceptions are very much data, not code.
>

I would *really* like to have the data region only store test data. It
makes things easier for the test implementation, like owning the whole
region. At the same time I wanted to have a single region for all the "core
pages" like code, descriptors, exception tables, stacks, etc. Not sure
what to call it though. So, what about one of these 2 options:

Option A: 3 regions, where we call the "code" region something else, like
"core".
Option B: 4 regions: code, page-tables, core-data (stacks, exception tables, 
etc),
test-data.

> > - page tables and the pgd use the pt memslot
> > - data (anything allocated with vm_vaddr_alloc()) uses the data memslot
> > 
> > No functional change intended. All allocators keep using memslot #0.
> 
> ...
> 
> > -vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t 
> > vaddr_min)
> > +vm_vaddr_t __vm_vaddr_alloc(struct kvm_vm *vm, size_t sz,
> > +   vm_vaddr_t vaddr_min, enum kvm_mem_region_type mrt)
> 
> Similar to other feedback, I'd prefer "type" over "mrt".
> 
> >  {
> > uint64_t pages = (sz >> vm->page_shift) + ((sz % vm->page_size) != 0);
> >  
> > virt_pgd_alloc(vm);
> > vm_paddr_t paddr = vm_phy_pages_alloc(vm, pages,
> > - KVM_UTIL_MIN_PFN * vm->page_size, 
> > 0);
> > +   KVM_UTIL_MIN_PFN * vm->page_size,
> > +   vm->memslots[mrt]);
> 
> Please align parameters.
> 
> >  
> > /*
> >  * Find an unused range of virtual page addresses of at least
> > @@ -1230,6 +1213,30 @@ vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t 
> > sz, vm_vaddr_t vaddr_min)
> > return vaddr_start;
> >  }
> 
> ...
> 
> > +vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t 
> > vaddr_min)
> > +{
> > +   return __vm_vaddr_alloc(vm, sz, vaddr_min, MEM_REGION_DATA);
> 
> I like the vm_alloc_page_table() wrapper, what about adding vm_alloc_code() 
> and
> vm_alloc_data() too?  Feel free to ignore if it's not much of a benefit.
> 
> > +}
> > +
> >  /*
> >   * VM Virtual Address Allocate Pages
> >   *
> > @@ -1249,6 +1256,11 @@ vm_vaddr_t vm_vaddr_alloc_pages(struct kvm_vm *vm, 
> > int nr_pages)
> > return vm_vaddr_alloc(vm, nr_pages * getpagesize(), KVM_UTIL_MIN_VADDR);
> >  }
> >  
> > +vm_vaddr_t __vm_vaddr_alloc_page(struct kvm_vm *vm, enum 
> > kvm_mem_region_type mrt)
> > +{
> > +   return __vm_vaddr_alloc(vm, getpagesize(), KVM_UTIL_MIN_VADDR, mrt);
> > +}
> > +
> >  /*
> >   * VM Virtual Address Allocate Page
> >   *
> > @@ -1814,7 +1826,8 @@ vm_paddr_t vm_phy_page_alloc(struct kvm_vm *vm, 
> > vm_paddr_t paddr_min,
> >  
> >  vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm)
> >  {
> > -   return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0);
> > +   return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR,
> > +vm->memslots[MEM_REGION_PT]);
> >  }
> >  
> >  /*
> > diff --git a/tools/testing/selftests/kvm/lib/riscv/processor.c 
> > b/tools/testing/selftests/kvm/lib/riscv/processor.c
> > index 604478151212..26c8d3dffb9a 100644
> > --- a/tools/testing/selftests/kvm/lib/riscv/processor.c
> > +++ b/tools/testing/selftests/kvm/lib/riscv/processor.c
> > @@ -58,7 +58,7 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
> > if (!vm->pgd_created) {
> > vm_paddr_t paddr = vm_phy_pages_alloc(vm,
> > page_align(vm, ptrs_per_pte(vm) * 8) / vm->page_size,
> > -   KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0);
> > +   KVM_GUEST_PAGE_TABLE_MIN_PADDR, 
> > vm->memslots[MEM_REGION_PT]);
> 
> 
> Ah, more copy-paste from aarch64.  Not your code (though the s390x change 
> below is
> new badness), but align params please.  Similar to newlines before function 

Re: [PATCH v3 5/7] arm64/fpsimd: Load FP state based on recorded data type

2022-09-20 Thread Marc Zyngier
On Mon, 15 Aug 2022 23:55:27 +0100,
Mark Brown  wrote:
> 
> Now that we are recording the type of floating point register state we
> are saving when we save it we can use that information when we load to
> decide which register state is required and bring the TIF_SVE state into
> sync with the loaded register state.

Really, this sentence makes zero sense to me. Please at least add some
punctuation, because the only words that spring to mind here are "DOES
NOT COMPUTE".

> 
> The SME state details are already recorded directly in the saved
> SVCR and handled based on the information there.
> 
> Since we are not changing any of the save paths there should be no
> functional change from this patch, further patches will make use of this
> to optimise and clarify the code.
> 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/kernel/fpsimd.c | 39 ++
>  1 file changed, 31 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index aaea2dc02cbd..4096530dd4c6 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -392,11 +392,36 @@ static void task_fpsimd_load(void)
>   WARN_ON(!system_supports_fpsimd());
>   WARN_ON(!have_cpu_fpsimd_context());
>  
> - /* Check if we should restore SVE first */
> - if (IS_ENABLED(CONFIG_ARM64_SVE) && test_thread_flag(TIF_SVE)) {
> - sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
> - restore_sve_regs = true;
> - restore_ffr = true;
> + if (system_supports_sve()) {
> + switch (current->thread.fp_type) {
> + case FP_STATE_FPSIMD:
> + /* Stop tracking SVE for this task until next use. */
> + if (test_and_clear_thread_flag(TIF_SVE))
> + sve_user_disable();
> + break;
> + case FP_STATE_SVE:
> + if (!thread_sm_enabled(&current->thread) &&
> + !WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE)))
> + sve_user_enable();
> +
> + if (test_thread_flag(TIF_SVE))
> + 
> sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
> +
> + restore_sve_regs = true;
> + restore_ffr = true;
> + break;
> + default:
> + /*
> +  * This should never happen, we should always
> +  * record what we saved when we save. We
> +  * always at least have the memory allocated
> +  * for FPSIMD registers so try that and hope
> +  * for the best.
> +  */
> + WARN_ON_ONCE(1);
> + clear_thread_flag(TIF_SVE);
> + break;

What makes it impossible for FP_STATE_TASK to reach this point? If
that's indeed an impossible case, please document it.

> + }
>   }
>  
>   /* Restore SME, override SVE register configuration if needed */
> @@ -412,10 +437,8 @@ static void task_fpsimd_load(void)
>   if (thread_za_enabled(&current->thread))
>   za_load_state(current->thread.za_state);
>  
> - if (thread_sm_enabled(&current->thread)) {
> - restore_sve_regs = true;
> + if (thread_sm_enabled(&current->thread))
>   restore_ffr = system_supports_fa64();
> - }
>   }
>  
>   if (restore_sve_regs) {

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v3 2/7] arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE

2022-09-20 Thread Mark Brown
On Tue, Sep 20, 2022 at 06:14:13PM +0100, Marc Zyngier wrote:
> Mark Brown  wrote:

> > When we save the state for the floating point registers this can be done
> > in the form visible through either the FPSIMD V registers or the SVE Z and
> > P registers. At present we track which format is currently used based on
> > TIF_SVE and the SME streaming mode state but particularly in the SVE case
> > this limits our options for optimising things, especially around syscalls.
> > Introduce a new enum in thread_struct which explicitly states which format
> > is active and keep it up to date when we change it.

> > At present we do not use this state except to verify that it has the
> > expected value when loading the state, future patches will introduce
> > functional changes.

> > +   enum fp_state fp_type;

> Is it a state or a type? Some consistency would help. Also, what does

We can bikeshed this either way - the state currently stored is
of a particular type.  I'll probably go for type.

> this represent? Your commit message keeps talking about the FP/SVE
> state for the host, but this is obviously a guest-related structure.
> How do the two relate?

The commit message talks about saving the floating point state in
general which is something we do for both the host and the guest.
The optimisation cases I am focusing on right now are more on
host usage but the complexity with tracking that currently blocks
them crosses both host and guest, indeed the biggest improvement
overall is probably that tracking the guest state stops requiring
us to fiddle with the host task's state which to me at least
makes things clearer.




Re: [PATCH v7 08/13] KVM: selftests: Use the right memslot for code, page-tables, and data allocations

2022-09-20 Thread Sean Christopherson
On Tue, Sep 20, 2022, Ricardo Koller wrote:
> The previous commit added support for callers of vm_create() to specify

Changelog is stale, vm_create() no longer takes the struct.

Side topic, it's usually a good idea to avoid "strong" terminology when 
referencing
past/future changes, e.g. if patches get shuffled around for whatever reason,
then "previous commit" may become stale/misleading.

It's fairly easy to convey the same info ("something happened recently" or
"something is going to happen soon") without being so explicit, e.g.

  Wire up common code to use the appropriate code, page table, and data
  memslots that were recently added instead of hardcoding '0' for the
  memslot.

or

  Now that kvm_vm allows specifying different memslots for code, page
  tables, and data, use the appropriate memslot when making allocations
  in common/library code.

> what memslots to use for code, page-tables, and data allocations. Change
> them accordingly:
> 
> - stacks, code, and exception tables use the code memslot

Huh?  Stacks and exceptions are very much data, not code.

> - page tables and the pgd use the pt memslot
> - data (anything allocated with vm_vaddr_alloc()) uses the data memslot
> 
> No functional change intended. All allocators keep using memslot #0.

...

> -vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min)
> +vm_vaddr_t __vm_vaddr_alloc(struct kvm_vm *vm, size_t sz,
> + vm_vaddr_t vaddr_min, enum kvm_mem_region_type mrt)

Similar to other feedback, I'd prefer "type" over "mrt".

>  {
>   uint64_t pages = (sz >> vm->page_shift) + ((sz % vm->page_size) != 0);
>  
>   virt_pgd_alloc(vm);
>   vm_paddr_t paddr = vm_phy_pages_alloc(vm, pages,
> -   KVM_UTIL_MIN_PFN * vm->page_size, 
> 0);
> + KVM_UTIL_MIN_PFN * vm->page_size,
> + vm->memslots[mrt]);

Please align parameters.

>  
>   /*
>* Find an unused range of virtual page addresses of at least
> @@ -1230,6 +1213,30 @@ vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t 
> sz, vm_vaddr_t vaddr_min)
>   return vaddr_start;
>  }

...

> +vm_vaddr_t vm_vaddr_alloc(struct kvm_vm *vm, size_t sz, vm_vaddr_t vaddr_min)
> +{
> + return __vm_vaddr_alloc(vm, sz, vaddr_min, MEM_REGION_DATA);

I like the vm_alloc_page_table() wrapper, what about adding vm_alloc_code() and
vm_alloc_data() too?  Feel free to ignore if it's not much of a benefit.

> +}
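
For illustration, such wrappers could be as thin as the sketch below.  This
assumes the enum from the earlier patch also carries a MEM_REGION_CODE value
and reuses the KVM_UTIL_MIN_VADDR default; the names and the choice of region
are only Sean's suggestion spelled out, not code from the series:

vm_vaddr_t vm_alloc_code(struct kvm_vm *vm, size_t sz)
{
	/* Code allocations come from the code memslot. */
	return __vm_vaddr_alloc(vm, sz, KVM_UTIL_MIN_VADDR, MEM_REGION_CODE);
}

vm_vaddr_t vm_alloc_data(struct kvm_vm *vm, size_t sz)
{
	/* Data allocations come from the data memslot. */
	return __vm_vaddr_alloc(vm, sz, KVM_UTIL_MIN_VADDR, MEM_REGION_DATA);
}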
> +
>  /*
>   * VM Virtual Address Allocate Pages
>   *
> @@ -1249,6 +1256,11 @@ vm_vaddr_t vm_vaddr_alloc_pages(struct kvm_vm *vm, int 
> nr_pages)
>   return vm_vaddr_alloc(vm, nr_pages * getpagesize(), KVM_UTIL_MIN_VADDR);
>  }
>  
> +vm_vaddr_t __vm_vaddr_alloc_page(struct kvm_vm *vm, enum kvm_mem_region_type 
> mrt)
> +{
> + return __vm_vaddr_alloc(vm, getpagesize(), KVM_UTIL_MIN_VADDR, mrt);
> +}
> +
>  /*
>   * VM Virtual Address Allocate Page
>   *
> @@ -1814,7 +1826,8 @@ vm_paddr_t vm_phy_page_alloc(struct kvm_vm *vm, 
> vm_paddr_t paddr_min,
>  
>  vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm)
>  {
> - return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0);
> + return vm_phy_page_alloc(vm, KVM_GUEST_PAGE_TABLE_MIN_PADDR,
> +  vm->memslots[MEM_REGION_PT]);
>  }
>  
>  /*
> diff --git a/tools/testing/selftests/kvm/lib/riscv/processor.c 
> b/tools/testing/selftests/kvm/lib/riscv/processor.c
> index 604478151212..26c8d3dffb9a 100644
> --- a/tools/testing/selftests/kvm/lib/riscv/processor.c
> +++ b/tools/testing/selftests/kvm/lib/riscv/processor.c
> @@ -58,7 +58,7 @@ void virt_arch_pgd_alloc(struct kvm_vm *vm)
>   if (!vm->pgd_created) {
>   vm_paddr_t paddr = vm_phy_pages_alloc(vm,
>   page_align(vm, ptrs_per_pte(vm) * 8) / vm->page_size,
> - KVM_GUEST_PAGE_TABLE_MIN_PADDR, 0);
> + KVM_GUEST_PAGE_TABLE_MIN_PADDR, 
> vm->memslots[MEM_REGION_PT]);


Ah, more copy-paste from aarch64.  Not your code (though the s390x change below
is new badness), but align params please.  Similar to newlines before function
names, running over the 80 char soft limit is preferable to weird alignment.
And for the soft limit, it's often easy to stay under the soft limit by
refactoring, e.g. in a separate prep patch for RISC-V and aarch64, do:

size_t nr_pages = page_align(vm, ptrs_per_pgd(vm) * 8) / vm->page_size;

if (vm->pgd_created)
return;

vm->pgd = vm_phy_pages_alloc(vm, nr_pages,
 KVM_GUEST_PAGE_TABLE_MIN_PADDR,
 vm->memslots[MEM_REGION_PT]);
vm->pgd_created = true;

>   vm->pgd = paddr;
>   vm->pgd_created = true;
>   }
> @@ -282,8 +282,9 @@ struct kvm_vcpu *vm_arch_vcpu_add(struct kvm_vm *vm, 
> uint32_t vcpu_id,
>   size_t stack_size = 

Re: [PATCH v3 4/7] arm64/fpsimd: Stop using TIF_SVE to manage register saving in KVM

2022-09-20 Thread Marc Zyngier
On Mon, 15 Aug 2022 23:55:26 +0100,
Mark Brown  wrote:
> 
> Now that we are explicitly telling the host FP code which register state
> it needs to save we can remove the manipulation of TIF_SVE from the KVM
> code, simplifying it and allowing us to optimise our handling of normal
> tasks. Remove the manipulation of TIF_SVE from KVM and instead rely on
> to_save to ensure we save the correct data for it.
> 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/kernel/fpsimd.c | 22 --
>  arch/arm64/kvm/fpsimd.c|  3 ---
>  2 files changed, 4 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 7be20ced2c45..aaea2dc02cbd 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -436,8 +436,8 @@ static void task_fpsimd_load(void)
>   * last, if KVM is involved this may be the guest VM context rather
>   * than the host thread for the VM pointed to by current. This means
>   * that we must always reference the state storage via last rather
> - * than via current, other than the TIF_ flags which KVM will
> - * carefully maintain for us.
> + * than via current, if we are saving KVM state then it will have
> + * ensured that the type of registers to save is set in last->to_save.
>   */
>  static void fpsimd_save(void)
>  {
> @@ -454,27 +454,13 @@ static void fpsimd_save(void)
>   if (test_thread_flag(TIF_FOREIGN_FPSTATE))
>   return;
>  
> - if (test_thread_flag(TIF_SVE)) {
> + if ((last->to_save == FP_STATE_TASK && test_thread_flag(TIF_SVE)) ||
> + last->to_save == FP_STATE_SVE) {
>   save_sve_regs = true;
>   save_ffr = true;
>   vl = last->sve_vl;
>   }
>  
> - /*
> -  * For now we're just validating that the requested state is
> -  * consistent with what we'd otherwise work out.
> -  */
> - switch (last->to_save) {
> - case FP_STATE_TASK:
> - break;
> - case FP_STATE_FPSIMD:
> - WARN_ON_ONCE(save_sve_regs);
> - break;
> - case FP_STATE_SVE:
> - WARN_ON_ONCE(!save_sve_regs);
> - break;
> - }
> -

Given how short-lived this code is, consider dropping it altogether.
Actually, the previous patch would make a lot more sense if it was
merged with this one.

>   if (system_supports_sme()) {
>   u64 *svcr = last->svcr;
>  
> diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
> index db0b2bacaeb8..8a79823fce68 100644
> --- a/arch/arm64/kvm/fpsimd.c
> +++ b/arch/arm64/kvm/fpsimd.c
> @@ -151,7 +151,6 @@ void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu)
> 	 &vcpu->arch.fp_type, fp_type);
>  
>   clear_thread_flag(TIF_FOREIGN_FPSTATE);
> - update_thread_flag(TIF_SVE, vcpu_has_sve(vcpu));
>   }
>  }
>  
> @@ -208,7 +207,5 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
>   sysreg_clear_set(CPACR_EL1, CPACR_EL1_ZEN_EL0EN, 0);
>   }
>  
> - update_thread_flag(TIF_SVE, 0);
> -
>   local_irq_restore(flags);
>  }

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v3 3/7] arm64/fpsimd: Have KVM explicitly say which FP registers to save

2022-09-20 Thread Marc Zyngier
On Mon, 15 Aug 2022 23:55:25 +0100,
Mark Brown  wrote:
> 
> In order to avoid needlessly saving and restoring the guest registers KVM
> relies on the host FPSIMD code to save the guest registers when we context
> switch away from the guest. This is done by binding the KVM guest state to
> the CPU on top of the task state that was originally there, then carefully
> managing the TIF_SVE flag for the task to cause the host to save the full
> SVE state when needed regardless of the needs of the host task. This works
> well enough but isn't terribly direct about what is going on and makes it
> much more complicated to try to optimise what we're doing with the SVE
> register state.
> 
> Let's instead have KVM pass in the register state it wants saving when it
> binds to the CPU. We introduce a new FP_TYPE_TASK for use during normal
> task binding to indicate that we should base our decisions on the current
> task. In order to ease any future debugging that might be required this
> patch does not actually update any of the decision making about what to
> save, it merely starts tracking the new information and warns if the
> requested state is not what we would otherwise have decided to save.
> 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/include/asm/fpsimd.h|  3 ++-
>  arch/arm64/include/asm/processor.h |  1 +
>  arch/arm64/kernel/fpsimd.c | 20 +++-
>  arch/arm64/kvm/fpsimd.c|  9 -
>  4 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index b74103a79052..21a1dd320ca5 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -61,7 +61,8 @@ extern void fpsimd_kvm_prepare(void);
>  extern void fpsimd_bind_state_to_cpu(struct user_fpsimd_state *state,
>void *sve_state, unsigned int sve_vl,
>void *za_state, unsigned int sme_vl,
> -  u64 *svcr, enum fp_state *type);
> +  u64 *svcr, enum fp_state *type,
> +  enum fp_state to_save);
>  
>  extern void fpsimd_flush_task_state(struct task_struct *target);
>  extern void fpsimd_save_and_flush_cpu_state(void);
> diff --git a/arch/arm64/include/asm/processor.h 
> b/arch/arm64/include/asm/processor.h
> index 4818a6b77f39..89c248b8d4ba 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -123,6 +123,7 @@ enum vec_type {
>  };
>  
>  enum fp_state {
> + FP_STATE_TASK,  /* Save based on current, invalid as fp_type */

How is that related to the FP_TYPE_TASK in the commit message? What
does this 'invalid as fp_type' mean?

>   FP_STATE_FPSIMD,
>   FP_STATE_SVE,
>  };
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 6544ae00230f..7be20ced2c45 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -126,6 +126,7 @@ struct fpsimd_last_state_struct {
>   unsigned int sve_vl;
>   unsigned int sme_vl;
>   enum fp_state *fp_type;
> + enum fp_state to_save;
>  };
>  
>  static DEFINE_PER_CPU(struct fpsimd_last_state_struct, fpsimd_last_state);
> @@ -459,6 +460,21 @@ static void fpsimd_save(void)
>   vl = last->sve_vl;
>   }
>  
> + /*
> +  * For now we're just validating that the requested state is
> +  * consistent with what we'd otherwise work out.

Nit: work out? or worked out? the "we'd" doesn't help disambiguate it
for a non-native speaker.

> +  */
> + switch (last->to_save) {
> + case FP_STATE_TASK:
> + break;
> + case FP_STATE_FPSIMD:
> + WARN_ON_ONCE(save_sve_regs);
> + break;
> + case FP_STATE_SVE:
> + WARN_ON_ONCE(!save_sve_regs);
> + break;
> + }
> +
>   if (system_supports_sme()) {
>   u64 *svcr = last->svcr;
>  
> @@ -1693,6 +1709,7 @@ static void fpsimd_bind_task_to_cpu(void)
>   last->sme_vl = task_get_sme_vl(current);
>   last->svcr = &current->thread.svcr;
>   last->fp_type = &current->thread.fp_type;
> + last->to_save = FP_STATE_TASK;
>   current->thread.fpsimd_cpu = smp_processor_id();
>  
>   /*
> @@ -1717,7 +1734,7 @@ static void fpsimd_bind_task_to_cpu(void)
>  void fpsimd_bind_state_to_cpu(struct user_fpsimd_state *st, void *sve_state,
> unsigned int sve_vl, void *za_state,
> unsigned int sme_vl, u64 *svcr,
> -   enum fp_state *type)
> +   enum fp_state *type, enum fp_state to_save)

OK, how many discrete arguments are we going to pass to this function,
which most of them are part the vcpu structure? It really feels like
what you want is a getter for the per-cpu structure, and let the KVM
code do the actual business. If this function was supposed to provide
some 
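
A rough sketch of that suggestion, for illustration only: export a getter for
the per-CPU structure and let KVM fill the fields in directly.  The struct
field names beyond those quoted above, and the vcpu field names, are assumed
rather than taken from this series:

struct fpsimd_last_state_struct *fpsimd_last_state_get(void)
{
	WARN_ON(!have_cpu_fpsimd_context());

	return this_cpu_ptr(&fpsimd_last_state);
}

and on the KVM side, instead of the ever-growing fpsimd_bind_state_to_cpu()
call:

	struct fpsimd_last_state_struct *last = fpsimd_last_state_get();

	last->st = &vcpu->arch.ctxt.fp_regs;
	last->sve_state = vcpu->arch.sve_state;
	last->sve_vl = vcpu->arch.sve_max_vl;
	last->svcr = &vcpu->arch.svcr;
	last->fp_type = &vcpu->arch.fp_type;
	last->to_save = vcpu_has_sve(vcpu) ? FP_STATE_SVE : FP_STATE_FPSIMD;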

Re: [PATCH v7 07/13] KVM: selftests: Add vm->memslots[] and enum kvm_mem_region_type

2022-09-20 Thread Ricardo Koller
On Tue, Sep 20, 2022 at 05:39:19PM +, Sean Christopherson wrote:
> On Tue, Sep 20, 2022, Ricardo Koller wrote:
> > The vm_create() helpers are hardcoded to place most page types (code,
> > page-tables, stacks, etc) in the same memslot #0, and always backed with
> > anonymous 4K.  There are a couple of issues with that.  First, tests 
> > willing to
> 
> Preferred kernel style is to wrap changelogs at ~75 chars, e.g. so that
> `git show` stays under 80 chars.
> 
> And in general, please incorporate checkpatch into your workflow, e.g. there's
> also a spelling mistake below.
> 
>   WARNING: Possible unwrapped commit description (prefer a maximum 75 chars 
> per line)
>   #9: 
>   anonymous 4K.  There are a couple of issues with that.  First, tests 
> willing to
> 
>   WARNING: 'spreaded' may be misspelled - perhaps 'spread'?
>   #12: 
>   the hardcoded assumption of memslot #0 holding most things is spreaded
>   
> 
>   total: 0 errors, 2 warnings, 94 lines checked
> 
> > differ a bit, like placing page-tables in a different backing source type 
> > must
> > replicate much of what's already done by the vm_create() functions.  Second,
> > the hardcoded assumption of memslot #0 holding most things is spreaded
> > everywhere; this makes it very hard to change.
> 
> ...
> 
> > @@ -105,6 +119,13 @@ struct kvm_vm {
> >  struct userspace_mem_region *
> >  memslot2region(struct kvm_vm *vm, uint32_t memslot);
> >  
> > +inline struct userspace_mem_region *
> 
> Should be static inline.
> 
> > +vm_get_mem_region
> 
> Please don't insert newlines before the function name, it makes searching
> painful.  Ignore existing patterns in KVM selftests, they're wrong.  ;-)
> Linus has a nice explanation/rant on this[*].
> 
> The resulting declaration will run long, but at least for me, I'll take that 
> any
> day over putting the function name on a new line.
> 
> [*] 
> https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=ghjsy6nfmub_wa8fyd5ilbnxjo...@mail.gmail.com
> 
> 
> > (struct kvm_vm *vm, enum kvm_mem_region_type mrt)
> 
> One last nit, what about "region" or "type" instead of "mrt"?  The acronym 
> made me
> briefly pause to figure out what "mrt" meant, which is silly because the name 
> really
> doesn't have much meaning.
> 
> > +{
> > +   assert(mrt < NR_MEM_REGIONS);
> > +   return memslot2region(vm, vm->memslots[mrt]);
> > +}
> 
> ...
> 
> > @@ -293,8 +287,16 @@ struct kvm_vm *__vm_create(enum vm_guest_mode mode, 
> > uint32_t nr_runnable_vcpus,
> > uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
> >  nr_extra_pages);
> > struct kvm_vm *vm;
> > +   int i;
> > +
> > +   pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
> > +vm_guest_mode_string(mode), nr_pages);
> > +
> > +   vm = vm_create(mode);
> > +   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 
> > 0);
> 
> The spacing is weird here.  Adding the region and stuffing vm->memslots are 
> what
> should be bundled together, not creating the VM and adding the common region. 
>  I.e.
> 
>   pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
>vm_guest_mode_string(mode), nr_pages);
> 
>   vm = vm_create(mode);
> 
>   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 
> 0);
>   for (i = 0; i < NR_MEM_REGIONS; i++)
>   vm->memslots[i] = 0;
> 
> >  
> > -   vm = vm_create(mode, nr_pages);
> > +   for (i = 0; i < NR_MEM_REGIONS; i++)
> > +   vm->memslots[i] = 0;
> >  
> > kvm_vm_elf_load(vm, program_invocation_name);
> >  
> > -- 
> > 2.37.3.968.ga6b4b080e4-goog
> > 

Ack on all the above. Will send a v8 later today.

Thanks!
Ricardo


Re: [PATCH v7 07/13] KVM: selftests: Add vm->memslots[] and enum kvm_mem_region_type

2022-09-20 Thread Sean Christopherson
On Tue, Sep 20, 2022, Ricardo Koller wrote:
> The vm_create() helpers are hardcoded to place most page types (code,
> page-tables, stacks, etc) in the same memslot #0, and always backed with
> anonymous 4K.  There are a couple of issues with that.  First, tests willing 
> to

Preferred kernel style is to wrap changelogs at ~75 chars, e.g. so that
`git show` stays under 80 chars.

And in general, please incorporate checkpatch into your workflow, e.g. there's
also a spelling mistake below.

  WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
  #9: 
  anonymous 4K.  There are a couple of issues with that.  First, tests willing 
to

  WARNING: 'spreaded' may be misspelled - perhaps 'spread'?
  #12: 
  the hardcoded assumption of memslot #0 holding most things is spreaded
  

  total: 0 errors, 2 warnings, 94 lines checked

> differ a bit, like placing page-tables in a different backing source type must
> replicate much of what's already done by the vm_create() functions.  Second,
> the hardcoded assumption of memslot #0 holding most things is spreaded
> everywhere; this makes it very hard to change.

...

> @@ -105,6 +119,13 @@ struct kvm_vm {
>  struct userspace_mem_region *
>  memslot2region(struct kvm_vm *vm, uint32_t memslot);
>  
> +inline struct userspace_mem_region *

Should be static inline.

> +vm_get_mem_region

Please don't insert newlines before the function name, it makes searching 
painful.
Ignore existing patterns in KVM selftests, they're wrong.  ;-)  Linus has a nice
explanation/rant on this[*].

The resulting declaration will run long, but at least for me, I'll take that any
day over putting the function name on a new line.

[*] 
https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=ghjsy6nfmub_wa8fyd5ilbnxjo...@mail.gmail.com


> (struct kvm_vm *vm, enum kvm_mem_region_type mrt)

One last nit, what about "region" or "type" instead of "mrt"?  The acronym made 
me
briefly pause to figure out what "mrt" meant, which is silly because the name 
really
doesn't have much meaning.

> +{
> + assert(mrt < NR_MEM_REGIONS);
> + return memslot2region(vm, vm->memslots[mrt]);
> +}
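
Putting those three comments together (static inline, no newline before the
function name, and "type" instead of "mrt"), the helper would end up roughly
like this; a sketch of the suggested shape, not the final merged code:

static inline struct userspace_mem_region *vm_get_mem_region(struct kvm_vm *vm,
							      enum kvm_mem_region_type type)
{
	assert(type < NR_MEM_REGIONS);
	return memslot2region(vm, vm->memslots[type]);
}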

...

> @@ -293,8 +287,16 @@ struct kvm_vm *__vm_create(enum vm_guest_mode mode, 
> uint32_t nr_runnable_vcpus,
>   uint64_t nr_pages = vm_nr_pages_required(mode, nr_runnable_vcpus,
>nr_extra_pages);
>   struct kvm_vm *vm;
> + int i;
> +
> + pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
> +  vm_guest_mode_string(mode), nr_pages);
> +
> + vm = vm_create(mode);
> + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 
> 0);

The spacing is weird here.  Adding the region and stuffing vm->memslots are what
should be bundled together, not creating the VM and adding the common region.  
I.e.

pr_debug("%s: mode='%s' pages='%ld'\n", __func__,
 vm_guest_mode_string(mode), nr_pages);

vm = vm_create(mode);

vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 
0);
for (i = 0; i < NR_MEM_REGIONS; i++)
vm->memslots[i] = 0;

>  
> - vm = vm_create(mode, nr_pages);
> + for (i = 0; i < NR_MEM_REGIONS; i++)
> + vm->memslots[i] = 0;
>  
>   kvm_vm_elf_load(vm, program_invocation_name);
>  
> -- 
> 2.37.3.968.ga6b4b080e4-goog
> 


Re: [PATCH v3 2/7] arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE

2022-09-20 Thread Marc Zyngier
On Mon, 15 Aug 2022 23:55:24 +0100,
Mark Brown  wrote:
> 
> When we save the state for the floating point registers this can be done
> in the form visible through either the FPSIMD V registers or the SVE Z and
> P registers. At present we track which format is currently used based on
> TIF_SVE and the SME streaming mode state but particularly in the SVE case
> this limits our options for optimising things, especially around syscalls.
> Introduce a new enum in thread_struct which explicitly states which format
> is active and keep it up to date when we change it.
> 
> At present we do not use this state except to verify that it has the
> expected value when loading the state, future patches will introduce
> functional changes.
> 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/include/asm/fpsimd.h|  2 +-
>  arch/arm64/include/asm/kvm_host.h  |  1 +
>  arch/arm64/include/asm/processor.h |  6 
>  arch/arm64/kernel/fpsimd.c | 58 ++
>  arch/arm64/kernel/process.c|  2 ++
>  arch/arm64/kernel/ptrace.c |  3 ++
>  arch/arm64/kernel/signal.c |  7 +++-
>  arch/arm64/kvm/fpsimd.c|  3 +-
>  8 files changed, 64 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index c07e4abaca3d..b74103a79052 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -61,7 +61,7 @@ extern void fpsimd_kvm_prepare(void);
>  extern void fpsimd_bind_state_to_cpu(struct user_fpsimd_state *state,
>void *sve_state, unsigned int sve_vl,
>void *za_state, unsigned int sme_vl,
> -  u64 *svcr);
> +  u64 *svcr, enum fp_state *type);
>  
>  extern void fpsimd_flush_task_state(struct task_struct *target);
>  extern void fpsimd_save_and_flush_cpu_state(void);
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index f38ef299f13b..ebd37f97aeb4 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -310,6 +310,7 @@ struct kvm_vcpu_arch {
>   void *sve_state;
>   unsigned int sve_max_vl;
>   u64 svcr;
> + enum fp_state fp_type;

Is it a state or a type? Some consistency would help. Also, what does
this represent? Your commit message keeps talking about the FP/SVE
state for the host, but this is obviously a guest-related structure.
How do the two relate?

Finally, before this patch, pahole shows this:

struct kvm_vcpu_arch {
struct kvm_cpu_context ctxt; /* 0  1824 */
/* --- cacheline 28 boundary (1792 bytes) was 32 bytes ago --- */
void * sve_state;/*  1824 8 */
unsigned int   sve_max_vl;   /*  1832 4 */

/* XXX 4 bytes hole, try to pack */

u64svcr; /*  1840 8 */
struct kvm_s2_mmu *hw_mmu;   /*  1848 8 */
[...]
};

After it, we gain an extra hole:

struct kvm_vcpu_arch {
struct kvm_cpu_context ctxt; /* 0  1824 */
/* --- cacheline 28 boundary (1792 bytes) was 32 bytes ago --- */
void * sve_state;/*  1824 8 */
unsigned int   sve_max_vl;   /*  1832 4 */

/* XXX 4 bytes hole, try to pack */

u64svcr; /*  1840 8 */
enum fp_state  fp_type;  /*  1848 4 */

/* XXX 4 bytes hole, try to pack */

/* --- cacheline 29 boundary (1856 bytes) --- */
struct kvm_s2_mmu *hw_mmu;   /*  1856 8 */
[...]
};

Packing things wouldn't hurt.
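
One way to do that, going by the pahole output above, is to let the new 4-byte
field reuse the hole that already exists after sve_max_vl rather than open a
second one; a sketch of the field ordering only:

	void *sve_state;
	unsigned int sve_max_vl;
	enum fp_state fp_type;		/* fills the existing 4-byte hole */
	u64 svcr;
	/* Stage 2 paging state used by the hardware on next switch */
	struct kvm_s2_mmu *hw_mmu;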

>  
>   /* Stage 2 paging state used by the hardware on next switch */
>   struct kvm_s2_mmu *hw_mmu;
> diff --git a/arch/arm64/include/asm/processor.h 
> b/arch/arm64/include/asm/processor.h
> index 86eb0bfe3b38..4818a6b77f39 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -122,6 +122,11 @@ enum vec_type {
>   ARM64_VEC_MAX,
>  };
>  
> +enum fp_state {
> + FP_STATE_FPSIMD,
> + FP_STATE_SVE,
> +};
> +
>  struct cpu_context {
>   unsigned long x19;
>   unsigned long x20;
> @@ -152,6 +157,7 @@ struct thread_struct {
>   struct user_fpsimd_state fpsimd_state;
>   } uw;
>  
> + enum fp_state   fp_type;/* registers FPSIMD or SVE? */

Same comment about the state vs type.

>   unsigned intfpsimd_cpu;
>   void*sve_state; /* SVE registers, if any */
>   void*za_state;  /* ZA register, if any */
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 

Re: [PATCH v7 09/13] KVM: selftests: aarch64: Add aarch64/page_fault_test

2022-09-20 Thread Ricardo Koller
On Tue, Sep 20, 2022 at 10:05:15AM +0200, Andrew Jones wrote:
> On Tue, Sep 20, 2022 at 04:25:47AM +, Ricardo Koller wrote:
> > Add a new test for stage 2 faults when using different combinations of
> > guest accesses (e.g., write, S1PTW), backing source type (e.g., anon)
> > and types of faults (e.g., read on hugetlbfs with a hole). The next
> > commits will add different handling methods and more faults (e.g., uffd
> > and dirty logging). This first commit starts by adding two sanity checks
> > for all types of accesses: AF setting by the hw, and accessing memslots
> > with holes.
> > 
> > Signed-off-by: Ricardo Koller 
> > ---
> >  tools/testing/selftests/kvm/.gitignore|   1 +
> >  tools/testing/selftests/kvm/Makefile  |   1 +
> >  .../selftests/kvm/aarch64/page_fault_test.c   | 601 ++
> >  .../selftests/kvm/include/aarch64/processor.h |   8 +
> >  4 files changed, 611 insertions(+)
> >  create mode 100644 tools/testing/selftests/kvm/aarch64/page_fault_test.c
> > 
> > diff --git a/tools/testing/selftests/kvm/.gitignore 
> > b/tools/testing/selftests/kvm/.gitignore
> > index d625a3f83780..7a9022cfa033 100644
> > --- a/tools/testing/selftests/kvm/.gitignore
> > +++ b/tools/testing/selftests/kvm/.gitignore
> > @@ -3,6 +3,7 @@
> >  /aarch64/debug-exceptions
> >  /aarch64/get-reg-list
> >  /aarch64/hypercalls
> > +/aarch64/page_fault_test
> >  /aarch64/psci_test
> >  /aarch64/vcpu_width_config
> >  /aarch64/vgic_init
> > diff --git a/tools/testing/selftests/kvm/Makefile 
> > b/tools/testing/selftests/kvm/Makefile
> > index 1bb471aeb103..850e317b9e82 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -149,6 +149,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/arch_timer
> >  TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
> >  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
> >  TEST_GEN_PROGS_aarch64 += aarch64/hypercalls
> > +TEST_GEN_PROGS_aarch64 += aarch64/page_fault_test
> >  TEST_GEN_PROGS_aarch64 += aarch64/psci_test
> >  TEST_GEN_PROGS_aarch64 += aarch64/vcpu_width_config
> >  TEST_GEN_PROGS_aarch64 += aarch64/vgic_init
> > diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c 
> > b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> > new file mode 100644
> > index ..98f12e7af14b
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> > @@ -0,0 +1,601 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * page_fault_test.c - Test stage 2 faults.
> > + *
> > + * This test tries different combinations of guest accesses (e.g., write,
> > + * S1PTW), backing source type (e.g., anon) and types of faults (e.g., 
> > read on
> > + * hugetlbfs with a hole). It checks that the expected handling method is
> > + * called (e.g., uffd faults with the right address and write/read flag).
> > + */
> > +
> > +#define _GNU_SOURCE
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include "guest_modes.h"
> > +#include "userfaultfd_util.h"
> > +
> > +/* Guest virtual addresses that point to the test page and its PTE. */
> > +#define TEST_GVA   0xc000
> > +#define TEST_EXEC_GVA  (TEST_GVA + 0x8)
> > +#define TEST_PTE_GVA   0xb000
> > +#define TEST_DATA  0x0123456789ABCDEF
> > +
> > +static uint64_t *guest_test_memory = (uint64_t *)TEST_GVA;
> > +
> > +#define CMD_NONE   (0)
> > +#define CMD_SKIP_TEST  (1ULL << 1)
> > +#define CMD_HOLE_PT(1ULL << 2)
> > +#define CMD_HOLE_DATA  (1ULL << 3)
> > +
> > +#define PREPARE_FN_NR  10
> > +#define CHECK_FN_NR10
> > +
> > +struct test_desc {
> > +   const char *name;
> > +   uint64_t mem_mark_cmd;
> > +   /* Skip the test if any prepare function returns false */
> > +   bool (*guest_prepare[PREPARE_FN_NR])(void);
> > +   void (*guest_test)(void);
> > +   void (*guest_test_check[CHECK_FN_NR])(void);
> > +   void (*dabt_handler)(struct ex_regs *regs);
> > +   void (*iabt_handler)(struct ex_regs *regs);
> > +   uint32_t pt_memslot_flags;
> > +   uint32_t data_memslot_flags;
> > +   bool skip;
> > +};
> > +
> > +struct test_params {
> > +   enum vm_mem_backing_src_type src_type;
> > +   struct test_desc *test_desc;
> > +};
> > +
> > +static inline void flush_tlb_page(uint64_t vaddr)
> > +{
> > +   uint64_t page = vaddr >> 12;
> > +
> > +   dsb(ishst);
> > +   asm volatile("tlbi vaae1is, %0" :: "r" (page));
> > +   dsb(ish);
> > +   isb();
> > +}
> > +
> > +static void guest_write64(void)
> > +{
> > +   uint64_t val;
> > +
> > +   WRITE_ONCE(*guest_test_memory, TEST_DATA);
> > +   val = READ_ONCE(*guest_test_memory);
> > +   GUEST_ASSERT_EQ(val, TEST_DATA);
> > +}
> > +
> > +/* Check the system for 

Re: [PATCH v3 3/7] mm: Add PG_arch_3 page flag

2022-09-20 Thread Catalin Marinas
On Tue, Sep 20, 2022 at 05:33:42PM +0100, Marc Zyngier wrote:
> On Tue, 20 Sep 2022 16:39:47 +0100,
> Catalin Marinas  wrote:
> > On Mon, Sep 19, 2022 at 07:12:53PM +0100, Marc Zyngier wrote:
> > > On Mon, 05 Sep 2022 18:01:55 +0100,
> > > Catalin Marinas  wrote:
> > > > Peter, please let me know if you want to pick this series up together
> > > > with your other KVM patches. Otherwise I can post it separately, it's
> > > > worth merging it on its own as it clarifies the page flag vs tag setting
> > > > ordering.
> > > 
> > > I'm looking at queuing this, but I'm confused by this comment. Do I
> > > need to pick this as part of the series? Or is this an independent
> > > thing (my hunch is that it is actually required not to break other
> > > architectures...).
> > 
> > This series (at least the first patches) won't apply cleanly on
> > top of 6.0-rc1 and, of course, we shouldn't break other architectures. I
> > can repost the whole series but I don't have the setup to test the
> > MAP_SHARED KVM option (unless Peter plans to post it soon).
> 
> I don't feel brave enough to take a series affecting all architectures

It shouldn't affect the others, the only change is that PG_arch_2 is now
only defined for arm64 but no other architecture is using it. The
problem with loongarch is that it doesn't have enough spare bits in
page->flags and even without any patches I think it's broken with the
right value for NR_CPUS.

> so late in the game, and the whole thing had very little arm64
> exposure. The latest QEMU doesn't seem to work anymore, so I don't
> have any MTE-capable emulation (and using the FVP remotely is a pain
> in the proverbial neck).
> 
> I'll come back to this after the merge window, should Peter decide to
> respin the series.

It makes sense.

-- 
Catalin


Re: [PATCH v3 1/7] KVM: arm64: Discard any SVE state when entering KVM guests

2022-09-20 Thread Marc Zyngier
On Mon, 15 Aug 2022 23:55:23 +0100,
Mark Brown  wrote:
> 
> Since 8383741ab2e773a99 (KVM: arm64: Get rid of host SVE tracking/saving)
> KVM has not tracked the host SVE state, relying on the fact that we
> currently disable SVE whenever we perform a syscall. This may not be true
> in future since performance optimisation may result in us keeping SVE
> enabled in order to avoid needing to take access traps to reenable it.
> Handle this by clearing TIF_SVE and converting the stored task state to
> FPSIMD format when preparing to run the guest.  This is done with a new
> call fpsimd_kvm_prepare() to keep the direct state manipulation
> functions internal to fpsimd.c.
> 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/include/asm/fpsimd.h |  1 +
>  arch/arm64/kernel/fpsimd.c  | 23 +++
>  arch/arm64/kvm/fpsimd.c |  3 ++-
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index 6f86b7ab6c28..c07e4abaca3d 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -56,6 +56,7 @@ extern void fpsimd_signal_preserve_current_state(void);
>  extern void fpsimd_preserve_current_state(void);
>  extern void fpsimd_restore_current_state(void);
>  extern void fpsimd_update_current_state(struct user_fpsimd_state const 
> *state);
> +extern void fpsimd_kvm_prepare(void);
>  
>  extern void fpsimd_bind_state_to_cpu(struct user_fpsimd_state *state,
>void *sve_state, unsigned int sve_vl,
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 23834d96d1e7..549e11645e0f 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -1627,6 +1627,29 @@ void fpsimd_signal_preserve_current_state(void)
>   sve_to_fpsimd(current);
>  }
>  
> +/*
> + * Called by KVM when entering the guest.
> + */
> +void fpsimd_kvm_prepare(void)
> +{
> + if (!system_supports_sve())
> + return;
> +
> + /*
> +  * KVM does not save host SVE state since we can only enter
> +  * the guest from a syscall so the ABI means that only the
> +  * non-saved SVE state needs to be saved.  If we have left
> +  * SVE enabled for performance reasons then update the task
> +  * state to be FPSIMD only.
> +  */
> + get_cpu_fpsimd_context();
> +
> + if (test_and_clear_thread_flag(TIF_SVE))
> + sve_to_fpsimd(current);
> +
> + put_cpu_fpsimd_context();
> +}
> +
>  /*
>   * Associate current's FPSIMD context with this cpu
>   * The caller must have ownership of the cpu FPSIMD context before calling
> diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
> index ec8e4494873d..1c1b309ef420 100644
> --- a/arch/arm64/kvm/fpsimd.c
> +++ b/arch/arm64/kvm/fpsimd.c
> @@ -75,7 +75,8 @@ int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu)
>  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
>  {
>   BUG_ON(!current->mm);
> - BUG_ON(test_thread_flag(TIF_SVE));
> +
> + fpsimd_kvm_prepare();

Why is this *before* the check against system_supports_fpsimd()? I
don't think the architecture allows SVE without FP, for obvious
reasons...

>  
>   if (!system_supports_fpsimd())
>   return;

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v3 3/7] mm: Add PG_arch_3 page flag

2022-09-20 Thread Marc Zyngier
On Tue, 20 Sep 2022 16:39:47 +0100,
Catalin Marinas  wrote:
> 
> On Mon, Sep 19, 2022 at 07:12:53PM +0100, Marc Zyngier wrote:
> > On Mon, 05 Sep 2022 18:01:55 +0100,
> > Catalin Marinas  wrote:
> > > Peter, please let me know if you want to pick this series up together
> > > with your other KVM patches. Otherwise I can post it separately, it's
> > > worth merging it on its own as it clarifies the page flag vs tag setting
> > > ordering.
> > 
> > I'm looking at queuing this, but I'm confused by this comment. Do I
> > need to pick this as part of the series? Or is this an independent
> > thing (my hunch is that it is actually required not to break other
> > architectures...).
> 
> This series (at least the first patches) won't apply cleanly on
> top of 6.0-rc1 and, of course, we shouldn't break other architectures. I
> can repost the whole series but I don't have the setup to test the
> MAP_SHARED KVM option (unless Peter plans to post it soon).

I don't feel brave enough to take a series affecting all architectures
so late in the game, and the whole thing had very little arm64
exposure. The latest QEMU doesn't seem to work anymore, so I don't
have any MTE-capable emulation (and using the FVP remotely is a pain
in the proverbial neck).

I'll come back to this after the merge window, should Peter decide to
respin the series.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH kernel v2] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

2022-09-20 Thread Marc Zyngier
On Tue, 20 Sep 2022 13:51:43 +0100,
Alexey Kardashevskiy  wrote:
> 
> When introduced, IRQFD resampling worked on POWER8 with XICS. However
> KVM on POWER9 has never implemented it - the compatibility mode code
> ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
> XIVE mode does not handle INTx in KVM at all.
> 
> This moved the capability support advertising to platforms and stops
> advertising it on XIVE, i.e. POWER9 and later.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Nicholas Piggin 
> [For KVM RISC-V]
> Acked-by: Anup Patel 
> ---
> Changes:
> v2:
> * removed ifdef for ARM64.

The same argument applies to both x86 and s390, which do select
HAVE_KVM_IRQFD from the KVM config option. Only power allows this
option to be selected depending on the underlying interrupt controller
emulation.

As for riscv and mips, they don't select HAVE_KVM_IRQFD, and this
isn't a user-selectable option. So why do they get patched at all?

My conclusion is that:

- only power needs the #ifdefery in the arch-specific code
- arm64, s390 and x86 can use KVM_CAP_IRQFD_RESAMPLE without a #ifdef
- mips and riscv should be left alone

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v3 3/7] mm: Add PG_arch_3 page flag

2022-09-20 Thread Catalin Marinas
On Mon, Sep 19, 2022 at 07:12:53PM +0100, Marc Zyngier wrote:
> On Mon, 05 Sep 2022 18:01:55 +0100,
> Catalin Marinas  wrote:
> > Peter, please let me know if you want to pick this series up together
> > with your other KVM patches. Otherwise I can post it separately, it's
> > worth merging it on its own as it clarifies the page flag vs tag setting
> > ordering.
> 
> I'm looking at queuing this, but I'm confused by this comment. Do I
> need to pick this as part of the series? Or is this an independent
> thing (my hunch is that it is actually required not to break other
> architectures...).

This series (at least the first patches) won't apply cleanly on
top of 6.0-rc1 and, of course, we shouldn't break other architectures. I
can repost the whole series but I don't have the setup to test the
MAP_SHARED KVM option (unless Peter plans to post it soon).

-- 
Catalin


Re: [kvm-unit-tests RFC PATCH 05/19] lib/alloc_phys: Remove locking

2022-09-20 Thread Andrew Jones
On Tue, Sep 20, 2022 at 02:20:48PM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Sep 20, 2022 at 10:45:53AM +0200, Andrew Jones wrote:
> > On Tue, Aug 09, 2022 at 10:15:44AM +0100, Alexandru Elisei wrote:
> > > With powerpc moving the page allocator, there are no architectures left
> > > which use the physical allocator after the boot setup:  arm, arm64,
> > > s390x and powerpc drain the physical allocator to initialize the page
> > > allocator; and x86 calls setup_vm() to drain the allocator for each of
> > > the tests that allocate memory.
> > 
> > Please put the motivation for this change in the commit message. I looked
> > ahead at the next patch to find it, but I'm not sure I agree with it. We
> > should be able to keep the locking even when used early, since we probably
> > need our locking to be something we can use early elsewhere anyway.
> 
> You are correct, the commit message doesn't explain why locking is removed,
> which makes the commit confusing. I will try to do a better job for the
> next iteration (if we decide to keep this patch).
> 
> I removed locking because the physical allocator by the end of the series
> will end up being used only by arm64 to create the idmap, which is done on

If only arm, and no unit tests, needs the phys allocator, then it can be
integrated with whatever arm is using it for and removed from the general
lib.

> the boot CPU and with the MMU off. After that, the translation table
> allocator functions will use the page allocator, which can be used
> concurrently.
> 
> Looking at the spinlock implementation, spin_lock() doesn't protect from
> the concurrent accesses when the MMU is disabled (lock->v is
> unconditionally set to 1). Which means that spin_lock() does not work (in
> the sense that it doesn't protect against concurrent accesses) on the boot
> path, which doesn't need a spinlock anyway, because no secondary CPUs are
> online yet. It also means that spinlocks don't work when
> AUXINFO_MMU_OFF is set. So for the purpose of simplicity I preferred to
> drop it entirely.

If other architectures or unit tests have / could have uses for the
phys allocator then we should either document that it doesn't have
locks or keep the locks, and arm will just know that they don't work,
but also that they don't need to for its purposes.

Finally, if we drop the locks and arm doesn't have any other places where
we use locks without the MMU enabled, then we can change the lock
implementation to not have the no-mmu fallback - maybe by switching to the
generic implementation as the other architectures have done.

Thanks,
drew


Re: [kvm-unit-tests RFC PATCH 05/19] lib/alloc_phys: Remove locking

2022-09-20 Thread Alexandru Elisei
Hi,

On Tue, Sep 20, 2022 at 10:45:53AM +0200, Andrew Jones wrote:
> On Tue, Aug 09, 2022 at 10:15:44AM +0100, Alexandru Elisei wrote:
> > With powerpc moving the page allocator, there are no architectures left
> > which use the physical allocator after the boot setup:  arm, arm64,
> > s390x and powerpc drain the physical allocator to initialize the page
> > allocator; and x86 calls setup_vm() to drain the allocator for each of
> > the tests that allocate memory.
> 
> Please put the motivation for this change in the commit message. I looked
> ahead at the next patch to find it, but I'm not sure I agree with it. We
> should be able to keep the locking even when used early, since we probably
> need our locking to be something we can use early elsewhere anyway.

You are correct, the commit message doesn't explain why locking is removed,
which makes the commit confusing. I will try to do a better job for the
next iteration (if we decide to keep this patch).

I removed locking because the physical allocator by the end of the series
will end up being used only by arm64 to create the idmap, which is done on
the boot CPU and with the MMU off. After that, the translation table
allocator functions will use the page allocator, which can be used
concurrently.

Looking at the spinlock implementation, spin_lock() doesn't protect from
the concurrent accesses when the MMU is disabled (lock->v is
unconditionally set to 1). Which means that spin_lock() does not work (in
the sense that it doesn't protect against concurrent accesses) on the boot
path, which doesn't need a spinlock anyway, because no secondary CPUs are
online yet. It also means that spinlocks don't work when
AUXINFO_MMU_OFF is set. So for the purpose of simplicity I preferred to
drop it entirely.
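
For reference, the behaviour described above boils down to a lock routine of
roughly this shape (a simplified sketch: the real lib/arm implementation uses
hand-written exclusives and barriers rather than the compiler builtin shown
here):

void spin_lock(struct spinlock *lock)
{
	if (!mmu_enabled()) {
		/*
		 * Exclusive accesses need the MMU on, so with the MMU off
		 * the lock is only marked as taken: this path provides no
		 * mutual exclusion at all.
		 */
		lock->v = 1;
		return;
	}

	/* MMU on: ordinary test-and-set acquire loop. */
	while (__sync_lock_test_and_set(&lock->v, 1))
		;
}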

Thanks,
Alex

> 
> Thanks,
> drew


Re: [kvm-unit-tests RFC PATCH 06/19] lib/alloc_phys: Remove allocation accounting

2022-09-20 Thread Alexandru Elisei
Hi,

On Tue, Sep 20, 2022 at 10:40:47AM +0200, Andrew Jones wrote:
> On Tue, Aug 09, 2022 at 10:15:45AM +0100, Alexandru Elisei wrote:
> > The page allocator has better allocation tracking and is used by all
> > architectures, while the physical allocator is now never used for
> > allocating memory.
> > 
> > Simplify the physical allocator by removing allocation accounting. This
> > accomplishes two things:
> > 
> > 1. It makes the allocator more useful, as the warning that was displayed
> > for each allocation after the 256th is removed.
> > 
> > 2. Together with the lock removal, the physical allocator becomes more
> > appealing as a very early allocator, when using the page allocator might
> > not be desirable or feasible.
> 
> How does the locking cause problems when used in an early allocator?

By "early allocator" I mean here an allocator that can be used with the MMU
off.

The "desirable or feasible" part refers to the fact that the page allocator
cannot be used as an early allocator (when the MMU is off) because 1. It
doesn't do the necessary cache maintenance operations and 2. It would be
hard to add them, as the internal structures that the page allocator
maintains are significantly more complex than what the physical allocator
uses.

With this part: "together with the lock removal, the physical allocator
becomes more appealing as a very early allocator [..]" I was trying to say
that the physical allocator has now become as simple as it can possibly be
(well, align_min could also be removed and leave it up to the calling code
to request correctly aligned allocations but it's debatable if users of the
allocator should know about how it's implemented). I can reword or remove
this part if you feel it's confusing.

Thanks,
Alex

> 
> > 
> > Also, phys_alloc_show() has received a slight change in the way it displays
> > the used and free regions: the end of the region is now non-inclusive, to
> > allow phys_alloc_show() to express that no memory has been used, or no
> > memory is free, in which case the start and the end addresses are equal.
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> >  lib/alloc_phys.c | 65 ++--
> >  lib/alloc_phys.h |  5 ++--
> >  2 files changed, 21 insertions(+), 49 deletions(-)
> >
> 
> Reviewed-by: Andrew Jones 


[PATCH kernel v2] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE support platform dependent

2022-09-20 Thread Alexey Kardashevskiy
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.

This moved the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Nicholas Piggin 
[For KVM RISC-V]
Acked-by: Anup Patel 
---
Changes:
v2:
* removed ifdef for ARM64.
---
 arch/arm64/kvm/arm.c   | 1 +
 arch/mips/kvm/mips.c   | 3 +++
 arch/powerpc/kvm/powerpc.c | 6 ++
 arch/riscv/kvm/vm.c| 3 +++
 arch/s390/kvm/kvm-s390.c   | 3 +++
 arch/x86/kvm/x86.c | 3 +++
 virt/kvm/kvm_main.c| 1 -
 7 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2ff0ef62abad..d2daa4d375b5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -218,6 +218,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
+   case KVM_CAP_IRQFD_RESAMPLE:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..0f3de470a73e 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_READONLY_MEM:
case KVM_CAP_SYNC_MMU:
case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
r = 1;
break;
case KVM_CAP_NR_VCPUS:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index fb1490761c87..908ce8bd91c9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -593,6 +593,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
break;
 #endif
 
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+   r = !xive_enabled();
+   break;
+#endif
+
case KVM_CAP_PPC_ALLOC_HTAB:
r = hv_enabled;
break;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index 65a964d7e70d..0ef7a6168018 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -65,6 +65,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_READONLY_MEM:
case KVM_CAP_MP_STATE:
case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
r = 1;
break;
case KVM_CAP_NR_VCPUS:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index edfd4bbd0cba..03037b0d1cc8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -577,6 +577,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_S390_DIAG318:
case KVM_CAP_S390_MEM_OP_EXTENSION:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43a6a7efc6ec..efcc02230226 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4395,6 +4395,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_VAPIC:
case KVM_CAP_ENABLE_CAP:
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+   case KVM_CAP_IRQFD_RESAMPLE:
+#endif
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 584a5bab3af3..05cf94013f02 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4447,7 +4447,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct 
kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
case KVM_CAP_IRQFD:
-   case KVM_CAP_IRQFD_RESAMPLE:
 #endif
case KVM_CAP_IOEVENTFD_ANY_LENGTH:
case KVM_CAP_CHECK_EXTENSION_VM:
-- 
2.37.3



Re: [kvm-unit-tests PATCH v4 07/12] arm: pmu: Basic event counter Tests

2022-09-20 Thread Zenghui Yu

Hi Eric,

On 2022/9/20 17:23, Eric Auger wrote:
> Hi Zenghui,
>
> On 9/19/22 16:30, Zenghui Yu wrote:
>> Hi Eric,
>>
>> A few comments when looking through the PMU test code (2 years after
>> the series was merged).
>
> Thank you for reviewing even after this time! Do you want to address the
> issues yourself and send a patch series or do you prefer I proceed?

It'd be great if you could help to proceed. I'm afraid that I don't
have enough time to deal with it in the next few days.

Thanks,
Zenghui


Re: [kvm-unit-tests RFC PATCH 16/19] arm/arm64: Allocate secondaries' stack using the page allocator

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:55AM +0100, Alexandru Elisei wrote:
> The vmalloc allocator returns non-id mapped addresses, where the virtual
> address is different than the physical address. This makes it impossible
> to access the stack of the secondary CPUs while the MMU is disabled.
> 
> On arm, THREAD_SIZE is 16K and PAGE_SIZE is 4K, which makes THREAD_SIZE
> a power of two multiple of PAGE_SIZE. On arm64, THREAD_SIZE is 16K when
> PAGE_SIZE is 4K or 16K, and 64K when PAGE_SIZE is 64K. In all cases,
> THREAD_SIZE is a power of two multiple of PAGE_SIZE. As a result, using
> memalign_pages() for the stack won't lead to wasted memory.
> 
> memalign_pages() allocates memory in chunks of power of two number of
> pages, aligned to the allocation size, which makes it a drop-in
> replacement for vm_memalign (which is the value for alloc_ops->memalign
> when the stack is allocated).
> 
> Using memalign_pages() has two distinct benefits:
> 
> 1. The secondary CPUs' stack can be used with the MMU off.
> 
> 2. The secondary CPUs' stack is identity mapped, similar to the stack for
> the primary CPU, which makes the configuration of the CPUs consistent.
> 
> memalign_pages_flags() has been used instead of memalign_pages() to
> instruct the allocator not to zero the stack, as it's already zeroed in the
> entry code.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/arm/asm/thread_info.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/arm/asm/thread_info.h b/lib/arm/asm/thread_info.h
> index eaa72582af86..190e082cbba0 100644
> --- a/lib/arm/asm/thread_info.h
> +++ b/lib/arm/asm/thread_info.h
> @@ -25,6 +25,7 @@
>  #ifndef __ASSEMBLY__
>  #include 
>  #include 
> +#include 
>  
>  #ifdef __arm__
>  #include 
> @@ -40,7 +41,7 @@
>  
>  static inline void *thread_stack_alloc(void)
>  {
> - void *sp = memalign(THREAD_ALIGNMENT, THREAD_SIZE);
> + void *sp = memalign_pages_flags(THREAD_ALIGNMENT, THREAD_SIZE, 
> FLAG_DONTZERO);
>   return sp + THREAD_START_SP;
>  }
>  
> -- 
> 2.37.1
>

Reviewed-by: Andrew Jones 


Re: [kvm-unit-tests RFC PATCH 13/19] arm: page.h: Add missing libcflat.h include

2022-09-20 Thread Andrew Jones

I guess this should be squashed into one of the early patches in this
series since we don't have this issue with the current code.

Thanks,
drew


On Tue, Aug 09, 2022 at 10:15:52AM +0100, Alexandru Elisei wrote:
> Include libcflat from page.h to avoid error like this one:
> 
> /path/to/kvm-unit-tests/lib/asm/page.h:19:9: error: unknown type name ‘u64’
>19 | typedef u64 pteval_t;
>   | ^~~
> [..]
> /path/to/kvm-unit-tests/lib/asm/page.h:47:8: error: unknown type name 
> ‘phys_addr_t’
>47 | extern phys_addr_t __virt_to_phys(unsigned long addr);
>   |^~~
>   | ^~~
> [..]
> /path/to/kvm-unit-tests/lib/asm/page.h:50:47: error: unknown type name 
> ‘size_t’
>50 | extern void *__ioremap(phys_addr_t phys_addr, size_t size);
> 
> The arm64 version of the header already includes libcflat since commit
> a2d06852fe59 ("arm64: Add support for configuring the translation
> granule").
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/arm/asm/page.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/lib/arm/asm/page.h b/lib/arm/asm/page.h
> index 8eb4a883808e..0a46bda018c7 100644
> --- a/lib/arm/asm/page.h
> +++ b/lib/arm/asm/page.h
> @@ -8,6 +8,8 @@
>  
>  #include 
>  
> +#include 
> +
>  #define PAGE_SHIFT   12
>  #define PAGE_SIZE(_AC(1,UL) << PAGE_SHIFT)
>  #define PAGE_MASK(~(PAGE_SIZE-1))
> -- 
> 2.37.1
> 


Re: [kvm-unit-tests RFC PATCH 12/19] arm/arm64: assembler.h: Replace size with end address for dcache_by_line_op

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:51AM +0100, Alexandru Elisei wrote:
> Commit b5f659be4775 ("arm/arm64: Remove dcache_line_size global
> variable") moved the dcache_by_line_op macro to assembler.h and changed
> it to take the size of the regions instead of the end address as
> parameter. This was done to keep the file in sync with the upstream
> Linux kernel implementation at the time.
> 
> But in both places where the macro is used, the code has the start and
> end address of the region, and it has to compute the size to pass it to
> dcache_by_line_op. Then the macro itself computes the end by adding size
> to start.
> 
> Get rid of this massaging of parameters and change the macro to take the end
> address as parameter directly.
> 
> Besides slightly simplifying the code by removing two unneeded arithmetic
> operations, this makes the macro compatible with the current upstream
> version of Linux (which was similarly changed to take the end address in
> commit 163d3f80695e ("arm64: dcache_by_line_op to take end parameter
> instead of size")), which will allow us to reuse (part of) the Linux C
> wrappers over the assembly macro.
> 
> The change has been tested with the same snippet of code used to test
> commit 410b3bf09e76 ("arm/arm64: Perform dcache clean + invalidate after
> turning MMU off").
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  arm/cstart.S  |  1 -
>  arm/cstart64.S|  1 -
>  lib/arm/asm/assembler.h   | 11 +--
>  lib/arm64/asm/assembler.h | 11 +--
>  4 files changed, 10 insertions(+), 14 deletions(-)
>

Reviewed-by: Andrew Jones 


Re: [PATCH] KVM: arm64: nvhe: Disable profile optimization

2022-09-20 Thread Marc Zyngier
Hi Denis,

On Tue, 20 Sep 2022 09:20:05 +0100,
Denis Nikitin  wrote:
> 
> Kernel build with -fprofile-sample-use raises the following failure:
> 
> error: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.tmp.o: Unexpected SHT_REL
> section ".rel.llvm.call-graph-profile"

How is this flag provided? I don't see any occurrence of it in the
kernel so far.

> 
> SHT_REL is generated by the latest lld, see
> https://reviews.llvm.org/rGca3bdb57fa1ac98b711a735de048c12b5fdd8086.

Is this part of a released toolchain? If so, can you spell out the
first version where this occurs?

> Disable profile optimization in kvm/nvhe to fix the build with
> AutoFDO.

It'd be good to at least mention how AutoFDO and -fprofile-sample-use
relate to each other.

> 
> Signed-off-by: Denis Nikitin 
> ---
>  arch/arm64/kvm/hyp/nvhe/Makefile | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile 
> b/arch/arm64/kvm/hyp/nvhe/Makefile
> index b5c5119c7396..6a6188374a52 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -89,6 +89,9 @@ quiet_cmd_hypcopy = HYPCOPY $@
>  # Remove ftrace, Shadow Call Stack, and CFI CFLAGS.
>  # This is equivalent to the 'notrace', '__noscs', and '__nocfi' annotations.
>  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_FTRACE) $(CC_FLAGS_SCS) 
> $(CC_FLAGS_CFI), $(KBUILD_CFLAGS))
> +# Profile optimization creates SHT_REL section '.llvm.call-graph-profile' for
> +# the hot code. SHT_REL is currently not supported by the KVM tools.

'KVM tools' seems vague. Maybe call out the actual helper that
processes the relocations?

> +KBUILD_CFLAGS += $(call cc-option,-fno-profile-sample-use,-fno-profile-use)

Why add these options instead of filtering out the offending option, as is
done just above?

Also, is this the only place the kernel fails to compile? The EFI stub
does similar things AFAIR, and could potentially fail the same way.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [kvm-unit-tests RFC PATCH 09/19] arm/arm64: Zero secondary CPUs' stack

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:48AM +0100, Alexandru Elisei wrote:
> For the boot CPU, the entire stack is zeroed in the entry code. For the
> secondaries, only struct thread_info, which lives at the bottom of the
> stack, is zeroed in thread_info_init().
> 
> Be consistent and zero the entire stack for the secondaries. This should
> also improve reproducibility of the testsuite, as all the stacks now start
> with the same contents, which is zero. And now that all the stacks are
> zeroed in the entry code, there is no need to explicitly zero struct
> thread_info in thread_info_init().
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  arm/cstart.S  | 6 ++
>  arm/cstart64.S| 3 +++
>  lib/arm/processor.c   | 1 -
>  lib/arm64/processor.c | 1 -
>  4 files changed, 9 insertions(+), 2 deletions(-)
>

Reviewed-by: Andrew Jones 


Re: [kvm-unit-tests PATCH v4 07/12] arm: pmu: Basic event counter Tests

2022-09-20 Thread Eric Auger
Hi Zenghui,

On 9/19/22 16:30, Zenghui Yu wrote:
> Hi Eric,
>
> A few comments when looking through the PMU test code (2 years after
> the series was merged).

Thank you for reviewing even after this time! Do you want to address the
issues yourself and send a patch series or do you prefer I proceed?

Thanks

Eric
> On 2020/4/3 15:13, Eric Auger wrote:
>> Adds the following tests:
>> - event-counter-config: test event counter configuration
>> - basic-event-count:
>>   - programs counters #0 and #1 to count 2 required events
>>   (resp. CPU_CYCLES and INST_RETIRED). Counter #0 is preset
>>   to a value close enough to the 32b
>>   overflow limit so that we check the overflow bit is set
>>   after the execution of the asm loop.
>> - mem-access: counts MEM_ACCESS event on counters #0 and #1
>>   with and without 32-bit overflow.
>>
>> Signed-off-by: Eric Auger 
>> Reviewed-by: Andre Przywara 
>
> [...]
>
>> diff --git a/arm/pmu.c b/arm/pmu.c
>> index 8c49e50..45dccf7 100644
>> --- a/arm/pmu.c
>> +++ b/arm/pmu.c
>> @@ -38,6 +43,7 @@
>>  #define SW_INCR    0x0
>>  #define INST_RETIRED    0x8
>>  #define CPU_CYCLES    0x11
>> +#define MEM_ACCESS    0x13
>>  #define INST_PREC    0x1B
>
> The spec spells event 0x001B as INST_SPEC.
>
>>  #define STALL_FRONTEND    0x23
>>  #define STALL_BACKEND    0x24
>
> [...]
>
>> @@ -175,6 +188,11 @@ static inline void precise_instrs_loop(int loop,
>> uint32_t pmcr)
>>  }
>>  
>>  #define PMCEID1_EL0 sys_reg(3, 3, 9, 12, 7)
>> +#define PMCNTENSET_EL0 sys_reg(3, 3, 9, 12, 1)
>> +#define PMCNTENCLR_EL0 sys_reg(3, 3, 9, 12, 2)
>> +
>> +#define PMEVTYPER_EXCLUDE_EL1 BIT(31)
>
> Unused macro.
>
>> +#define PMEVTYPER_EXCLUDE_EL0 BIT(30)
>>  
>>  static bool is_event_supported(uint32_t n, bool warn)
>>  {
>> @@ -228,6 +246,224 @@ static void test_event_introspection(void)
>>  report(required_events, "Check required events are implemented");
>>  }
>>  
>> +/*
>> + * Extra instructions inserted by the compiler would be difficult to
>> compensate
>> + * for, so hand assemble everything between, and including, the PMCR
>> accesses
>> + * to start and stop counting. isb instructions are inserted to make
>> sure
>> + * pmccntr read after this function returns the exact instructions
>> executed
>> + * in the controlled block. Loads @loop times the data at @address
>> into x9.
>> + */
>> +static void mem_access_loop(void *addr, int loop, uint32_t pmcr)
>> +{
>> +asm volatile(
>> +    "   msr pmcr_el0, %[pmcr]\n"
>> +    "   isb\n"
>> +    "   mov x10, %[loop]\n"
>> +    "1: sub x10, x10, #1\n"
>> +    "   ldr    x9, [%[addr]]\n"
>> +    "   cmp x10, #0x0\n"
>> +    "   b.gt    1b\n"
>> +    "   msr pmcr_el0, xzr\n"
>> +    "   isb\n"
>> +    :
>> +    : [addr] "r" (addr), [pmcr] "r" (pmcr), [loop] "r" (loop)
>> +    : "x9", "x10", "cc");
>> +}
>> +
>> +static void pmu_reset(void)
>> +{
>> +    /* reset all counters, counting disabled at PMCR level*/
>> +    set_pmcr(pmu.pmcr_ro | PMU_PMCR_LC | PMU_PMCR_C | PMU_PMCR_P);
>> +    /* Disable all counters */
>> +    write_sysreg_s(ALL_SET, PMCNTENCLR_EL0);
>> +    /* clear overflow reg */
>> +    write_sysreg(ALL_SET, pmovsclr_el0);
>> +    /* disable overflow interrupts on all counters */
>> +    write_sysreg(ALL_SET, pmintenclr_el1);
>> +    isb();
>> +}
>> +
>> +static void test_event_counter_config(void)
>> +{
>> +    int i;
>> +
>> +    if (!pmu.nb_implemented_counters) {
>> +    report_skip("No event counter, skip ...");
>> +    return;
>> +    }
>> +
>> +    pmu_reset();
>> +
>> +    /*
>> + * Test setting through PMESELR/PMXEVTYPER and PMEVTYPERn read,
>
> s/PMESELR/PMSELR/ ?
>
>> + * select counter 0
>> + */
>> +    write_sysreg(1, PMSELR_EL0);
>> +    /* program this counter to count unsupported event */
>> +    write_sysreg(0xEA, PMXEVTYPER_EL0);
>> +    write_sysreg(0xdeadbeef, PMXEVCNTR_EL0);
>> +    report((read_regn_el0(pmevtyper, 1) & 0xFFF) == 0xEA,
>> +    "PMESELR/PMXEVTYPER/PMEVTYPERn");
>
> Ditto
>
>> +    report((read_regn_el0(pmevcntr, 1) == 0xdeadbeef),
>> +    "PMESELR/PMXEVCNTR/PMEVCNTRn");
>
> Ditto
>
>> +
>> +    /* try to configure an unsupported event within the range [0x0,
>> 0x3F] */
>> +    for (i = 0; i <= 0x3F; i++) {
>> +    if (!is_event_supported(i, false))
>> +    break;
>> +    }
>> +    if (i > 0x3F) {
>> +    report_skip("pmevtyper: all events within [0x0, 0x3F] are
>> supported");
>> +    return;
>> +    }
>> +
>> +    /* select counter 0 */
>> +    write_sysreg(0, PMSELR_EL0);
>> +    /* program this counter to count unsupported event */
>
> We get the unsupported event number *i* but don't program it into
> PMXEVTYPER_EL0 (or PMEVTYPER0_EL0). Is it intentional?
>
>> +    write_sysreg(i, PMXEVCNTR_EL0);
>> +    /* read the counter value */
>> +    read_sysreg(PMXEVCNTR_EL0);
>> +    report(read_sysreg(PMXEVCNTR_EL0) == i,
>> +    "read of a counter programmed with 

Re: [kvm-unit-tests RFC PATCH 08/19] arm/arm64: Use pgd_alloc() to allocate mmu_idmap

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:47AM +0100, Alexandru Elisei wrote:
> Until commit 031755dbfefb ("arm: enable vmalloc"), the idmap was allocated
> using pgd_alloc(). After that commit, all the page table allocator
> functions were switched to using the page allocator, but pgd_alloc() was
> left unchanged and became unused, with the idmap now being allocated with
> alloc_page().
> 
> For arm64, the pgd table size varies based on the page size, which is
> configured by the user. For arm, it will always contain 4 entries (it
> translates bits 31:30 of the input address). To keep things simple and
> consistent with the other functions and across both architectures, modify
> pgd_alloc() to use alloc_page() instead of memalign like the rest of the
> page table allocator functions and use it to allocate the idmap.
> 
> Note that when the idmap is created, alloc_ops->memalign is
> memalign_pages() which allocates memory with page granularity. Using
> memalign() as before would still have allocated a full page.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/arm/asm/pgtable.h   | 4 ++--
>  lib/arm/mmu.c   | 4 ++--
>  lib/arm64/asm/pgtable.h | 4 ++--
>  3 files changed, 6 insertions(+), 6 deletions(-)
>

Reviewed-by: Andrew Jones 
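
The resulting helper is tiny. Roughly (pgd_t, PTRS_PER_PGD, PAGE_SIZE, assert()
and alloc_page() come from the kvm-unit-tests headers; this is a sketch, not
the verbatim hunk):

static inline pgd_t *pgd_alloc(void)
{
	/* the whole pgd fits in one page on both arm and arm64 */
	assert(PTRS_PER_PGD * sizeof(pgd_t) <= PAGE_SIZE);
	return alloc_page();	/* page allocator, like the other levels */
}

and the idmap setup then simply becomes mmu_idmap = pgd_alloc().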


Re: [kvm-unit-tests RFC PATCH 07/19] arm/arm64: Mark the phys_end parameter as unused in setup_mmu()

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:46AM +0100, Alexandru Elisei wrote:
> phys_end was used to cap the linearly mapped memory to 3G to allow 1G of
> room for the vmalloc area to grow down. This was made useless in commit
> c1cd1a2bed69 ("arm/arm64: mmu: Remove memory layout assumptions"), when
> setup_mmu() was changed to map all the detected memory regions without
> changing their limits.

c1cd1a2bed69 was a start, but as that commit says, the 3G-4G region was
still necessary due to assumptions in the virtual memory allocator. This
patch needs to point out a vmalloc commit which removes that assumption
as well for its justification.

Thanks,
drew

> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/arm/mmu.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/lib/arm/mmu.c b/lib/arm/mmu.c
> index e1a72fe4941f..8f936acafe8b 100644
> --- a/lib/arm/mmu.c
> +++ b/lib/arm/mmu.c
> @@ -153,14 +153,10 @@ void mmu_set_range_sect(pgd_t *pgtable, uintptr_t 
> virt_offset,
>   }
>  }
>  
> -void *setup_mmu(phys_addr_t phys_end, void *unused)
> +void *setup_mmu(phys_addr_t unused0, void *unused1)
>  {
>   struct mem_region *r;
>  
> - /* 3G-4G region is reserved for vmalloc, cap phys_end at 3G */
> - if (phys_end > (3ul << 30))
> - phys_end = 3ul << 30;
> -
>  #ifdef __aarch64__
>   init_alloc_vpage((void*)(4ul << 30));
>  
> -- 
> 2.37.1
> 


Re: [kvm-unit-tests RFC PATCH 05/19] lib/alloc_phys: Remove locking

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:44AM +0100, Alexandru Elisei wrote:
> With powerpc moving to the page allocator, there are no architectures left
> which use the physical allocator after the boot setup:  arm, arm64,
> s390x and powerpc drain the physical allocator to initialize the page
> allocator; and x86 calls setup_vm() to drain the allocator for each of
> the tests that allocate memory.

Please put the motivation for this change in the commit message. I looked
ahead at the next patch to find it, but I'm not sure I agree with it. We
should be able to keep the locking even when used early, since we probably
need our locking to be something we can use early elsewhere anyway.

Thanks,
drew
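
For illustration, a lock that is safe to use that early needs no setup at all;
a minimal sketch (names and implementation are illustrative, not the
kvm-unit-tests spinlock):

struct early_lock {
	int v;
};

static void early_lock_acquire(struct early_lock *lock)
{
	/* spin until we atomically flip 0 -> 1 */
	while (__sync_lock_test_and_set(&lock->v, 1))
		;
}

static void early_lock_release(struct early_lock *lock)
{
	__sync_lock_release(&lock->v);
}

A statically allocated struct early_lock is usable from the first line of C
code, so keeping the locking would not get in the way of early use.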


Re: [kvm-unit-tests RFC PATCH 06/19] lib/alloc_phys: Remove allocation accounting

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:45AM +0100, Alexandru Elisei wrote:
> The page allocator has better allocation tracking and is used by all
> architectures, while the physical allocator is now never used for
> allocating memory.
> 
> Simplify the physical allocator by removing allocation accounting. This
> accomplishes two things:
> 
> 1. It makes the allocator more useful, as the warning that was displayed
> for each allocation after the 256th is removed.
> 
> 2. Together with the lock removal, the physical allocator becomes more
> appealing as a very early allocator, when using the page allocator might
> not be desirable or feasible.

How does the locking cause problems when used in an early allocator?

> 
> Also, phys_alloc_show() has received a slight change in the way it displays
> the used and free regions: the end of the region is now non-inclusive, to
> allow phys_alloc_show() to express that no memory has been used, or no
> memory is free, in which case the start and the end addresses are equal.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/alloc_phys.c | 65 ++--
>  lib/alloc_phys.h |  5 ++--
>  2 files changed, 21 insertions(+), 49 deletions(-)
>

Reviewed-by: Andrew Jones 
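
A small illustration of what the half-open convention buys (helper name and
output format are made up for the example):

#include <inttypes.h>
#include <stdio.h>

typedef uint64_t phys_addr_t;

static void show_region(const char *tag, phys_addr_t start, phys_addr_t end)
{
	/* start == end now cleanly expresses "nothing here" */
	printf("%s: [%#" PRIx64 ", %#" PRIx64 ")%s\n",
	       tag, start, end, start == end ? " (empty)" : "");
}

show_region("USED", base, base) then prints an unambiguous empty range before
the first allocation.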


Re: [kvm-unit-tests RFC PATCH 03/19] lib/alloc_phys: Use phys_alloc_aligned_safe and rename it to memalign_early

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:42AM +0100, Alexandru Elisei wrote:
> phys_alloc_aligned_safe() is called only by early_memalign() and the safe
> parameter is always true. In the spirit of simplifying the code, merge the
> two functions together. Rename it to memalign_early(), to match the naming
> scheme used by the page allocator.
> 
> Change the type of top_safe to phys_addr_t, to match the type of the top
> and base variables describing the available physical memory; this is a
> cosmetic change only, since libcflat.h defines phys_addr_t as an alias
> for u64.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/alloc_phys.c | 38 ++
>  1 file changed, 14 insertions(+), 24 deletions(-)
>

Reviewed-by: Andrew Jones 


Re: [kvm-unit-tests RFC PATCH 02/19] lib/alloc_phys: Initialize align_min

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:41AM +0100, Alexandru Elisei wrote:
> Commit 11c4715fbf87 ("alloc: implement free") changed align_min from a
> static variable to a field for the alloc_ops struct and carried over the
> initializer value of DEFAULT_MINIMUM_ALIGNMENT.
> 
> Commit 7e3e823b78c0 ("lib/alloc.h: remove align_min from struct
> alloc_ops") removed the align_min field and changed it back to a static
> variable, but missed initializing it.
> 
> Initialize align_min to DEFAULT_MINIMUM_ALIGNMENT, as it was intended.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  lib/alloc_phys.c | 7 +++
>  lib/alloc_phys.h | 2 --
>  2 files changed, 3 insertions(+), 6 deletions(-)

Reviewed-by: Andrew Jones 
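
The fix itself is a one-liner; something like this in lib/alloc_phys.c
(sketch, with DEFAULT_MINIMUM_ALIGNMENT coming from the existing header as the
commit message describes):

/* give align_min back its intended default instead of leaving it zero */
static size_t align_min = DEFAULT_MINIMUM_ALIGNMENT;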


Re: [kvm-unit-tests RFC PATCH 01/19] Makefile: Define __ASSEMBLY__ for assembly files

2022-09-20 Thread Andrew Jones
On Tue, Aug 09, 2022 at 10:15:40AM +0100, Alexandru Elisei wrote:
> There are 25 header files today (found with grep -r "#ifndef __ASSEMBLY__")
> whose functionality relies on the __ASSEMBLY__ preprocessor constant being
> correctly defined in order to work. So far, kvm-unit-tests has relied on
> the assembly files to define the constant before including any header
> files which depend on it.
> 
> Let's make sure that nobody gets this wrong and define it as a compiler
> constant when compiling assembly files. __ASSEMBLY__ is now defined for all
> .S files, even those that didn't set it explicitly before.
> 
> Signed-off-by: Alexandru Elisei 
> ---
>  Makefile   | 5 -
>  arm/cstart.S   | 1 -
>  arm/cstart64.S | 1 -
>  powerpc/cstart64.S | 1 -
>  4 files changed, 4 insertions(+), 4 deletions(-)
>

Reviewed-by: Andrew Jones 
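
For context, this is the kind of guard the constant controls; if __ASSEMBLY__
is not defined when a .S file pulls in such a header, the assembler trips over
the C-only parts. Illustrative snippet, not one of the 25 real headers (u64 and
phys_addr_t are the libcflat types):

#define PAGE_SHIFT	12			/* visible to both C and assembly */

#ifndef __ASSEMBLY__
typedef u64 pteval_t;				/* C-only, hidden from .S files */
extern phys_addr_t __virt_to_phys(unsigned long addr);
#endif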


Re: [PATCH v7 09/13] KVM: selftests: aarch64: Add aarch64/page_fault_test

2022-09-20 Thread Andrew Jones
On Tue, Sep 20, 2022 at 04:25:47AM +, Ricardo Koller wrote:
> Add a new test for stage 2 faults when using different combinations of
> guest accesses (e.g., write, S1PTW), backing source type (e.g., anon)
> and types of faults (e.g., read on hugetlbfs with a hole). The next
> commits will add different handling methods and more faults (e.g., uffd
> and dirty logging). This first commit starts by adding two sanity checks
> for all types of accesses: AF setting by the hw, and accessing memslots
> with holes.
> 
> Signed-off-by: Ricardo Koller 
> ---
>  tools/testing/selftests/kvm/.gitignore|   1 +
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../selftests/kvm/aarch64/page_fault_test.c   | 601 ++
>  .../selftests/kvm/include/aarch64/processor.h |   8 +
>  4 files changed, 611 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/aarch64/page_fault_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore 
> b/tools/testing/selftests/kvm/.gitignore
> index d625a3f83780..7a9022cfa033 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -3,6 +3,7 @@
>  /aarch64/debug-exceptions
>  /aarch64/get-reg-list
>  /aarch64/hypercalls
> +/aarch64/page_fault_test
>  /aarch64/psci_test
>  /aarch64/vcpu_width_config
>  /aarch64/vgic_init
> diff --git a/tools/testing/selftests/kvm/Makefile 
> b/tools/testing/selftests/kvm/Makefile
> index 1bb471aeb103..850e317b9e82 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -149,6 +149,7 @@ TEST_GEN_PROGS_aarch64 += aarch64/arch_timer
>  TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
>  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
>  TEST_GEN_PROGS_aarch64 += aarch64/hypercalls
> +TEST_GEN_PROGS_aarch64 += aarch64/page_fault_test
>  TEST_GEN_PROGS_aarch64 += aarch64/psci_test
>  TEST_GEN_PROGS_aarch64 += aarch64/vcpu_width_config
>  TEST_GEN_PROGS_aarch64 += aarch64/vgic_init
> diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c 
> b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> new file mode 100644
> index ..98f12e7af14b
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> @@ -0,0 +1,601 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * page_fault_test.c - Test stage 2 faults.
> + *
> + * This test tries different combinations of guest accesses (e.g., write,
> + * S1PTW), backing source type (e.g., anon) and types of faults (e.g., read 
> on
> + * hugetlbfs with a hole). It checks that the expected handling method is
> + * called (e.g., uffd faults with the right address and write/read flag).
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "guest_modes.h"
> +#include "userfaultfd_util.h"
> +
> +/* Guest virtual addresses that point to the test page and its PTE. */
> +#define TEST_GVA 0xc000
> +#define TEST_EXEC_GVA(TEST_GVA + 0x8)
> +#define TEST_PTE_GVA 0xb000
> +#define TEST_DATA0x0123456789ABCDEF
> +
> +static uint64_t *guest_test_memory = (uint64_t *)TEST_GVA;
> +
> +#define CMD_NONE (0)
> +#define CMD_SKIP_TEST(1ULL << 1)
> +#define CMD_HOLE_PT  (1ULL << 2)
> +#define CMD_HOLE_DATA(1ULL << 3)
> +
> +#define PREPARE_FN_NR10
> +#define CHECK_FN_NR  10
> +
> +struct test_desc {
> + const char *name;
> + uint64_t mem_mark_cmd;
> + /* Skip the test if any prepare function returns false */
> + bool (*guest_prepare[PREPARE_FN_NR])(void);
> + void (*guest_test)(void);
> + void (*guest_test_check[CHECK_FN_NR])(void);
> + void (*dabt_handler)(struct ex_regs *regs);
> + void (*iabt_handler)(struct ex_regs *regs);
> + uint32_t pt_memslot_flags;
> + uint32_t data_memslot_flags;
> + bool skip;
> +};
> +
> +struct test_params {
> + enum vm_mem_backing_src_type src_type;
> + struct test_desc *test_desc;
> +};
> +
> +static inline void flush_tlb_page(uint64_t vaddr)
> +{
> + uint64_t page = vaddr >> 12;
> +
> + dsb(ishst);
> + asm volatile("tlbi vaae1is, %0" :: "r" (page));
> + dsb(ish);
> + isb();
> +}
> +
> +static void guest_write64(void)
> +{
> + uint64_t val;
> +
> + WRITE_ONCE(*guest_test_memory, TEST_DATA);
> + val = READ_ONCE(*guest_test_memory);
> + GUEST_ASSERT_EQ(val, TEST_DATA);
> +}
> +
> +/* Check the system for atomic instructions. */
> +static bool guest_check_lse(void)
> +{
> + uint64_t isar0 = read_sysreg(id_aa64isar0_el1);
> + uint64_t atomic;
> +
> + atomic = FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR0_ATOMICS), isar0);
> + return atomic >= 2;
> +}
> +
> +static bool 

Re: [PATCH v7 08/13] KVM: selftests: Use the right memslot for code, page-tables, and data allocations

2022-09-20 Thread Andrew Jones
On Tue, Sep 20, 2022 at 04:25:46AM +, Ricardo Koller wrote:
> The previous commit added support for callers of vm_create() to specify
> what memslots to use for code, page-tables, and data allocations. Change
> them accordingly:
> 
> - stacks, code, and exception tables use the code memslot
> - page tables and the pgd use the pt memslot
> - data (anything allocated with vm_vaddr_alloc()) uses the data memslot
> 
> No functional change intended. All allocators keep using memslot #0.
> 
> Reviewed-by: Oliver Upton 
> Cc: Sean Christopherson 
> Cc: Andrew Jones 
> Signed-off-by: Ricardo Koller 
> ---
>  .../selftests/kvm/include/kvm_util_base.h |  3 +
>  .../selftests/kvm/lib/aarch64/processor.c | 11 ++--
>  tools/testing/selftests/kvm/lib/elf.c |  3 +-
>  tools/testing/selftests/kvm/lib/kvm_util.c| 57 ---
>  .../selftests/kvm/lib/riscv/processor.c   |  7 ++-
>  .../selftests/kvm/lib/s390x/processor.c   |  7 ++-
>  .../selftests/kvm/lib/x86_64/processor.c  | 13 +++--
>  7 files changed, 61 insertions(+), 40 deletions(-)
>

Reviewed-by: Andrew Jones 


Re: [PATCH v7 07/13] KVM: selftests: Add vm->memslots[] and enum kvm_mem_region_type

2022-09-20 Thread Andrew Jones
On Tue, Sep 20, 2022 at 04:25:45AM +, Ricardo Koller wrote:
> The vm_create() helpers are hardcoded to place most page types (code,
> page-tables, stacks, etc) in the same memslot #0, and always backed with
> anonymous 4K.  There are a couple of issues with that.  First, tests that want
> to differ a bit, like placing page-tables in a different backing source type,
> must replicate much of what's already done by the vm_create() functions.
> Second, the hardcoded assumption of memslot #0 holding most things is spread
> everywhere; this makes it very hard to change.
> 
> Fix the above issues by having selftests specify how they want memory to be
> laid out. Start by changing vm_create() to not create memslot #0; a future
> commit will add a test that specifies all memslots used by the VM.  Then, add 
> the
> vm->memslots[] array to specify the right memslot for different memory
> allocators, e.g.: lib/elf should use the vm->memslots[MEM_REGION_CODE] memslot.  This
> will be used as a way to specify the page-tables memslots (to be backed by 
> huge
> pages for example).
> 
> There is no functional change intended. The current commit lays out memory
> exactly as before. The next commit will change the allocators to get the 
> region
> they should be using, e.g. the page table allocators using the pt
> memslot.
> 
> Cc: Sean Christopherson 
> Cc: Andrew Jones 
> Signed-off-by: Ricardo Koller 
> ---
>  .../selftests/kvm/include/kvm_util_base.h | 25 +--
>  tools/testing/selftests/kvm/lib/kvm_util.c| 18 +++--
>  2 files changed, 33 insertions(+), 10 deletions(-)
>

Reviewed-by: Andrew Jones 
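
A rough sketch of the direction (the region names follow the commit message;
the exact selftests definitions may differ):

#include <stdint.h>

enum kvm_mem_region_type {
	MEM_REGION_CODE,
	MEM_REGION_PT,
	MEM_REGION_DATA,
	NR_MEM_REGIONS,
};

struct kvm_vm {
	/* ... existing fields ... */
	uint32_t memslots[NR_MEM_REGIONS];
};

/* e.g. a page-table allocator picks its memslot through the table
 * instead of hardcoding memslot #0 */
static uint32_t pt_memslot_of(struct kvm_vm *vm)
{
	return vm->memslots[MEM_REGION_PT];
}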