[kvm-unit-tests PATCH v2 1/2] arm/arm64: mmu: Teach virt_to_pte_phys() about block descriptors

2022-11-24 Thread Alexandru Elisei
The arm and arm64 architectures allow a virtual address to be mapped using
a block descriptor (or huge page, as Linux calls it), and the function
mmu_set_range_sect() is made available for a test to do just that. But
virt_to_pte_phys() assumes that all virtual addresses are mapped with page
granularity, which can lead to erroneous addresses being returned in the
case of block mappings.
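
To illustrate the arithmetic, here is a standalone sketch (not part of the
patch; the 2MB block size and 48-bit output address assume a 4k granule):

#include <stdint.h>

#define PAGE_SHIFT      12
#define PMD_SHIFT       21                      /* 2MB block with a 4k granule */
#define PAGE_MASK       (~(((uint64_t)1 << PAGE_SHIFT) - 1))
#define PMD_MASK        (~(((uint64_t)1 << PMD_SHIFT) - 1))
#define PHYS_MASK       (((uint64_t)1 << 48) - 1)

/*
 * Compose a physical address from a descriptor's output address plus the
 * offset of the virtual address inside the page or block it maps. Using
 * PAGE_MASK for a block descriptor keeps only VA bits [11:0] and silently
 * drops bits [20:12], which is the kind of wrong result described above.
 */
static uint64_t desc_to_phys(uint64_t desc, uint64_t vaddr, int is_block)
{
        uint64_t mask = is_block ? PMD_MASK : PAGE_MASK;

        return (desc & PHYS_MASK & mask) | (vaddr & ~mask);
}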

Signed-off-by: Alexandru Elisei 
---
 lib/arm/mmu.c | 89 +++
 1 file changed, 54 insertions(+), 35 deletions(-)

diff --git a/lib/arm/mmu.c b/lib/arm/mmu.c
index e1a72fe4941f..6022e356ddd4 100644
--- a/lib/arm/mmu.c
+++ b/lib/arm/mmu.c
@@ -111,10 +111,61 @@ pteval_t *install_page(pgd_t *pgtable, phys_addr_t phys, void *virt)
 __pgprot(PTE_WBWA | PTE_USER));
 }
 
-phys_addr_t virt_to_pte_phys(pgd_t *pgtable, void *mem)
+/*
+ * NOTE: The Arm architecture might require the use of a
+ * break-before-make sequence before making changes to a PTE and
+ * certain conditions are met (see Arm ARM D5-2669 for AArch64 and
+ * B3-1378 for AArch32 for more details).
+ */
+pteval_t *mmu_get_pte(pgd_t *pgtable, uintptr_t vaddr)
 {
-   return (*get_pte(pgtable, (uintptr_t)mem) & PHYS_MASK & -PAGE_SIZE)
-   + ((ulong)mem & (PAGE_SIZE - 1));
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   if (!mmu_enabled())
+   return NULL;
+
+   pgd = pgd_offset(pgtable, vaddr);
+   if (!pgd_valid(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, vaddr);
+   if (!pud_valid(*pud))
+   return NULL;
+
+   pmd = pmd_offset(pud, vaddr);
+   if (!pmd_valid(*pmd))
+   return NULL;
+   if (pmd_huge(*pmd))
+   return &pmd_val(*pmd);
+
+   pte = pte_offset(pmd, vaddr);
+   if (!pte_valid(*pte))
+   return NULL;
+
+   return &pte_val(*pte);
+}
+
+phys_addr_t virt_to_pte_phys(pgd_t *pgtable, void *virt)
+{
+   phys_addr_t mask;
+   pteval_t *pteval;
+
+   pteval = mmu_get_pte(pgtable, (uintptr_t)virt);
+   if (!pteval) {
+   install_page(pgtable, (phys_addr_t)(unsigned long)virt, virt);
+   return (phys_addr_t)(unsigned long)virt;
+   }
+
+   if (pmd_huge(__pmd(*pteval)))
+   mask = PMD_MASK;
+   else
+   mask = PAGE_MASK;
+
+   return (*pteval & PHYS_MASK & mask) |
+   ((phys_addr_t)(unsigned long)virt & ~mask);
 }
 
 void mmu_set_range_ptes(pgd_t *pgtable, uintptr_t virt_offset,
@@ -231,38 +282,6 @@ unsigned long __phys_to_virt(phys_addr_t addr)
return addr;
 }
 
-/*
- * NOTE: The Arm architecture might require the use of a
- * break-before-make sequence before making changes to a PTE and
- * certain conditions are met (see Arm ARM D5-2669 for AArch64 and
- * B3-1378 for AArch32 for more details).
- */
-pteval_t *mmu_get_pte(pgd_t *pgtable, uintptr_t vaddr)
-{
-   pgd_t *pgd;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-
-   if (!mmu_enabled())
-   return NULL;
-
-   pgd = pgd_offset(pgtable, vaddr);
-   assert(pgd_valid(*pgd));
-   pud = pud_offset(pgd, vaddr);
-   assert(pud_valid(*pud));
-   pmd = pmd_offset(pud, vaddr);
-   assert(pmd_valid(*pmd));
-
-   if (pmd_huge(*pmd))
-   return &pmd_val(*pmd);
-
-   pte = pte_offset(pmd, vaddr);
-   assert(pte_valid(*pte));
-
-   return &pte_val(*pte);
-}
-
 void mmu_clear_user(pgd_t *pgtable, unsigned long vaddr)
 {
pteval_t *p_pte = mmu_get_pte(pgtable, vaddr);
-- 
2.37.0



[kvm-unit-tests PATCH v2 0/2] arm/arm64: teach virt_to_pte_phys() about block descriptors

2022-11-24 Thread Alexandru Elisei
I was writing a quick test when I noticed that arm's implementation of
__virt_to_phys(), which ends up calling virt_to_pte_phys(), doesn't handle
block mappings and returns a bogus value. When fixing it I got confused
about mmu_get_pte() and get_pte(), so I (hopefully) improved things by
renaming mmu_get_pte() to follow_pte().

Changes since v1:

- Dropped patch #1 ("lib/vmalloc: Treat virt_to_pte_phys() as returning a
  physical address") because it was incorrect.
- Dropped the check for pte_valid() for the return value of mmu_get_pte() in
  patch #1 because mmu_get_pte() returns NULL for an invalid descriptor.

Lightly tested on a rockpro64 (4k and 64k pages, arm64 and arm, qemu only)
because the changes from the previous version are trivial.

Alexandru Elisei (2):
  arm/arm64: mmu: Teach virt_to_pte_phys() about block descriptors
  arm/arm64: mmu: Rename mmu_get_pte() -> follow_pte()

 lib/arm/asm/mmu-api.h |  2 +-
 lib/arm/mmu.c | 88 +--
 2 files changed, 53 insertions(+), 37 deletions(-)

-- 
2.37.0



[kvm-unit-tests PATCH v2 2/2] arm/arm64: mmu: Rename mmu_get_pte() -> follow_pte()

2022-11-24 Thread Alexandru Elisei
The function get_pte() from mmu.c returns a pointer to the PTE
associated with the requested virtual address, mapping the virtual
address in the process if it's not already mapped.

mmu_get_pte() returns a pointer to the PTE if and only if the virtual
address is mapped in pgtable, otherwise it returns NULL. Rename it to
follow_pte() to avoid any confusion with get_pte(). follow_pte() also
matches the name of the Linux kernel function with a similar purpose.

Also remove the mmu_enabled() check from the function, as the purpose of
the function is to get the mapping for the virtual address in the pgtable
supplied as the argument, not to translate the virtual address to a
physical address using the current translation; that's what
virt_to_phys() does.
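
As an illustration of the resulting calling convention, a hypothetical test
helper (not part of this patch; it assumes the usual lib/arm includes) could
look like this:

static bool is_block_mapped(pgd_t *pgtable, void *virt)
{
        pteval_t *ptep = follow_pte(pgtable, (uintptr_t)virt);

        /*
         * follow_pte() only walks the supplied pgtable and returns NULL
         * when the address is not mapped there at all.
         */
        if (!ptep)
                return false;

        /* Same block-descriptor test that virt_to_pte_phys() uses. */
        return pmd_huge(__pmd(*ptep));
}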

Signed-off-by: Alexandru Elisei 
---
 lib/arm/asm/mmu-api.h | 2 +-
 lib/arm/mmu.c | 9 +++--
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/lib/arm/asm/mmu-api.h b/lib/arm/asm/mmu-api.h
index 3d77cbfd8b24..6c1136d957f9 100644
--- a/lib/arm/asm/mmu-api.h
+++ b/lib/arm/asm/mmu-api.h
@@ -17,6 +17,6 @@ extern void mmu_set_range_sect(pgd_t *pgtable, uintptr_t virt_offset,
 extern void mmu_set_range_ptes(pgd_t *pgtable, uintptr_t virt_offset,
   phys_addr_t phys_start, phys_addr_t phys_end,
   pgprot_t prot);
-extern pteval_t *mmu_get_pte(pgd_t *pgtable, uintptr_t vaddr);
+extern pteval_t *follow_pte(pgd_t *pgtable, uintptr_t vaddr);
 extern void mmu_clear_user(pgd_t *pgtable, unsigned long vaddr);
 #endif
diff --git a/lib/arm/mmu.c b/lib/arm/mmu.c
index 6022e356ddd4..18e32b2b8927 100644
--- a/lib/arm/mmu.c
+++ b/lib/arm/mmu.c
@@ -117,16 +117,13 @@ pteval_t *install_page(pgd_t *pgtable, phys_addr_t phys, void *virt)
  * certain conditions are met (see Arm ARM D5-2669 for AArch64 and
  * B3-1378 for AArch32 for more details).
  */
-pteval_t *mmu_get_pte(pgd_t *pgtable, uintptr_t vaddr)
+pteval_t *follow_pte(pgd_t *pgtable, uintptr_t vaddr)
 {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
 
-   if (!mmu_enabled())
-   return NULL;
-
pgd = pgd_offset(pgtable, vaddr);
if (!pgd_valid(*pgd))
return NULL;
@@ -153,7 +150,7 @@ phys_addr_t virt_to_pte_phys(pgd_t *pgtable, void *virt)
phys_addr_t mask;
pteval_t *pteval;
 
-   pteval = mmu_get_pte(pgtable, (uintptr_t)virt);
+   pteval = follow_pte(pgtable, (uintptr_t)virt);
if (!pteval) {
install_page(pgtable, (phys_addr_t)(unsigned long)virt, virt);
return (phys_addr_t)(unsigned long)virt;
@@ -284,7 +281,7 @@ unsigned long __phys_to_virt(phys_addr_t addr)
 
 void mmu_clear_user(pgd_t *pgtable, unsigned long vaddr)
 {
-   pteval_t *p_pte = mmu_get_pte(pgtable, vaddr);
+   pteval_t *p_pte = follow_pte(pgtable, vaddr);
if (p_pte) {
pteval_t entry = *p_pte & ~PTE_USER;
WRITE_ONCE(*p_pte, entry);
-- 
2.37.0



Re: [External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check

2022-11-24 Thread Usama Arif




On 18/11/2022 00:20, Marc Zyngier wrote:

On Mon, 07 Nov 2022 12:00:44 +,
Usama Arif  wrote:




On 06/11/2022 16:35, Marc Zyngier wrote:

On Fri, 04 Nov 2022 06:20:59 +,
Usama Arif  wrote:


This patchset adds support for vcpu_is_preempted in arm64, which
allows the guest to check if a vcpu was scheduled out, which is
useful to know incase it was holding a lock. vcpu_is_preempted can
be used to improve performance in locking (see owner_on_cpu usage in
mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
and osq_lock) and scheduling (see available_idle_cpu which is used
in several places in kernel/sched/fair.c for e.g. in wake_affine to
determine which CPU can run soonest):


[...]


pvcy shows a smaller overall improvement (50%) compared to
vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
~60% of the host time when using pvcy is spent in kvm_handle_wfx,
compared with ~1.5% when using vcpu_is_preempted, hence
vcpu_is_preempted shows a larger improvement.


And have you worked out *why* we spend so much time handling WFE?

M.


It's from the following change in the pvcy patchset:

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index e778eefcf214..915644816a85 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
 }

 if (esr & ESR_ELx_WFx_ISS_WFE) {
-   kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
+   int state;
+   while ((state = kvm_pvcy_check_state(vcpu)) == 0)
+           schedule();
+
+   if (state == -1)
+           kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
 } else {
 if (esr & ESR_ELx_WFx_ISS_WFxT)
 vcpu_set_flag(vcpu, IN_WFIT);


If my understanding of the pvcy changes is correct, whenever pvcy
returns an unchanged vcpu state, we schedule to another vcpu, and it's
the constant scheduling where the time is spent. I guess the effects are
much higher when the lock contention is very high. This can be seen from
the pvcy host side flamegraph as well (~67% of the time is spent in the
schedule() call in kvm_handle_wfx). For reference, I have put the graph
at:
https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg


The real issue here is that we don't try to pick the right vcpu to
run, and strictly rely on schedule() to eventually pick something that
can run.

An interesting thing to do would be to try and fit the directed yield
mechanism there. It would be a lot more interesting than the one-off
vcpu_is_preempted hack, as it gives us a low-level primitive on which
to construct things (pvcy is effectively a mwait-like primitive).


We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
would we determine which vcpu to yield to?


IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in the 
scheduler to better determine which vCPU we can run on soonest, select 
an idle core, etc. I am not sure whether all of these cases would be 
optimized by pvcy. Also, with vcpu_is_preempted, some of the lock-heavy 
benchmarks come down from spending around 50% of the time in locking to 
less than 1% (so I am not sure how much more room there is for improvement).
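
For context, the check those lock call sites rely on boils down to something
like the sketch below (simplified from memory, see owner_on_cpu() in the
kernel sources for the exact definition):

static bool owner_worth_spinning_on(struct task_struct *owner)
{
        /*
         * Keep optimistically spinning only while the lock owner is
         * actually running; vcpu_is_preempted() lets the guest see that
         * the owner's vCPU was scheduled out by the host, in which case
         * spinning is wasted work.
         */
        return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
}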


We could also use vcpu_is_preempted to optimize IPI performance (along 
with directed yield to the target IPI vCPU) similar to how it's done in x86 
(https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpen...@tencent.com/). 
This case definitely won't be covered by pvcy.


Considering all the above, i.e. the core kernel integration already 
present and possible future use cases of vcpu_is_preempted, maybe it's 
worth making vcpu_is_preempted work on arm independently of pvcy?


Thanks,
Usama



M.




Re: [PATCH v5 0/8] KVM: arm64: permit MAP_SHARED mappings with MTE enabled

2022-11-24 Thread Marc Zyngier
On Fri, 04 Nov 2022 17:42:27 +,
Peter Collingbourne  wrote:
> 
> On Fri, Nov 4, 2022 at 9:23 AM Marc Zyngier  wrote:
> >
> > On Fri, 04 Nov 2022 01:10:33 +,
> > Peter Collingbourne  wrote:
> > >
> > > Hi,
> > >
> > > This patch series allows VMMs to use shared mappings in MTE enabled
> > > guests. The first five patches were taken from Catalin's tree [1] which
> > > addressed some review feedback from when they were previously sent out
> > > as v3 of this series. The first patch from Catalin's tree makes room
> > > for an additional PG_arch_3 flag by making the newer PG_arch_* flags
> > > arch-dependent. The next four patches are based on a series that
> > > Catalin sent out prior to v3, whose cover letter [2] I quote from below:
> > >
> > > > This series aims to fix the races between initialising the tags on a
> > > > page and setting the PG_mte_tagged flag. Currently the flag is set
> > > > either before or after that tag initialisation and this can lead to CoW
> > > > copying stale tags. The first patch moves the flag setting after the
> > > > tags have been initialised, solving the CoW issue. However, concurrent
> > > > mprotect() on a shared mapping may (very rarely) lead to valid tags
> > > > being zeroed.
> > > >
> > > > The second skips the sanitise_mte_tags() call in kvm_set_spte_gfn(),
> > > > deferring it to user_mem_abort(). The outcome is that now
> > > > sanitise_mte_tags() can be simplified to skip the pfn_to_online_page()
> > > > check and only rely on the VM_MTE_ALLOWED vma flag that can be checked in
> > > > user_mem_abort().
> > > >
> > > > The third and fourth patches use PG_arch_3 as a lock for page tagging,
> > > > based on Peter Collingbourne's idea of a two-bit lock.
> > > >
> > > > I think the first patch can be queued but the rest needs some in depth
> > > > review and test. With this series (if correct) we could allow MAP_SHARED
> > > > on KVM guest memory but this is to be discussed separately as there are
> > > > some KVM ABI implications.
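
(The "two-bit lock" reads, in the abstract, something like the sketch below;
the flag names are made up and this is not the arm64 implementation, just the
shape of the idea: one bit says the tags are valid, the other serialises the
initialisation so concurrent faults cannot race.)

#include <stdatomic.h>
#include <stdbool.h>

#define TAGS_VALID      (1u << 0)       /* stand-in for PG_mte_tagged */
#define TAGS_LOCKED     (1u << 1)       /* stand-in for the PG_arch_3 lock bit */

static bool try_lock_for_tagging(_Atomic unsigned int *flags)
{
        /* Only the thread that sets TAGS_LOCKED first initialises the tags. */
        if (atomic_fetch_or(flags, TAGS_LOCKED) & TAGS_LOCKED) {
                while (!(atomic_load(flags) & TAGS_VALID))
                        ;       /* wait for the winner to publish the tags */
                return false;
        }
        return true;
}

static void mark_tags_valid(_Atomic unsigned int *flags)
{
        atomic_fetch_or(flags, TAGS_VALID);     /* publish after initialisation */
}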
> > >
> > > In this v5 I rebased Catalin's tree onto -next again. Please double check
> >
> > Please don't do use -next as a base. In-flight series should be based
> > on a *stable* tag, either 6.0 or one of the early -RCs. If there is a
> > known conflict with -next, do mention it in the cover letter and
> > provide a resolution.
> 
> Okay, I will keep that in mind.
> 
> > > my rebase, which resolved the conflict with commit a8e5e5146ad0 ("arm64:
> > > mte: Avoid setting PG_mte_tagged if no tags cleared or restored").
> >
> > This commit seems part of -rc1, so I guess the patches directly apply
> > on top of that tag?
> 
> Yes, sorry, this also applies cleanly to -rc1.
> 
> > > I now have Reviewed-by for all patches except for the last one, which adds
> > > the documentation. Thanks for the reviews so far, and please take a look!
> >
> > I'd really like the MM folks (list now cc'd) to look at the relevant
> > patches (1 and 5) and ack them before I take this.
> 
> Okay, here are the lore links for the convenience of the MM folks:
> https://lore.kernel.org/all/20221104011041.290951-2-...@google.com/
> https://lore.kernel.org/all/20221104011041.290951-6-...@google.com/

I have not seen any Ack from the MM folks so far, and we're really
running out of runway for this merge window.

Short of someone shouting now, I'll take the series into the kvmarm
tree early next week.

Thanks,


-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v4 13/16] KVM: arm64: PMU: Implement PMUv3p5 long counter support

2022-11-24 Thread Marc Zyngier
On Wed, 23 Nov 2022 17:11:41 +,
Reiji Watanabe  wrote:
> 
> Hi Marc,
> 
> On Wed, Nov 23, 2022 at 3:11 AM Marc Zyngier  wrote:
> >
> > On Wed, 23 Nov 2022 05:58:17 +,
> > Reiji Watanabe  wrote:
> > >
> > > Hi Marc,
> > >
> > > On Sun, Nov 13, 2022 at 8:46 AM Marc Zyngier  wrote:
> > > >
> > > > PMUv3p5 (which is mandatory with ARMv8.5) comes with some extra
> > > > features:
> > > >
> > > > - All counters are 64bit
> > > >
> > > > - The overflow point is controlled by the PMCR_EL0.LP bit
> > > >
> > > > Add the required checks in the helpers that control counter
> > > > width and overflow, as well as the sysreg handling for the LP
> > > > bit. A new kvm_pmu_is_3p5() helper makes it easy to spot the
> > > > PMUv3p5 specific handling.
> > > >
> > > > Signed-off-by: Marc Zyngier 
> > > > ---
> > > >  arch/arm64/kvm/pmu-emul.c | 8 +---
> > > >  arch/arm64/kvm/sys_regs.c | 4 
> > > >  include/kvm/arm_pmu.h | 7 +++
> > > >  3 files changed, 16 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> > > > index 4320c389fa7f..c37cc67ff1d7 100644
> > > > --- a/arch/arm64/kvm/pmu-emul.c
> > > > +++ b/arch/arm64/kvm/pmu-emul.c
> > > > @@ -52,13 +52,15 @@ static u32 kvm_pmu_event_mask(struct kvm *kvm)
> > > >   */
> > > >  static bool kvm_pmu_idx_is_64bit(struct kvm_vcpu *vcpu, u64 select_idx)
> > > >  {
> > > > -   return (select_idx == ARMV8_PMU_CYCLE_IDX);
> > > > +   return (select_idx == ARMV8_PMU_CYCLE_IDX || kvm_pmu_is_3p5(vcpu));
> > > >  }
> > > >
> > > >  static bool kvm_pmu_idx_has_64bit_overflow(struct kvm_vcpu *vcpu, u64 select_idx)
> > > >  {
> > > > -   return (select_idx == ARMV8_PMU_CYCLE_IDX &&
> > > > -   __vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_LC);
> > > > +   u64 val = __vcpu_sys_reg(vcpu, PMCR_EL0);
> > > > +
> > > > +   return (select_idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) ||
> > > > +          (select_idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC));
> > >
> > > Since the vCPU's PMCR_EL0 value is not always in sync with
> > > kvm->arch.dfr0_pmuver.imp, shouldn't kvm_pmu_idx_has_64bit_overflow()
> > > check kvm_pmu_is_3p5() ?
> > > (e.g. when the host supports PMUv3p5, PMCR.LP will be set by reset_pmcr()
> > > initially. Then, even if userspace sets ID_AA64DFR0_EL1.PMUVER to
> > > PMUVer_V3P1, PMCR.LP will stay the same (still set) unless PMCR is
> > > written.  So, kvm_pmu_idx_has_64bit_overflow() might return true
> > > even though the guest's PMU version is lower than PMUVer_V3P5.)

I realised that reset_pmcr() cannot result in LP being set early, as
the default PMU version isn't PMUv3p5. But I'm starting to think that
we should stop playing random tricks with PMCR reset value, and make
the whole thing as straightforward as possible. TBH, the only
information we actually need from the host is PMCR_EL0.N, so we should
limit ourselves to that.
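
(That is the five-bit PMCR_EL0.N field, bits [15:11]. A trivial sketch of
pulling it out of the host value, reusing the kernel's existing
ARMV8_PMU_PMCR_N_* definitions; the helper name here is made up:)

static u8 host_pmcr_n(u64 host_pmcr)
{
        return (host_pmcr >> ARMV8_PMU_PMCR_N_SHIFT) & ARMV8_PMU_PMCR_N_MASK;
}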

> >
> > I can see two ways to address this: either we spray PMUv3p5 checks
> > every time we evaluate PMCR, or we sanitise PMCR after each userspace
> > write to ID_AA64DFR0_EL1.
> >
> > I'd like to be able to take what is stored in the register file at
> > face value, so I'm angling towards the second possibility. It also
> 
> I thought about that too.  What makes it a bit tricky is that
> given that kvm->arch.dfr0_pmuver.imp is shared among all vCPUs
> for the guest, updating the PMCR should be done for all the vCPUs.

Yeah, good point. This really is a mess.

> > +static void update_dfr0_pmuver(struct kvm_vcpu *vcpu, u8 pmuver)
> > +{
> > +   if (vcpu->kvm->arch.dfr0_pmuver.imp != pmuver) {
> > +           vcpu->kvm->arch.dfr0_pmuver.imp = pmuver;
> > +           __reset_pmcr(vcpu, __vcpu_sys_reg(vcpu, PMCR_EL0));
> > +   }
> > +}
> 
> Or if userspace is expected to set ID_AA64DFR0_EL1 (PMUVER) for
> each vCPU, update_dfr0_pmuver() should update PMCR even when
> 'kvm->arch.dfr0_pmuver.imp' is the same as the given 'pmuver'.
> (as PMCR for the vCPU might not have been updated yet)
> 
> > makes some sense from a 'HW' perspective: you change the HW
> > dynamically by selecting a new version, the HW comes up with its reset
> > configuration (i.e don't expect PMCR to stick after you write to
> > DFR0 with a different PMUVer).
> 
> Yes, it makes some sense.
> But, with that, for live migration, PMCR should be restored only
> after ID_AA64DFR0_EL1/ID_DFR0_EL1 is restored (to avoid PMCR
> being reset after PMCR is restored), which I would think might
> affect live migration.
> (Or am I misunderstanding something ?)

You are correct. There is no way out of that anyway, as you need to
select PMUv3p5 early in order to be able to restore PMCR itself.

I *may* have a slightly better solution though, which is to piggy-back
on the call to kvm_pmu_handle_pmcr() that happens on the first run of
each vcpu. This would allow us to sanitise PMCR in the other direction