Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 23:22, Sean Christopherson wrote:

On a related topic, any preference on whether to have an explicit "must_lock"
flag (what I posted), or derive the logic based on other params?

The helper I posted does:

if (range->must_lock &&
kvm_mmu_lock_and_check_handler(kvm, range, ))
goto out_unlock;

but it could be:

if (!IS_KVM_NULL_FN(range->on_lock) && !range->may_block &&
kvm_mmu_lock_and_check_handler(kvm, range, ))
goto out_unlock;

The generated code should be nearly identical on a modern compiler, so it's
purely a question of aesthetics.  I slightly prefer the explicit "must_lock" to
avoid spreading out the logic too much, but it also feels a bit superfluous.


I do as well, but I hope we don't need any lock after all as in the 
email I've just sent.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 23:05, Sean Christopherson wrote:

Wouldn't it be incorrect to lock a mutex (e.g. inside*another*  MMU
notifier's invalidate callback) while holding an rwlock_t?  That makes sense
because anybody that's busy waiting in write_lock potentially cannot be
preempted until the other task gets the mutex.  This is a potential
deadlock.


Yes?  I don't think I follow your point though.  Nesting a spinlock or rwlock
inside a rwlock is ok, so long as the locks are always taken in the same order,
i.e. it's never mmu_lock -> mmu_notifier_slots_lock.


*Another* MMU notifier could nest a mutex inside KVM's rwlock.

But... is it correct that the MMU notifier invalidate callbacks are 
always called with the mmap_sem taken (sometimes for reading, e.g. 
try_to_merge_with_ksm_page->try_to_merge_one_page->write_protect_page)? 
 We could take it temporarily in install_memslots, since the MMU 
notifier's mm is stored in kvm->mm.


In this case, a pair of kvm_mmu_notifier_lock/unlock functions would be 
the best way to abstract it.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 22:52, Sean Christopherson wrote:

100% agree with introducing on_lock separately from the conditional locking.

Not so sure about introducing conditional locking and then converting non-x86
archs.  I'd prefer to keep the conditional locking after arch conversion.
If something does go awry, it would be nice to be able to preciesly bisect to
the conditional locking.  Ditto if it needs to be reverted because it breaks an
arch.


Ok, that sounds good too.

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 21:47, Sean Christopherson wrote:

Rereading things, a small chunk of the rwsem nastiness can go away.  I don't see
any reason to use rw_semaphore instead of rwlock_t.


Wouldn't it be incorrect to lock a mutex (e.g. inside *another* MMU 
notifier's invalidate callback) while holding an rwlock_t?  That makes 
sense because anybody that's busy waiting in write_lock potentially 
cannot be preempted until the other task gets the mutex.  This is a 
potential deadlock.


I also thought of busy waiting on down_read_trylock if the MMU notifier 
cannot block, but that would also be invalid for the opposite reason 
(the down_write task might be asleep, waiting for other readers to 
release the task, and the down_read_trylock busy loop might not let that 
task run).



And that's _already_ the worst case since notifications are currently
serialized by mmu_lock.


But right now notifications are not a single critical section, they're 
two, aren't they?


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 22:15, Sean Christopherson wrote:

On Wed, Mar 31, 2021, Paolo Bonzini wrote:

On 26/03/21 03:19, Sean Christopherson wrote:

+   /*
+* Reset the lock used to prevent memslot updates between MMU notifier
+* range_start and range_end.  At this point no more MMU notifiers will
+* run, but the lock could still be held if KVM's notifier was removed
+* between range_start and range_end.  No threads can be waiting on the
+* lock as the last reference on KVM has been dropped.  If the lock is
+* still held, freeing memslots will deadlock.
+*/
+   init_rwsem(>mmu_notifier_slots_lock);


I was going to say that this is nasty, then I noticed that
mmu_notifier_unregister uses SRCU to ensure completion of concurrent calls
to the MMU notifier.  So I guess it's fine, but it's better to point it out:

/*
 * At this point no more MMU notifiers will run and pending
 * calls to range_start have completed, but the lock would
 * still be held and never released if the MMU notifier was
 * removed between range_start and range_end.  Since the last
 * reference to the struct kvm has been dropped, no threads can
 * be waiting on the lock, but we might still end up taking it
 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
 * to avoid deadlocks.
 */


An alternative would be to not take the lock in install_new_memslots() if
kvm->users_count == 0.  It'd be weirder to document, and the conditional locking
would still be quite ugly.  Not sure if that's better than blasting a lock
during destruction?


No, that's worse...

Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread Catalin Marinas
On Wed, Mar 31, 2021 at 11:41:20AM +0100, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
> > On 31.03.21 11:21, Catalin Marinas wrote:
> > > On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> > > > On 30.03.21 12:30, Catalin Marinas wrote:
> > > > > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > > > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > > > > However, the bigger issue is that Stage 2 cannot disable
> > > > > > > tagging for Stage 1 unless the memory is Non-cacheable or
> > > > > > > Device at S2. Is there a way to detect what gets mapped in
> > > > > > > the guest as Normal Cacheable memory and make sure it's
> > > > > > > only early memory or hotplug but no ZONE_DEVICE (or
> > > > > > > something else like on-chip memory)?  If we can't
> > > > > > > guarantee that all Cacheable memory given to a guest
> > > > > > > supports tags, we should disable the feature altogether.
> > > > > > 
> > > > > > In stage 2 I believe we only have two types of mapping -
> > > > > > 'normal' or DEVICE_nGnRE (see stage2_map_set_prot_attr()).
> > > > > > Filtering out the latter is a case of checking the 'device'
> > > > > > variable, and makes sense to avoid the overhead you
> > > > > > describe.
> > > > > > 
> > > > > > This should also guarantee that all stage-2 cacheable
> > > > > > memory supports tags,
> > > > > > as kvm_is_device_pfn() is simply !pfn_valid(), and
> > > > > > pfn_valid() should only
> > > > > > be true for memory that Linux considers "normal".
> > > > 
> > > > If you think "normal" == "normal System RAM", that's wrong; see
> > > > below.
> > > 
> > > By "normal" I think both Steven and I meant the Normal Cacheable memory
> > > attribute (another being the Device memory attribute).
> 
> Sadly there's no good standardised terminology here. Aarch64 provides the
> "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because the
> stage 2 mappings don't allow restricting MTE other than mapping it as Device
> memory).
> 
> So MTE also forces us to have a definition of memory which is "bog standard
> memory"[1] separate from the mapping attributes. This is the main memory
> which fully supports MTE.
> 
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but that
> memory may not support MTE tags. This memory can only be safely shared with
> a guest in the following situations:
> 
>  1. MTE is completely disabled for the guest
> 
>  2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
> 
>  3. We have some guarantee that guest MTE access are in some way safe.
> 
> (1) is the situation today (without this patch series). But it prevents the
> guest from using MTE in any form.
> 
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
> 
> (3) isn't something we have any architectural way of discovering. We'd need
> to know what the device did with the MTE accesses (and any caches between
> the CPU and the device) to ensure there aren't any side-channels or h/w
> lockup issues. We'd also need some way of describing this memory to the
> guest.
> 
> So at least for the time being the approach is to avoid letting a guest with
> MTE enabled have access to this sort of memory.

When a slot is added by the VMM, if it asked MTE in guest (I guess
that's an opt-in by the VMM, haven't checked the other patches), can we
reject it if it's is going to be mapped as Normal Cacheable but it is a
ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
check for ZONE_DEVICE)? This way we don't need to do more expensive
checks in set_pte_at().

We could simplify the set_pte_at() further if we require that the VMM
has a PROT_MTE mapping. This does not mean it cannot have two mappings,
the other without PROT_MTE. But at least we get a set_pte_at() when
swapping in which has PROT_MTE.

We could add another PROT_TAGGED or something which means PG_mte_tagged
set but still mapped as Normal Untagged. It's just that we are short of
pte bits for another flag.

Can we somehow identify when the S2 pte is set and can we get access to
the prior swap pte? This way we could avoid changes to set_pte_at() for
S2 faults.

-- 
Catalin
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH] KVM: arm64: Support PREL/PLT relocs in EL2 code

2021-03-31 Thread David Brazdil
gen-hyprel tool parses object files of the EL2 portion of KVM
and generates runtime relocation data. While only filtering for
R_AARCH64_ABS64 relocations in the input object files, it has an
allow-list of relocation types that are used for relative
addressing. Other, unexpected, relocation types are rejected and
cause the build to fail.

This allow-list did not include the position-relative relocation
types R_AARCH64_PREL64/32/16 and the recently introduced _PLT32.
While not seen used by toolchains in the wild, add them to the
allow-list for completeness.

Fixes: 8c49b5d43d4c ("KVM: arm64: Generate hyp relocation data")
Cc: 
Reported-by: Will Deacon 
Signed-off-by: David Brazdil 
---
 arch/arm64/kvm/hyp/nvhe/gen-hyprel.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm64/kvm/hyp/nvhe/gen-hyprel.c 
b/arch/arm64/kvm/hyp/nvhe/gen-hyprel.c
index ead02c6a7628..6bc88a756cb7 100644
--- a/arch/arm64/kvm/hyp/nvhe/gen-hyprel.c
+++ b/arch/arm64/kvm/hyp/nvhe/gen-hyprel.c
@@ -50,6 +50,18 @@
 #ifndef R_AARCH64_ABS64
 #define R_AARCH64_ABS64257
 #endif
+#ifndef R_AARCH64_PREL64
+#define R_AARCH64_PREL64   260
+#endif
+#ifndef R_AARCH64_PREL32
+#define R_AARCH64_PREL32   261
+#endif
+#ifndef R_AARCH64_PREL16
+#define R_AARCH64_PREL16   262
+#endif
+#ifndef R_AARCH64_PLT32
+#define R_AARCH64_PLT32314
+#endif
 #ifndef R_AARCH64_LD_PREL_LO19
 #define R_AARCH64_LD_PREL_LO19 273
 #endif
@@ -371,6 +383,12 @@ static void emit_rela_section(Elf64_Shdr *sh_rela)
case R_AARCH64_ABS64:
emit_rela_abs64(rela, sh_orig_name);
break;
+   /* Allow position-relative data relocations. */
+   case R_AARCH64_PREL64:
+   case R_AARCH64_PREL32:
+   case R_AARCH64_PREL16:
+   case R_AARCH64_PLT32:
+   break;
/* Allow relocations to generate PC-relative addressing. */
case R_AARCH64_LD_PREL_LO19:
case R_AARCH64_ADR_PREL_LO21:
-- 
2.31.0.291.g576ba9dcdaf-goog

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 31/03/21 18:41, Sean Christopherson wrote:

That said, the easiest way to avoid this would be to always update
mmu_notifier_count.

Updating mmu_notifier_count requires taking mmu_lock, which would defeat the
purpose of these shenanigans.


Okay; I wasn't sure if the problem was contention with page faults in 
general, or just the long critical sections from the MMU notifier 
callbacks.  Still updating mmu_notifier_count unconditionally is a good 
way to break up the patch in two and keep one commit just for the rwsem 
nastiness.



+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   down_write(>mmu_notifier_slots_lock);
+#endif
rcu_assign_pointer(kvm->memslots[as_id], slots);
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   up_write(>mmu_notifier_slots_lock);
+#endif

Please do this unconditionally, the cost is minimal if the rwsem is not
contended (as is the case if the architecture doesn't use MMU notifiers at
all).

It's not the cost, it's that mmu_notifier_slots_lock doesn't exist.  That's an
easily solved problem, but then the lock wouldn't be initialized since
kvm_init_mmu_notifier() is a nop.  That's again easy to solve, but IMO would
look rather weird.  I guess the counter argument is that __kvm_memslots()
wouldn't need #ifdeffery.


Yep.  Less #ifdefs usually wins. :)


These are the to ideas I've come up with:

Option 1:
static int kvm_init_mmu_notifier(struct kvm *kvm)
{
init_rwsem(>mmu_notifier_slots_lock);

#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
kvm->mmu_notifier.ops = _mmu_notifier_ops;
return mmu_notifier_register(>mmu_notifier, current->mm);
#else
return 0;
#endif
}


Option 2 is also okay I guess, but the simplest is option 1 + just init 
it in kvm_create_vm.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code

2021-03-31 Thread Paolo Bonzini

On 31/03/21 18:20, Sean Christopherson wrote:

Every call
to .change_pte() is bookended by .invalidate_range_{start,end}(), i.e. the above
missing kvm->mmu_notifier_seq++ is benign because kvm->mmu_notifier_count is
guaranteed to be non-zero.


In fact ARM even relies on invalidate wrapping the change_pte handler.

/*
 * The MMU notifiers will have unmapped a huge PMD before calling
 * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
 * therefore we never need to clear out a huge PMD through this
 * calling path and a memcache is not required.
 */


Assuming all of the above is correct, I'm very tempted to rip out .change_pte()
entirely.


There is still the performance benefit from immediately remapping the 
page to the new destination without waiting for a fault.  Yes it's 
hypothetical but I would prefer to leave that change for later.


The fact that the count is nonzero means that you will not even have to 
complicate kvm_mmu_notifier_change_pte to handle the removal of 
mmu_notifier_seq; just add a patch before this one to WARN if it is 
zero.  (The rest of my review to patch 16 still holds).


Paolo


It's been dead weight for 8+ years and no one has complained about
KSM+KVM performance (I'd also be curious to know how much performance was gained
by shaving VM-Exits).  As KVM is the only user of .change_pte(), dropping it in
KVM would mean the entire MMU notifier could also go away.



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2] KVM: arm64: Initialize VCPU mdcr_el2 before loading it

2021-03-31 Thread Marc Zyngier
On Wed, 31 Mar 2021 16:25:46 +0100,
Alexandru Elisei  wrote:
> 
> Hi Marc,
> 
> On 3/30/21 9:07 PM, Marc Zyngier wrote:

[...]

> > I think it would be absolutely fine to make the slow path of
> > kvm_vcpu_first_run_init() run with preempt disabled. This happens so
> > rarely that that it isn't worth thinking about it.
> 
> It looks to me like it's a bit too heavy-handed to run the entire
> function kvm_vcpu_first_run_init() with preemption disabled just for
> __this_cpu_read() in kvm_arm_setup_mdcr_el2(). Not because of the
> performance cost (it's negligible, as it's called exactly once in
> the VCPU lifetime), but because it's not obvious why it is needed.
> 
> I tried this:
> 
> @@ -580,7 +580,9 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
>  
>     vcpu->arch.has_run_once = true;
>  
> -   kvm_arm_vcpu_init_debug(vcpu);
> +   preempt_disable();
> +   kvm_arm_setup_mdcr_el2(vcpu);
> +   preempt_enable();
>  
>     if (likely(irqchip_in_kernel(kvm))) {
>     /*
> 
> and it still looks a bit off to me because preemption needs to be
> disabled because of an implementation detail in
> kvm_arm_setup_mdcr_el2(), as the function operates on the VCPU
> struct and preemption can be enabled for that.
> 
> I was thinking something like this:
> 
> @@ -119,7 +119,9 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu, 
> u32
> host_mdcr)
>   */
>  void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu)
>  {
> -   kvm_arm_setup_mdcr_el2(vcpu, this_cpu_read(mdcr_el2));
> +   preempt_disable();
> +   kvm_arm_setup_mdcr_el2(vcpu);
> +   preempt_enable();
>  }
>  
>  /**
> 
> What do you think?

I'm fine with it either way, so pick you favourite!

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2] KVM: arm64: Initialize VCPU mdcr_el2 before loading it

2021-03-31 Thread Alexandru Elisei
Hi Marc,

On 3/30/21 9:07 PM, Marc Zyngier wrote:
> On Tue, 30 Mar 2021 18:13:07 +0100,
> Alexandru Elisei  wrote:
>> Hi Marc,
>>
>> Thanks for having a look!
>>
>> On 3/30/21 10:55 AM, Marc Zyngier wrote:
>>> Hi Alex,
>>>
>>> On Tue, 23 Mar 2021 18:00:57 +,
>>> Alexandru Elisei  wrote:
 When a VCPU is created, the kvm_vcpu struct is initialized to zero in
 kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time
 vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is
 set to a sensible value in kvm_arm_setup_debug() later in the run loop. The
 result is that KVM executes for a short time with MDCR_EL2 set to zero.

 This has several unintended consequences:

 * Setting MDCR_EL2.HPMN to 0 is constrained unpredictable according to ARM
   DDI 0487G.a, page D13-3820. The behavior specified by the architecture
   in this case is for the PE to behave as if MDCR_EL2.HPMN is set to a
   value less than or equal to PMCR_EL0.N, which means that an unknown
   number of counters are now disabled by MDCR_EL2.HPME, which is zero.

 * The host configuration for the other debug features controlled by
   MDCR_EL2 is temporarily lost. This has been harmless so far, as Linux
   doesn't use the other fields, but that might change in the future.

 Let's avoid both issues by initializing the VCPU's mdcr_el2 field in
 kvm_vcpu_vcpu_first_run_init(), thus making sure that the MDCR_EL2 register
 has a consistent value after each vcpu_load().

 Signed-off-by: Alexandru Elisei 
>>> This looks strangely similar to 4942dc6638b0 ("KVM: arm64: Write
>>> arch.mdcr_el2 changes since last vcpu_load on VHE"), just at a
>>> different point. Probably worth a Fixes tag.
>> This bug is present in the commit you are mentioning, and from what
>> I can tell it's also present in the commit it's fixing (d5a21bcc2995
>> ("KVM: arm64: Move common VHE/non-VHE trap config in separate
>> functions")) - vcpu->arch.mdcr_el2 is computed in
>> kvm_arm_setup_debug(), which is called after vcpu_load(). My guess
>> is that this bug is from VHE support was added (or soon after).
> Right. Can you please add a Fixes: tag for the same commit? At least
> that'd be consistent.

Yes, I'll do that.

>
>> I can dig further, how far back in time should I aim for?
>>
 ---
 Found by code inspection. Based on v5.12-rc4.

 Tested on an odroid-c4 with VHE. vcpu->arch.mdcr_el2 is calculated to be
 0x4e66. Without this patch, reading MDCR_EL2 after the first vcpu_load() in
 kvm_arch_vcpu_ioctl_run() returns 0; with this patch it returns the correct
 value, 0xe66 (FEAT_SPE is not implemented by the PE).

 This patch was initially part of the KVM SPE series [1], but those patches
 haven't seen much activity, so I thought it would be a good idea to send
 this patch separately to draw more attention to it.

 Changes in v2:
 * Moved kvm_arm_vcpu_init_debug() earlier in kvm_vcpu_first_run_init() so
   vcpu->arch.mdcr_el2 is calculated even if kvm_vgic_map_resources() fails.
 * Added comment to kvm_arm_setup_mdcr_el2 to explain what testing
   vcpu->guest_debug means.

 [1] https://www.spinics.net/lists/kvm-arm/msg42959.html

  arch/arm64/include/asm/kvm_host.h |  1 +
  arch/arm64/kvm/arm.c  |  3 +-
  arch/arm64/kvm/debug.c| 82 +--
  3 files changed, 59 insertions(+), 27 deletions(-)

 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index 3d10e6527f7d..858c2fcfc043 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -713,6 +713,7 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu 
 *vcpu, int cpu) {}
  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
  
  void kvm_arm_init_debug(void);
 +void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
  void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
  void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
 diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
 index 7f06ba76698d..7088d8fe7186 100644
 --- a/arch/arm64/kvm/arm.c
 +++ b/arch/arm64/kvm/arm.c
 @@ -580,6 +580,8 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu 
 *vcpu)
  
vcpu->arch.has_run_once = true;
  
 +  kvm_arm_vcpu_init_debug(vcpu);
 +
if (likely(irqchip_in_kernel(kvm))) {
/*
 * Map the VGIC hardware resources before running a vcpu the
 @@ -791,7 +793,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
}
  
kvm_arm_setup_debug(vcpu);
 -
>>> Spurious change?
>> Definitely, thank you for spotting it.
>>

Re: [PATCH v3 8/8] KVM: selftests: aarch64/vgic-v3 init sequence tests

2021-03-31 Thread Auger Eric
Hi Drew,

On 3/22/21 7:32 PM, Andrew Jones wrote:
> On Fri, Mar 12, 2021 at 06:32:02PM +0100, Eric Auger wrote:
>> The tests exercise the VGIC_V3 device creation including the
>> associated KVM_DEV_ARM_VGIC_GRP_ADDR group attributes:
>>
>> - KVM_VGIC_V3_ADDR_TYPE_DIST/REDIST
>> - KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION
>>
>> Some other tests dedicate to KVM_DEV_ARM_VGIC_GRP_REDIST_REGS group
>> and especially the GICR_TYPER read. The goal was to test the case
>> recently fixed by commit 23bde34771f1
>> ("KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace").
>>
>> The API under test can be found at
>> Documentation/virt/kvm/devices/arm-vgic-v3.rst
>>
>> Signed-off-by: Eric Auger 
>> ---
>>  arch/arm64/kvm/vgic/vgic-mmio-v3.c|   2 +-
>>  tools/testing/selftests/kvm/Makefile  |   1 +
>>  .../testing/selftests/kvm/aarch64/vgic_init.c | 672 ++
>>  .../testing/selftests/kvm/include/kvm_util.h  |   5 +
>>  tools/testing/selftests/kvm/lib/kvm_util.c|  51 ++
>>  5 files changed, 730 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/kvm/aarch64/vgic_init.c
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c 
>> b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
>> index 652998ed0b55..f6a7eed1d6ad 100644
>> --- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
>> +++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
>> @@ -260,7 +260,7 @@ static bool vgic_mmio_vcpu_rdist_is_last(struct kvm_vcpu 
>> *vcpu)
>>  if (!rdreg)
>>  return false;
>>  
>> -if (!rdreg->count && vgic_cpu->rdreg_index == (rdreg->count - 1)) {
>> +if (rdreg->count && vgic_cpu->rdreg_index == (rdreg->count - 1)) {
> 
> I guess this is an accidental change?
this change should be squashed into the previous patch
> 
>>  /* check whether there is no other contiguous rdist region */
>>  struct list_head *rd_regions = >rd_regions;
>>  struct vgic_redist_region *iter;
>> diff --git a/tools/testing/selftests/kvm/Makefile 
>> b/tools/testing/selftests/kvm/Makefile
>> index a6d61f451f88..4e548d7ab0ab 100644
>> --- a/tools/testing/selftests/kvm/Makefile
>> +++ b/tools/testing/selftests/kvm/Makefile
>> @@ -75,6 +75,7 @@ TEST_GEN_PROGS_x86_64 += steal_time
>>  
>>  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
>>  TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list-sve
>> +TEST_GEN_PROGS_aarch64 += aarch64/vgic_init
>>  TEST_GEN_PROGS_aarch64 += demand_paging_test
>>  TEST_GEN_PROGS_aarch64 += dirty_log_test
>>  TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
> 
> Missing .gitignore change
OK
> 
>> diff --git a/tools/testing/selftests/kvm/aarch64/vgic_init.c 
>> b/tools/testing/selftests/kvm/aarch64/vgic_init.c
>> new file mode 100644
>> index ..12205ab9fb10
>> --- /dev/null
>> +++ b/tools/testing/selftests/kvm/aarch64/vgic_init.c
>> @@ -0,0 +1,672 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * vgic init sequence tests
>> + *
>> + * Copyright (C) 2020, Red Hat, Inc.
>> + */
>> +#define _GNU_SOURCE
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "test_util.h"
>> +#include "kvm_util.h"
>> +#include "processor.h"
>> +
>> +#define NR_VCPUS4
>> +
>> +#define REDIST_REGION_ATTR_ADDR(count, base, flags, index) 
>> (((uint64_t)(count) << 52) | \
>> +((uint64_t)((base) >> 16) << 16) | ((uint64_t)(flags) << 12) | index)
>> +#define REG_OFFSET(vcpu, offset) (((uint64_t)vcpu << 32) | offset)
>> +
>> +#define GICR_TYPER 0x8
>> +
>> +/* helper to access a redistributor register */
>> +static int access_redist_reg(int gicv3_fd, int vcpu, int offset,
>> + uint32_t *val, bool write)
>> +{
>> +uint64_t attr = REG_OFFSET(vcpu, offset);
>> +
>> +return kvm_device_access(gicv3_fd, KVM_DEV_ARM_VGIC_GRP_REDIST_REGS,
>> + attr, val, write);
>> +}
>> +
>> +/* dummy guest code */
>> +static void guest_code(int cpu)
> 
> cpu is unused, no need for it
sure
> 
>> +{
>> +GUEST_SYNC(0);
>> +GUEST_SYNC(1);
>> +GUEST_SYNC(2);
>> +GUEST_DONE();
>> +}
>> +
>> +/* we don't want to assert on run execution, hence that helper */
>> +static int run_vcpu(struct kvm_vm *vm, uint32_t vcpuid)
>> +{
>> +static int run;
>> +struct ucall uc;
>> +int ret;
>> +
>> +vcpu_args_set(vm, vcpuid, 1, vcpuid);
> 
> The cpu index is unused, so you need to pass it in
removed
> 
>> +ret = _vcpu_ioctl(vm, vcpuid, KVM_RUN, NULL);
>> +get_ucall(vm, vcpuid, );
> 
> uc is unused, so you can pass NULL for it
OK
> 
>> +run++;
> 
> What good is this counter? Nobody reads it.
removed
> 
>> +
>> +if (ret)
>> +return -errno;
>> +return 0;
>> +}
>> +
>> +/**
>> + * Helper routine that performs KVM device tests in general and
>> + * especially ARM_VGIC_V3 ones. Eventually the ARM_VGIC_V3
>> + * device gets created, a legacy RDIST region is set at @0x0
>> + * and a DIST region is set @0x6
>> + */
>> +int fuzz_dist_rdist(struct kvm_vm *vm)
> 
> 

Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread David Hildenbrand

On 31.03.21 12:41, Steven Price wrote:

On 31/03/2021 10:32, David Hildenbrand wrote:

On 31.03.21 11:21, Catalin Marinas wrote:

On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:

On 30.03.21 12:30, Catalin Marinas wrote:

On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:

On 28/03/2021 13:21, Catalin Marinas wrote:

On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:

On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu
*vcpu, phys_addr_t fault_ipa,
     if (vma_pagesize == PAGE_SIZE && !force_pte)
     vma_pagesize = transparent_hugepage_adjust(memslot,
hva,
    , _ipa);
+
+    if (fault_status != FSC_PERM && kvm_has_mte(kvm) &&
pfn_valid(pfn)) {
+    /*
+ * VM will be able to see the page's tags, so we must
ensure
+ * they have been initialised. if PG_mte_tagged is set,
tags
+ * have already been initialised.
+ */
+    struct page *page = pfn_to_page(pfn);
+    unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+    for (i = 0; i < nr_pages; i++, page++) {
+    if (!test_and_set_bit(PG_mte_tagged, >flags))
+    mte_clear_page_tags(page_address(page));
+    }
+    }


This pfn_valid() check may be problematic. Following commit
eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it
returns
true for ZONE_DEVICE memory but such memory is allowed not to
support
MTE.


Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be
slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely
end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is
there a
way to detect what gets mapped in the guest as Normal Cacheable
memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE
(or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable
the
feature altogether.


In stage 2 I believe we only have two types of mapping - 'normal' or
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the
latter is a
case of checking the 'device' variable, and makes sense to avoid the
overhead you describe.

This should also guarantee that all stage-2 cacheable memory
supports tags,
as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid()
should only
be true for memory that Linux considers "normal".


If you think "normal" == "normal System RAM", that's wrong; see below.


By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).


Sadly there's no good standardised terminology here. Aarch64 provides
the "normal (cacheable)" definition. Memory which is mapped as "Normal
Cacheable" is implicitly MTE capable when shared with a guest (because
the stage 2 mappings don't allow restricting MTE other than mapping it
as Device memory).

So MTE also forces us to have a definition of memory which is "bog
standard memory"[1] separate from the mapping attributes. This is the
main memory which fully supports MTE.

Separate from the "bog standard" we have the "special"[1] memory, e.g.
ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
that memory may not support MTE tags. This memory can only be safely
shared with a guest in the following situations:

   1. MTE is completely disabled for the guest

   2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)

   3. We have some guarantee that guest MTE access are in some way safe.

(1) is the situation today (without this patch series). But it prevents
the guest from using MTE in any form.

(2) is pretty terrible for general memory, but is the get-out clause for
mapping devices into the guest.

(3) isn't something we have any architectural way of discovering. We'd
need to know what the device did with the MTE accesses (and any caches
between the CPU and the device) to ensure there aren't any side-channels
or h/w lockup issues. We'd also need some way of describing this memory
to the guest.

So at least for the time being the approach is to avoid letting a guest
with MTE enabled have access to this sort of memory.

[1] Neither "bog standard" nor "special" are real terms - like I said
there's a lack of standardised terminology.


That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So 

Re: [PATCH] KVM: arm64: Support PREL/PLT relocs in EL2 code

2021-03-31 Thread Marc Zyngier
On Wed, 31 Mar 2021 13:30:48 +, David Brazdil wrote:
> gen-hyprel tool parses object files of the EL2 portion of KVM
> and generates runtime relocation data. While only filtering for
> R_AARCH64_ABS64 relocations in the input object files, it has an
> allow-list of relocation types that are used for relative
> addressing. Other, unexpected, relocation types are rejected and
> cause the build to fail.
> 
> [...]

Applied to kvm-arm64/misc-5.13, thanks!

[1/1] KVM: arm64: Support PREL/PLT relocs in EL2 code
  commit: 77e06b300161d41d65950be9c77a785c142b381d

Cheers,

M.
-- 
Without deviation from the norm, progress is not possible.


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH] KVM: arm64: Elect Alexandru as a replacement for Julien as a reviewer

2021-03-31 Thread Alexandru Elisei
Hi Marc,

On 3/31/21 2:16 PM, Marc Zyngier wrote:
> Julien's bandwidth for KVM reviewing has been pretty low lately,
> and Alexandru has accepted to step in and help with the reviewing.
>
> Many thanks to both!

Happy to help!

Acked-by: Alexandru Elisei 

Thanks,

Alex

>
> Cc: Julien Thierry 
> Cc: Alexandru Elisei 
> Signed-off-by: Marc Zyngier 
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index aa84121c5611..803bd0551512 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9765,7 +9765,7 @@ F:  virt/kvm/*
>  KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
>  M:   Marc Zyngier 
>  R:   James Morse 
> -R:   Julien Thierry 
> +R:   Alexandru Elisei 
>  R:   Suzuki K Poulose 
>  L:   linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
>  L:   kvmarm@lists.cs.columbia.edu
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH] KVM: arm64: Elect Alexandru as a replacement for Julien as a reviewer

2021-03-31 Thread Marc Zyngier
Julien's bandwidth for KVM reviewing has been pretty low lately,
and Alexandru has accepted to step in and help with the reviewing.

Many thanks to both!

Cc: Julien Thierry 
Cc: Alexandru Elisei 
Signed-off-by: Marc Zyngier 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index aa84121c5611..803bd0551512 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9765,7 +9765,7 @@ F:virt/kvm/*
 KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
 M: Marc Zyngier 
 R: James Morse 
-R: Julien Thierry 
+R: Alexandru Elisei 
 R: Suzuki K Poulose 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
 L: kvmarm@lists.cs.columbia.edu
-- 
2.30.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2] KVM: arm64: Initialize VCPU mdcr_el2 before loading it

2021-03-31 Thread Alexandru Elisei
Hi Marc,

On 3/30/21 8:57 PM, Marc Zyngier wrote:
> On Tue, 30 Mar 2021 18:49:54 +0100,
> Alexandru Elisei  wrote:
>> Hi Marc,
>>
>> On 3/30/21 6:13 PM, Alexandru Elisei wrote:
>>> [..]
> +}
> +
>  /**
>   * kvm_arm_reset_debug_ptr - reset the debug ptr to point to the vcpu 
> state
>   */
> @@ -83,12 +137,7 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
>   * @vcpu:the vcpu pointer
>   *
>   * This is called before each entry into the hypervisor to setup any
> - * debug related registers. Currently this just ensures we will trap
> - * access to:
> - *  - Performance monitors (MDCR_EL2_TPM/MDCR_EL2_TPMCR)
> - *  - Debug ROM Address (MDCR_EL2_TDRA)
> - *  - OS related registers (MDCR_EL2_TDOSA)
> - *  - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB)
> + * debug related registers.
>   *
>   * Additionally, KVM only traps guest accesses to the debug registers if
>   * the guest is not actively using them (see the KVM_ARM64_DEBUG_DIRTY
> @@ -100,27 +149,14 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu)
>  
>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
>  {
> - bool trap_debug = !(vcpu->arch.flags & KVM_ARM64_DEBUG_DIRTY);
>   unsigned long mdscr, orig_mdcr_el2 = vcpu->arch.mdcr_el2;
>  
>   trace_kvm_arm_setup_debug(vcpu, vcpu->guest_debug);
>  
> - /*
> -  * This also clears MDCR_EL2_E2PB_MASK to disable guest access
> -  * to the profiling buffer.
> -  */
> - vcpu->arch.mdcr_el2 = __this_cpu_read(mdcr_el2) & MDCR_EL2_HPMN_MASK;
> - vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
> - MDCR_EL2_TPMS |
> - MDCR_EL2_TPMCR |
> - MDCR_EL2_TDRA |
> - MDCR_EL2_TDOSA);
> + kvm_arm_setup_mdcr_el2(vcpu, __this_cpu_read(mdcr_el2));
>  
>   /* Is Guest debugging in effect? */
>   if (vcpu->guest_debug) {
> - /* Route all software debug exceptions to EL2 */
> - vcpu->arch.mdcr_el2 |= MDCR_EL2_TDE;
> -
>   /* Save guest debug state */
>   save_guest_debug_regs(vcpu);
>  
> @@ -174,7 +210,6 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
>  
>   vcpu->arch.debug_ptr = >arch.external_debug_state;
>   vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;
> - trap_debug = true;
 There is something that slightly worries me here: there is now a
 disconnect between flagging debug as dirty and setting the
 trapping. And actually, you now check for KVM_ARM64_DEBUG_DIRTY and
 set the trap bits *before* setting the dirty bit itself.

 Here, I believe you end up with guest/host confusion of breakpoints,
 which isn't great. Or did I miss something?
>>> I'm sorry, but I don't understand what you mean. This is my understanding 
>>> of what
>>> is happening.
>>>
>>> Without this patch, trap_debug is set to true and the KVM_ARM64_DEBUG_DIRTY 
>>> flag
>>> is set if vcpu->guest_debug & KVM_GUESTDBG_USE_HW. Further down, trap debug 
>>> is
>>> only used when computing mdcr_el2.
>>>
>>> With this patch, trap_debug is set to true if vcpu->guest_debug &
>>> KVM_GUESTDBG_USE_HW and it's also used for computing mdcr_el2, but this 
>>> happens in
>>> kvm_arm_setup_mdcr_el2(), which is called at the start of 
>>> kvm_arm_setup_debug().
>>> The KVM_ARM_DEBUG_DIRTY flags is still set in kvm_arm_setup_debug() if
>>> vcpu->guest_debug & KVM_GUESTDBG_USE_HW, like before.
>>>
>>> The guest never runs with the value computed in kvm_vcpu_first_run_init() 
>>> unless
>>> it's identical with the value recomputed in kvm_arm_setup_debug().
>>>
>>> The only difference I see is that mdcr_el2 is computed at the start of
>>> kvm_arm_setup_debug(). I get the feeling I'm also missing something.
>> I think I understand what you mean, you are worried that we won't
>> set the bit in mdcr_el2 to trap debug in the same place where we set
>> the debug dirty flag.
> Yes, that's what I mean. The code is conceptually as such ATM:
>
>   debug_trap = (something based on vcpu->flags);
>   if (something else) {
>   check stuff;
>   vcpu->flags |= stuff;
>   debug_trap = true;
>   }
>
>   if (debug_trap)
>   set trap conditions;
>
> You are turning this into:
>
>   debug_trap = (something based on vcpu->flags);
>   if (debug_trap) {
>   set trap conditions;
>   }
>   if (something else) {
>   check stuff;
>   vcpu->flags |= stuff;
>   }
>
> which isn't the same thing. In your case, it probably works because of
> KVM_GUESTDBG_USE_HW, but that's really hard to follow, and we have had
> so many bugs in the debug code that it really needs to be kept as
> stupid as possible.
>
>> If that's the case, then I can move kvm_arm_setup_mdcr_el2 

Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread Steven Price

On 31/03/2021 10:32, David Hildenbrand wrote:

On 31.03.21 11:21, Catalin Marinas wrote:

On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:

On 30.03.21 12:30, Catalin Marinas wrote:

On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:

On 28/03/2021 13:21, Catalin Marinas wrote:

On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:

On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,

    if (vma_pagesize == PAGE_SIZE && !force_pte)
    vma_pagesize = transparent_hugepage_adjust(memslot, 
hva,

   , _ipa);
+
+    if (fault_status != FSC_PERM && kvm_has_mte(kvm) && 
pfn_valid(pfn)) {

+    /*
+ * VM will be able to see the page's tags, so we must 
ensure
+ * they have been initialised. if PG_mte_tagged is set, 
tags

+ * have already been initialised.
+ */
+    struct page *page = pfn_to_page(pfn);
+    unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+    for (i = 0; i < nr_pages; i++, page++) {
+    if (!test_and_set_bit(PG_mte_tagged, >flags))
+    mte_clear_page_tags(page_address(page));
+    }
+    }


This pfn_valid() check may be problematic. Following commit 
eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it 
returns
true for ZONE_DEVICE memory but such memory is allowed not to 
support

MTE.


Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be 
slightly

inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely 
end

up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is 
there a
way to detect what gets mapped in the guest as Normal Cacheable 
memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE 
(or

something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable 
the

feature altogether.


In stage 2 I believe we only have two types of mapping - 'normal' or
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the 
latter is a

case of checking the 'device' variable, and makes sense to avoid the
overhead you describe.

This should also guarantee that all stage-2 cacheable memory 
supports tags,
as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() 
should only

be true for memory that Linux considers "normal".


If you think "normal" == "normal System RAM", that's wrong; see below.


By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).


Sadly there's no good standardised terminology here. Aarch64 provides 
the "normal (cacheable)" definition. Memory which is mapped as "Normal 
Cacheable" is implicitly MTE capable when shared with a guest (because 
the stage 2 mappings don't allow restricting MTE other than mapping it 
as Device memory).


So MTE also forces us to have a definition of memory which is "bog 
standard memory"[1] separate from the mapping attributes. This is the 
main memory which fully supports MTE.


Separate from the "bog standard" we have the "special"[1] memory, e.g. 
ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but 
that memory may not support MTE tags. This memory can only be safely 
shared with a guest in the following situations:


 1. MTE is completely disabled for the guest

 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)

 3. We have some guarantee that guest MTE access are in some way safe.

(1) is the situation today (without this patch series). But it prevents 
the guest from using MTE in any form.


(2) is pretty terrible for general memory, but is the get-out clause for 
mapping devices into the guest.


(3) isn't something we have any architectural way of discovering. We'd 
need to know what the device did with the MTE accesses (and any caches 
between the CPU and the device) to ensure there aren't any side-channels 
or h/w lockup issues. We'd also need some way of describing this memory 
to the guest.


So at least for the time being the approach is to avoid letting a guest 
with MTE enabled have access to this sort of memory.


[1] Neither "bog standard" nor "special" are real terms - like I said 
there's a lack of standardised terminology.



That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So 

Re: [PATCH v10 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-03-31 Thread Steven Price

On 30/03/2021 11:13, Catalin Marinas wrote:

On Mon, Mar 29, 2021 at 04:55:29PM +0100, Steven Price wrote:

On 26/03/2021 18:56, Catalin Marinas wrote:

On Fri, Mar 12, 2021 at 03:18:57PM +, Steven Price wrote:

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages which are !pte_valid_user() as these will
not have been swapped out.

Signed-off-by: Steven Price 
---
   arch/arm64/include/asm/pgtable.h |  2 +-
   arch/arm64/kernel/mte.c  | 16 
   2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..84166625c989 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
__sync_icache_dcache(pte);
if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+   pte_present(pte) && pte_valid_user(pte) && !pte_special(pte))
mte_sync_tags(ptep, pte);


With the EPAN patches queued in for-next/epan, pte_valid_user()
disappeared as its semantics weren't very clear.


Thanks for pointing that out.


So this relies on the set_pte_at() being done on the VMM address space.
I wonder, if the VMM did an mprotect(PROT_NONE), can the VM still access
it via stage 2? If yes, the pte_valid_user() test wouldn't work. We need
something like pte_present() && addr <= user_addr_max().


AFAIUI the stage 2 matches the VMM's address space (for the subset that has
memslots). So mprotect(PROT_NONE) would cause the stage 2 mapping to be
invalidated and a subsequent fault would exit to the VMM to sort out. This
sort of thing is done for the lazy migration use case (i.e. pages are
fetched as the VM tries to access them).


There's also the protected KVM case which IIUC wouldn't provide any
mapping of the guest memory to the host (or maybe the host still thinks
it's there but cannot access it without a Stage 2 fault). At least in
this case it wouldn't swap pages out and it would be the responsibility
of the EL2 code to clear the tags when giving pages to the guest
(user_mem_abort() must not touch the page).

So basically we either have a valid, accessible mapping in the VMM and
we can handle the tags via set_pte_at() or we leave it to whatever is
running at EL2 in the pKVM case.


For the pKVM case it's up to the EL2 code to hand over suitably scrubbed 
pages to the guest, and the host doesn't have access to the pages so we 
(currently) don't have to worry about swap. If swap get implemented it 
will presumably be up to the EL2 code to package up both the normal data 
and the MTE tags into an encrypted bundle for the host to stash somewhere.



I don't remember whether we had a clear conclusion in the past: have we
ruled out requiring the VMM to map the guest memory with PROT_MTE
entirely? IIRC a potential problem was the VMM using MTE itself and
having to disable it when accessing the guest memory.


Yes, there are some ugly corner cases if we require the VMM to map with 
PROT_MTE. Hence patch 5 - an ioctl to allow the VMM to access the tags 
without having to maintain a PROT_MTE mapping.



Another potential issue (I haven't got my head around it yet) is a race
in mte_sync_tags() as we now defer the PG_mte_tagged bit setting until
after the tags had been restored. Can we have the same page mapped by
two ptes, each attempting to restore it from swap and one gets it first
and starts modifying it? Given that we set the actual pte after setting
PG_mte_tagged, it's probably alright but I think we miss some barriers.


I'm not sure if I've got my head round this one yet either, but you 
could be right there's a race. This exists without these patches:


CPU 1|  CPU 2
-+-
set_pte_at() |
--> mte_sync_tags()  |
--> test_and_set_bit()   |
--> mte_sync_page_tags() | set_pte_at()
   [stalls/sleeps]   | --> mte_sync_tags()
 | --> test_and_set_bit()
 | [already set by CPU 1]
 | set_pte()
 | [sees stale tags]
   [eventually wakes up  |
and sets tags]   |

What I'm struggling to get my head around is whether there's always a 
sufficient lock held during the call to set_pte_at() to avoid the above. 
I suspect not because the two calls could be in completely separate 
processes.


We potentially could stick a lock_page()/unlock_page() sequence in 
mte_sync_tags(). I just ran a basic test and didn't hit problems with 
that. Any thoughts?



Also, if a page is not a swap one, we currently clear the tags if mapped
as pte_tagged() (prior to this patch). We'd need 

Re: [PATCH] target/arm: Make number of counters in PMCR follow the CPU

2021-03-31 Thread Peter Maydell
On Wed, 31 Mar 2021 at 09:59, Zenghui Yu  wrote:
>
> [+kvmarm, Marc]
>
> On 2021/3/12 0:59, Peter Maydell wrote:
> > Currently we give all the v7-and-up CPUs a PMU with 4 counters.  This
> > means that we don't provide the 6 counters that are required by the
> > Arm BSA (Base System Architecture) specification if the CPU supports
> > the Virtualization extensions.

> So I've tested it with kvm and I get the following error before
> VM startup:
>
>"qemu-system-aarch64: Failed to retrieve host CPU features"
>
> > ---
> >   target/arm/cpu.h |  1 +
> >   target/arm/cpu64.c   |  3 +++
> >   target/arm/cpu_tcg.c |  5 +
> >   target/arm/helper.c  | 29 +
> >   target/arm/kvm64.c   |  2 ++
> >   5 files changed, 28 insertions(+), 12 deletions(-)
>
> [...]
>
> > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> > index dff85f6db94..581335e49d3 100644
> > --- a/target/arm/kvm64.c
> > +++ b/target/arm/kvm64.c
> > @@ -566,6 +566,8 @@ bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures 
> > *ahcf)
> > ARM64_SYS_REG(3, 0, 0, 7, 1));
> >   err |= read_sys_reg64(fdarray[2], >isar.id_aa64mmfr2,
> > ARM64_SYS_REG(3, 0, 0, 7, 2));
> > +err |= read_sys_reg64(fdarray[2], >isar.reset_pmcr_el0,
> > +  ARM64_SYS_REG(3, 3, 9, 12, 0));
>
> Looks like we tried to access PMCR_EL0 *before* telling kvm that
> KVM_ARM_VCPU_PMU_V3 feature should be supported, which is now
> refused by kvm [*].
>
> [*] https://git.kernel.org/torvalds/c/11663111cd49

Oops, that's embarrassing. I should have tested with KVM, and
I forgot the PMU needs special enablement :-(

I'm on holiday right now so I probably won't have time to
look at a fix for rc2. I might just revert this commit.

thanks
-- PMM
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers

2021-03-31 Thread Paolo Bonzini

On 31/03/21 11:34, Marc Zyngier wrote:



Queued and 1-9 and 18, thanks.  There's a small issue in patch 10 that
prevented me from committing 10-15, but they mostly look good.


Can you please push the resulting merge somewhere?

I'm concerned that it will conflict in interesting way with other stuff
that is on its way on the arm64 side, not to mentiobn that this hasn't
been tested at all on anything but x86 (and given the series was posted
on Friday, that's a bit of a short notice).


Yes, I will push it shortly to kvm/queue.  Note that the patches I have 
pushed are x86 only apart from changes to tracepoints.  The rest will 
certainly need a lot more baking, which is also why I got rid quickly of 
the easy ones.


Paolo

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH -next] KVM: arm64: Make symbol '_kvm_host_prot_finalize' static

2021-03-31 Thread Xu Jia
The sparse tool complains as follows:

arch/arm64/kvm/arm.c:1900:6: warning:
 symbol '_kvm_host_prot_finalize' was not declared. Should it be static?

This symbol is not used outside of arm.c, so this
commit marks it static.

Reported-by: Hulk Robot 
Signed-off-by: Xu Jia 
---
 arch/arm64/kvm/arm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2835400fd298..76a7de16e5a6 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1897,7 +1897,7 @@ static int init_hyp_mode(void)
return err;
 }
 
-void _kvm_host_prot_finalize(void *discard)
+static void _kvm_host_prot_finalize(void *discard)
 {
WARN_ON(kvm_call_hyp_nvhe(__pkvm_prot_finalize));
 }

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers

2021-03-31 Thread Marc Zyngier

On 2021-03-31 08:57, Paolo Bonzini wrote:


Queued and 1-9 and 18, thanks.  There's a small issue in patch 10 that
prevented me from committing 10-15, but they mostly look good.


Can you please push the resulting merge somewhere?

I'm concerned that it will conflict in interesting way with other stuff
that is on its way on the arm64 side, not to mentiobn that this hasn't
been tested at all on anything but x86 (and given the series was posted
on Friday, that's a bit of a short notice).

Thanks,

M.
--
Jazz is not dead. It just smells funny...
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread David Hildenbrand

On 31.03.21 11:21, Catalin Marinas wrote:

On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:

On 30.03.21 12:30, Catalin Marinas wrote:

On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:

On 28/03/2021 13:21, Catalin Marinas wrote:

On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:

On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
   , _ipa);
+
+   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
+   /*
+* VM will be able to see the page's tags, so we must ensure
+* they have been initialised. if PG_mte_tagged is set, tags
+* have already been initialised.
+*/
+   struct page *page = pfn_to_page(pfn);
+   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_and_set_bit(PG_mte_tagged, >flags))
+   mte_clear_page_tags(page_address(page));
+   }
+   }


This pfn_valid() check may be problematic. Following commit eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
true for ZONE_DEVICE memory but such memory is allowed not to support
MTE.


Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
way to detect what gets mapped in the guest as Normal Cacheable memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable the
feature altogether.


In stage 2 I believe we only have two types of mapping - 'normal' or
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
case of checking the 'device' variable, and makes sense to avoid the
overhead you describe.

This should also guarantee that all stage-2 cacheable memory supports tags,
as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
be true for memory that Linux considers "normal".


If you think "normal" == "normal System RAM", that's wrong; see below.


By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).


That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So kvm_is_device_pfn() is false for such memory and it may be mapped as
Normal but it is not guaranteed to support tagging.


pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
touch the page, you won't fault. So Anshuman's commit is correct.


I agree.


pfn_to_online_page() means, "there is a struct page and it's system RAM
that's in use; the memmap has a sane content"


Does pfn_to_online_page() return a valid struct page pointer for
ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
some definition of system RAM (I assume NVDIMM != system RAM). For
example, pmem_attach_disk() calls devm_memremap_pages() and this would
use the Normal Cacheable memory attribute without necessarily being
system RAM.


No, not for ZONE_DEVICE.

However, if you expose PMEM via dax/kmem as System RAM to the system (-> 
add_memory_driver_managed()), then PMEM (managed via ZONE_NORMAL or 
ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system 
thinks it's "ordinary system RAM" and the memory is managed by the buddy.




So if pfn_valid() is not equivalent to system RAM, we have a potential
issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
this issue and we may need a new term to describe MTE-safe memory. In
the kernel we assume that all pages that can be mapped as MAP_ANONYMOUS
are MTE-safe, and I don't think these include ZONE_DEVICE pages.

Thanks.




--
Thanks,

David / dhildenb



Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread Catalin Marinas
On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
> On 30.03.21 12:30, Catalin Marinas wrote:
> > On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
> > > On 28/03/2021 13:21, Catalin Marinas wrote:
> > > > On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:
> > > > > On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:
> > > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > > index 77cb2d28f2a4..b31b7a821f90 100644
> > > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > > @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu 
> > > > > > *vcpu, phys_addr_t fault_ipa,
> > > > > > if (vma_pagesize == PAGE_SIZE && !force_pte)
> > > > > > vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> > > > > >    &pfn, &fault_ipa);
> > > > > > +
> > > > > > +   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && 
> > > > > > pfn_valid(pfn)) {
> > > > > > +   /*
> > > > > > +* VM will be able to see the page's tags, so we must 
> > > > > > ensure
> > > > > > +* they have been initialised. if PG_mte_tagged is set, 
> > > > > > tags
> > > > > > +* have already been initialised.
> > > > > > +*/
> > > > > > +   struct page *page = pfn_to_page(pfn);
> > > > > > +   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
> > > > > > +
> > > > > > +   for (i = 0; i < nr_pages; i++, page++) {
> > > > > > +   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
> > > > > > +   mte_clear_page_tags(page_address(page));
> > > > > > +   }
> > > > > > +   }
> > > > > 
> > > > > This pfn_valid() check may be problematic. Following commit 
> > > > > eeb0753ba27b
> > > > > ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
> > > > > true for ZONE_DEVICE memory but such memory is allowed not to support
> > > > > MTE.
> > > > 
> > > > Some more thinking, this should be safe as any ZONE_DEVICE would be
> > > > mapped as untagged memory in the kernel linear map. It could be slightly
> > > > inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
> > > > untagged memory. Another overhead is pfn_valid() which will likely end
> > > > up calling memblock_is_map_memory().
> > > > 
> > > > However, the bigger issue is that Stage 2 cannot disable tagging for
> > > > Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
> > > > way to detect what gets mapped in the guest as Normal Cacheable memory
> > > > and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
> > > > something else like on-chip memory)?  If we can't guarantee that all
> > > > Cacheable memory given to a guest supports tags, we should disable the
> > > > feature altogether.
> > > 
> > > In stage 2 I believe we only have two types of mapping - 'normal' or
> > > DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter 
> > > is a
> > > case of checking the 'device' variable, and makes sense to avoid the
> > > overhead you describe.
> > > 
> > > This should also guarantee that all stage-2 cacheable memory supports 
> > > tags,
> > > as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
> > > be true for memory that Linux considers "normal".
> 
> If you think "normal" == "normal System RAM", that's wrong; see below.

By "normal" I think both Steven and I meant the Normal Cacheable memory
attribute (another being the Device memory attribute).

> > That's the problem. With Anshuman's commit I mentioned above,
> > pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
> > memory, not talking about some I/O mapping that requires Device_nGnRE).
> > So kvm_is_device_pfn() is false for such memory and it may be mapped as
> > Normal but it is not guaranteed to support tagging.
> 
> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
> touch the page, you won't fault. So Anshuman's commit is correct.

I agree.

> pfn_to_online_page() means, "there is a struct page and it's system RAM
> that's in use; the memmap has a sane content"

Does pfn_to_online_page() return a valid struct page pointer for
ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
some definition of system RAM (I assume NVDIMM != system RAM). For
example, pmem_attach_disk() calls devm_memremap_pages() and this would
use the Normal Cacheable memory attribute without necessarily being
system RAM.

So if pfn_valid() is not equivalent to system RAM, we have a potential
issue with MTE. Even if "system RAM" includes NVDIMMs, we still have
this issue and we may need a new term to describe MTE-safe memory. In
the kernel we assume that all pages that can be mapped as MAP_ANONYMOUS
are MTE-safe, and I don't think these include ZONE_DEVICE pages.

Thanks.

-- 
Catalin

Re: [RFC PATCH v2 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-03-31 Thread Keqian Zhu
Kind ping...

On 2021/3/16 21:43, Keqian Zhu wrote:
> Hi all,
> 
> We have two paths to build stage2 mapping for MMIO regions:
> 
> the creation time's path and the stage2 fault path.
> 
> Patch#1 removes the creation time's mapping of MMIO regions
> Patch#2 tries stage2 block mapping for host device MMIO at fault path
> 
> Thanks,
> Keqian
> 
> Keqian Zhu (2):
>   kvm/arm64: Remove the creation time's mapping of MMIO regions
>   kvm/arm64: Try stage2 block mapping for host device MMIO
> 
>  arch/arm64/kvm/mmu.c | 80 +++-
>  1 file changed, 41 insertions(+), 39 deletions(-)
> 


Re: [PATCH] target/arm: Make number of counters in PMCR follow the CPU

2021-03-31 Thread Zenghui Yu

[+kvmarm, Marc]

On 2021/3/12 0:59, Peter Maydell wrote:

Currently we give all the v7-and-up CPUs a PMU with 4 counters.  This
means that we don't provide the 6 counters that are required by the
Arm BSA (Base System Architecture) specification if the CPU supports
the Virtualization extensions.

Instead of having a single PMCR_NUM_COUNTERS, make each CPU type
specify the PMCR reset value (obtained from the appropriate TRM), and
use the 'N' field of that value to define the number of counters
provided.

This means that we now supply 6 counters for Cortex-A53, A57, A72,
A15 and A9 as well as '-cpu max'; Cortex-A7 and A8 stay at 4; and
Cortex-R5 goes down to 3.

Note that because we now use the PMCR reset value of the specific
implementation, we no longer set the LC bit out of reset.  This has
an UNKNOWN value out of reset for all cores with any AArch32 support,
so guest software should be setting it anyway if it wants it.

Signed-off-by: Peter Maydell 
---
This is pretty much untested (I just checked Linux still boots;
haven't tried it with KVM either). It's an alternative to
just bumping PMCR_NUM_COUNTERS to 6.
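
For reference, the 'N' extraction that the commit message relies on is a
simple bit-field pull; a sketch (PMCR_EL0.N lives in bits [15:11]):

#include <stdint.h>

#define PMCRN_SHIFT	11
#define PMCRN_MASK	(0x1fu << PMCRN_SHIFT)	/* PMCR_EL0.N, bits [15:11] */

/* Number of event counters advertised by a PMCR reset value. */
static inline unsigned int pmcr_num_counters(uint64_t reset_pmcr)
{
	return (reset_pmcr & PMCRN_MASK) >> PMCRN_SHIFT;
}

For example, a reset value of 0x41013000 has N = 6, while 0x41012000
would have N = 4.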


So I've tested it with kvm and I get the following error before
VM startup:

  "qemu-system-aarch64: Failed to retrieve host CPU features"


---
  target/arm/cpu.h |  1 +
  target/arm/cpu64.c   |  3 +++
  target/arm/cpu_tcg.c |  5 +
  target/arm/helper.c  | 29 +
  target/arm/kvm64.c   |  2 ++
  5 files changed, 28 insertions(+), 12 deletions(-)


[...]


diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index dff85f6db94..581335e49d3 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -566,6 +566,8 @@ bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
ARM64_SYS_REG(3, 0, 0, 7, 1));
  err |= read_sys_reg64(fdarray[2], &ahcf->isar.id_aa64mmfr2,
ARM64_SYS_REG(3, 0, 0, 7, 2));
+err |= read_sys_reg64(fdarray[2], &ahcf->isar.reset_pmcr_el0,
+  ARM64_SYS_REG(3, 3, 9, 12, 0));


Looks like we tried to access PMCR_EL0 *before* telling kvm that the
KVM_ARM_VCPU_PMU_V3 feature should be enabled, which is now
refused by kvm [*].

[*] https://git.kernel.org/torvalds/c/11663111cd49
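
A sketch of the direction a fix could take on the QEMU side (the function
and field names follow the current code, but treat them as assumptions;
this is not a tested patch): request the PMU feature for the scratch vCPU
before touching PMCR_EL0, and skip the read when the host has no PMU.

    struct kvm_vcpu_init init = { .target = -1, };
    bool pmu_supported = false;

    /*
     * Ask for KVM_ARM_VCPU_PMU_V3 on the scratch vCPU so that newer
     * kernels, which refuse PMU register accesses on vCPUs without the
     * PMU feature, still let us read the PMCR_EL0 reset value.
     */
    if (kvm_arm_pmu_supported()) {
        init.features[0] |= 1 << KVM_ARM_VCPU_PMU_V3;
        pmu_supported = true;
    }

    if (!kvm_arm_create_scratch_host_vcpu(cpus_to_try, fdarray, &init)) {
        return false;
    }

    /* ... existing ID register reads ... */

    if (pmu_supported) {
        err |= read_sys_reg64(fdarray[2], &ahcf->isar.reset_pmcr_el0,
                              ARM64_SYS_REG(3, 3, 9, 12, 0));
    }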


Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 26/03/21 03:19, Sean Christopherson wrote:

+   /*
+* Reset the lock used to prevent memslot updates between MMU notifier
+* range_start and range_end.  At this point no more MMU notifiers will
+* run, but the lock could still be held if KVM's notifier was removed
+* between range_start and range_end.  No threads can be waiting on the
+* lock as the last reference on KVM has been dropped.  If the lock is
+* still held, freeing memslots will deadlock.
+*/
+   init_rwsem(&kvm->mmu_notifier_slots_lock);


I was going to say that this is nasty, then I noticed that 
mmu_notifier_unregister uses SRCU to ensure completion of concurrent 
calls to the MMU notifier.  So I guess it's fine, but it's better to 
point it out:


/*
 * At this point no more MMU notifiers will run and pending
 * calls to range_start have completed, but the lock would
 * still be held and never released if the MMU notifier was
 * removed between range_start and range_end.  Since the last
 * reference to the struct kvm has been dropped, no threads can
 * be waiting on the lock, but we might still end up taking it
 * when freeing memslots in kvm_arch_destroy_vm.  Reset the lock
 * to avoid deadlocks.
 */

That said, the easiest way to avoid this would be to always update 
mmu_notifier_count.  I don't mind the rwsem, but at least I suggest that 
you split the patch in two---the first one keeping the 
mmu_notifier_count update unconditional, and the second one introducing 
the rwsem and the on_lock function kvm_inc_notifier_count.  Please 
document the new lock in Documentation/virt/kvm/locking.rst too.
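
As a sketch of the on_lock callback mentioned above (the real version
would also have to widen mmu_notifier_range_start/end with min()/max()
when the count is already elevated, rather than blindly overwriting them):

static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
				   unsigned long end)
{
	/* Called with mmu_lock already held by __kvm_handle_hva_range(). */
	kvm->mmu_notifier_count++;
	kvm->mmu_notifier_range_start = start;
	kvm->mmu_notifier_range_end = end;
}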


Also, related to the first part of the series, perhaps you could 
structure the series in a slightly different way:


1) introduce the HVA walking API in common code, complete with on_lock 
and patch 15, so that you can use on_lock to increase mmu_notifier_seq


2) then migrate all architectures including x86 to the new API

IOW, first half of patch 10 and all of patch 15; then the second half of 
patch 10; then patches 11-14.



+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   down_write(&kvm->mmu_notifier_slots_lock);
+#endif
rcu_assign_pointer(kvm->memslots[as_id], slots);
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   up_write(&kvm->mmu_notifier_slots_lock);
+#endif


Please do this unconditionally; the cost is minimal if the rwsem is not 
contended (as is the case if the architecture doesn't use MMU notifiers 
at all).
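
I.e. something along these lines in install_new_memslots(), assuming the
rwsem itself is declared unconditionally in struct kvm (sketch only):

	/*
	 * The write side is only taken on memslot updates, so an
	 * uncontended rwsem costs next to nothing even on architectures
	 * that never take it on the notifier side.
	 */
	down_write(&kvm->mmu_notifier_slots_lock);
	rcu_assign_pointer(kvm->memslots[as_id], slots);
	up_write(&kvm->mmu_notifier_slots_lock);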


Paolo



Re: [PATCH -next] KVM: arm64: Make symbol '_kvm_host_prot_finalize' static

2021-03-31 Thread Marc Zyngier
On Wed, 31 Mar 2021 15:36:19 +0800, Xu Jia wrote:
> The sparse tool complains as follows:
> 
> arch/arm64/kvm/arm.c:1900:6: warning:
>  symbol '_kvm_host_prot_finalize' was not declared. Should it be static?
> 
> This symbol is not used outside of arm.c, so this
> commit marks it static.

Applied to kvm-arm64/host-stage2, thanks!

[1/1] KVM: arm64: Make symbol '_kvm_host_prot_finalize' static
  commit: b1306fef1f48c0af1d659c18c53cf275fdcc94be

Cheers,

M.
-- 
Without deviation from the norm, progress is not possible.




Re: [PATCH 00/18] KVM: Consolidate and optimize MMU notifiers

2021-03-31 Thread Paolo Bonzini

On 26/03/21 03:19, Sean Christopherson wrote:

The end goal of this series is to optimize the MMU notifiers to take
mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
sensitive to mmu_lock being taken for write at inopportune times, and
such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
page shenanigans.  The vast majority of notifications for these VMs will
be spurious (for KVM), and eliding mmu_lock for spurious notifications
avoids an otherwise unacceptable disruption to the guest.

To get there without potentially degrading performance, e.g. due to
multiple memslot lookups, especially on non-x86 where the use cases are
largely unknown (from my perspective), first consolidate the MMU notifier
logic by moving the hva->gfn lookups into common KVM.

Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
with the TDP MMU changes in this series.  That code applies on kvm/queue
(commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
shadow paging").

Speaking of conflicts, Ben will soon be posting a series to convert a
bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
will be an absurd number of conflicts; Ben and I will sort out the
conflicts in whichever series loses the race.

Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
PPC e500, and s390.  Absolutely needs to be tested for real on non-x86;
I give it even odds that I introduced an off-by-one bug somewhere.

[*] https://lkml.kernel.org/r/20210325200119.1359384-1-sea...@google.com


Patches 1-7 are x86 specific prep patches to play nice with moving
the hva->gfn memslot lookups into common code.  There ended up being waaay
more of these than I expected/wanted, but I had a hell of a time getting
the flushing logic right when shuffling the memslot and address space
loops.  In the end, I was more confident I got things correct by batching
the flushes.

Patch 8 moves the existing API prototypes into common code.  It could
technically be dropped since the old APIs are gone in the end, but I
thought the switch to the new APIs would suck a bit less this way.

Patch 9 moves arm64's MMU notifier tracepoints into common code so that
they are not lost when arm64 is converted to the new APIs, and so that all
architectures can benefit.

Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
because I could actually test it.  All architectures use nearly identical
code, so I don't think it actually matters in the end.

Patches 11-13 move arm64, MIPS, and PPC to the new APIs.

Patch 14 yanks out the old APIs.

Patch 15 adds the mmu_lock elision, but only for unpaired notifications.

Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
This is quite nasty and no small part of me thinks the patch should be
burned with fire (I won't spoil it any further), but it's also the most
problematic scenario for our particular use case.  :-/

Patches 17-18 are additional x86 cleanups.


Queued 1-9 and 18, thanks.  There's a small issue in patch 10 that 
prevented me from committing 10-15, but they mostly look good.


Paolo



Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code

2021-03-31 Thread Paolo Bonzini

On 26/03/21 03:19, Sean Christopherson wrote:

+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+   kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+#else
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int idx;
trace_kvm_set_spte_hva(address);
 
	idx = srcu_read_lock(&kvm->srcu);


KVM_MMU_LOCK(kvm);

kvm->mmu_notifier_seq++;

if (kvm_set_spte_hva(kvm, address, pte))
kvm_flush_remote_tlbs(kvm);

KVM_MMU_UNLOCK(kvm);
	srcu_read_unlock(&kvm->srcu, idx);
+#endif


The kvm->mmu_notifier_seq update is missing on the new API side.  I guess you 
can add an argument to __kvm_handle_hva_range and handle it also in 
patch 15 ("KVM: Take mmu_lock when handling MMU notifier iff the hva 
hits a memslot").


Paolo



Re: [PATCH 16/18] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-03-31 Thread Paolo Bonzini

On 26/03/21 03:19, Sean Christopherson wrote:

Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
notifications.  Because mmu_notifier_count must be modified while holding
mmu_lock for write, and must always be paired across start->end to stay
balanced, lock elision must happen in both or none.  To meet that
requirement, add a rwsem to prevent memslot updates across range_start()
and range_end().

For notifiers that disallow blocking, e.g. OOM reaping, simply go down
the slow path of unconditionally acquiring mmu_lock.  The sane
alternative would be to try to acquire the lock and force the notifier
to retry on failure.  But since OOM is currently the _only_ scenario
where blocking is disallowed attempting to optimize a guest that has been
marked for death is pointless.

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.

Based heavily on code from Ben Gardon.

Suggested-by: Ben Gardon 
Signed-off-by: Sean Christopherson 


Please submit this as a separate patch.

Paolo


---
  include/linux/kvm_host.h |   8 +-
  virt/kvm/kvm_main.c  | 174 ++-
  2 files changed, 142 insertions(+), 40 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 40ac2d40bb5a..2cc0f87d936e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -523,6 +523,7 @@ struct kvm {
long mmu_notifier_count;
unsigned long mmu_notifier_range_start;
unsigned long mmu_notifier_range_end;
+   struct rw_semaphore mmu_notifier_slots_lock;
  #endif
long tlbs_dirty;
struct list_head devices;
@@ -660,8 +661,11 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
  {
	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
	return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
-			lockdep_is_held(&kvm->slots_lock) ||
-			!refcount_read(&kvm->users_count));
+			lockdep_is_held(&kvm->slots_lock) ||
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+			lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+#endif
+			!refcount_read(&kvm->users_count));
  }
  
  static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0c2aff8a4aa1..9ebc6d3e4a21 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -453,20 +453,56 @@ static void kvm_mmu_notifier_invalidate_range(struct 
mmu_notifier *mn,
  
  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
  
+typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
+			     unsigned long end);
+
  struct kvm_hva_range {
unsigned long start;
unsigned long end;
pte_t pte;
hva_handler_t handler;
-   bool caller_locked;
+   on_lock_fn_t on_lock;
+   bool must_lock;
bool flush_on_ret;
bool may_block;
  };
  
+/*

+ * Use a dedicated stub instead of NULL to indicate that there is no callback
+ * function/handler.  The compiler technically can't guarantee that a real
+ * function will have a non-zero address, and so it will generate code to
+ * check for !NULL, whereas comparing against a stub will be elided at compile
+ * time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
+ */
+static void kvm_null_fn(void)
+{
+
+}
+#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
+
+
+/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
+static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
+							   const struct kvm_hva_range *range,
+  bool *locked)
+{
+   if (*locked)
+   return false;
+
+   *locked = true;
+
+   KVM_MMU_LOCK(kvm);
+
+   if (!IS_KVM_NULL_FN(range->on_lock))
+   range->on_lock(kvm, range->start, range->end);
+
+   return IS_KVM_NULL_FN(range->handler);
+}
+
  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
						  const struct kvm_hva_range *range)
  {
-   bool ret = false, locked = range->caller_locked;
+   bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
@@ -474,6 +510,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  
  	idx = srcu_read_lock(&kvm->srcu);
  
+	if (range->must_lock &&
+	    kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+   goto out_unlock;
+
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
   

Re: [PATCH 12/18] KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks

2021-03-31 Thread Paolo Bonzini

On 26/03/21 03:19, Sean Christopherson wrote:

Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code, and whose code is nearly identical to MIPS'
lookup.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson 


I'll post a couple patches to enable more coalescing of the flushes, but 
this particular patch is okay.


Paolo


---
  arch/mips/include/asm/kvm_host.h |  1 +
  arch/mips/kvm/mmu.c  | 97 ++--
  2 files changed, 17 insertions(+), 81 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index feaa77036b67..374a3c8806e8 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,6 +967,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
   bool write);
  
  #define KVM_ARCH_WANT_MMU_NOTIFIER

+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
  
  /* Emulation */

  int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 3dabeda82458..3dc885df2e32 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -439,85 +439,36 @@ static int kvm_mips_mkold_gpa_pt(struct kvm *kvm, gfn_t start_gfn,
  end_gfn << PAGE_SHIFT);
  }
  
-static int handle_hva_to_gpa(struct kvm *kvm,

-unsigned long start,
-unsigned long end,
-int (*handler)(struct kvm *kvm, gfn_t gfn,
-   gpa_t gfn_end,
-   struct kvm_memory_slot *memslot,
-   void *data),
-void *data)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
  {
-   struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
-   int ret = 0;
-
-   slots = kvm_memslots(kvm);
-
-   /* we only care about the pages that the guest sees */
-   kvm_for_each_memslot(memslot, slots) {
-   unsigned long hva_start, hva_end;
-   gfn_t gfn, gfn_end;
-
-   hva_start = max(start, memslot->userspace_addr);
-   hva_end = min(end, memslot->userspace_addr +
-   (memslot->npages << PAGE_SHIFT));
-   if (hva_start >= hva_end)
-   continue;
-
-   /*
-* {gfn(page) | page intersects with [hva_start, hva_end)} =
-* {gfn_start, gfn_start+1, ..., gfn_end-1}.
-*/
-   gfn = hva_to_gfn_memslot(hva_start, memslot);
-   gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-   ret |= handler(kvm, gfn, gfn_end, memslot, data);
-   }
-
-   return ret;
-}
-
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-struct kvm_memory_slot *memslot, void *data)
-{
-   kvm_mips_flush_gpa_pt(kvm, gfn, gfn_end);
-   return 1;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-   unsigned flags)
-{
-   handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
+   kvm_mips_flush_gpa_pt(kvm, range->start, range->end);
  
  	kvm_mips_callbacks->flush_shadow_all(kvm);

return 0;
  }
  
-static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,

-   struct kvm_memory_slot *memslot, void *data)
+static bool __kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
  {
-   gpa_t gpa = gfn << PAGE_SHIFT;
-   pte_t hva_pte = *(pte_t *)data;
+   gpa_t gpa = range->start << PAGE_SHIFT;
+   pte_t hva_pte = range->pte;
pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
pte_t old_pte;
  
  	if (!gpa_pte)

-   return 0;
+   return false;
  
  	/* Mapping may need adjusting depending on memslot flags */

old_pte = *gpa_pte;
-   if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
+   if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
hva_pte = pte_mkclean(hva_pte);
-   else if (memslot->flags & KVM_MEM_READONLY)
+   else if (range->slot->flags & KVM_MEM_READONLY)
hva_pte = pte_wrprotect(hva_pte);
  
  	set_pte(gpa_pte, hva_pte);
  
  	/* Replacing an absent or old page doesn't need flushes */

if (!pte_present(old_pte) || !pte_young(old_pte))
-   return 0;
+   return false;
  
  	/* Pages swapped, aged, moved, or cleaned require flushes */

return !pte_present(hva_pte) ||
@@ -526,27 

Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature

2021-03-31 Thread David Hildenbrand

On 30.03.21 12:30, Catalin Marinas wrote:

On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:

On 28/03/2021 13:21, Catalin Marinas wrote:

On Sat, Mar 27, 2021 at 03:23:24PM +, Catalin Marinas wrote:

On Fri, Mar 12, 2021 at 03:18:58PM +, Steven Price wrote:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..b31b7a821f90 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
   &pfn, &fault_ipa);
+
+   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && pfn_valid(pfn)) {
+   /*
+* VM will be able to see the page's tags, so we must ensure
+* they have been initialised. if PG_mte_tagged is set, tags
+* have already been initialised.
+*/
+   struct page *page = pfn_to_page(pfn);
+   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   mte_clear_page_tags(page_address(page));
+   }
+   }


This pfn_valid() check may be problematic. Following commit eeb0753ba27b
("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it returns
true for ZONE_DEVICE memory but such memory is allowed not to support
MTE.


Some more thinking, this should be safe as any ZONE_DEVICE would be
mapped as untagged memory in the kernel linear map. It could be slightly
inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
untagged memory. Another overhead is pfn_valid() which will likely end
up calling memblock_is_map_memory().

However, the bigger issue is that Stage 2 cannot disable tagging for
Stage 1 unless the memory is Non-cacheable or Device at S2. Is there a
way to detect what gets mapped in the guest as Normal Cacheable memory
and make sure it's only early memory or hotplug but no ZONE_DEVICE (or
something else like on-chip memory)?  If we can't guarantee that all
Cacheable memory given to a guest supports tags, we should disable the
feature altogether.


In stage 2 I believe we only have two types of mapping - 'normal' or
DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the latter is a
case of checking the 'device' variable, and makes sense to avoid the
overhead you describe.

This should also guarantee that all stage-2 cacheable memory supports tags,
as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid() should only
be true for memory that Linux considers "normal".


If you think "normal" == "normal System RAM", that's wrong; see below.



That's the problem. With Anshuman's commit I mentioned above,
pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
memory, not talking about some I/O mapping that requires Device_nGnRE).
So kvm_is_device_pfn() is false for such memory and it may be mapped as
Normal but it is not guaranteed to support tagging.


pfn_valid() means "there is a struct page"; if you do pfn_to_page() and 
touch the page, you won't fault. So Anshuman's commit is correct.


pfn_to_online_page() means, "there is a struct page and it's system RAM 
that's in use; the memmap has a sane content"




For user MTE, we get away with this as the MAP_ANONYMOUS requirement
would filter it out while arch_add_memory() will ensure it's mapped as
untagged in the linear map. See another recent fix for hotplugged
memory: d15dfd31384b ("arm64: mte: Map hotplugged memory as Normal
Tagged"). We needed to ensure that ZONE_DEVICE doesn't end up as tagged,
only hotplugged memory. Both are handled via arch_add_memory() in the arch
code with ZONE_DEVICE starting at devm_memremap_pages().


I now wonder if we can get a MAP_ANONYMOUS mapping of ZONE_DEVICE pfn
even without virtualisation.


I haven't checked all the code paths but I don't think we can get a
MAP_ANONYMOUS mapping of ZONE_DEVICE memory as we normally need a file
descriptor.


I certainly hope this is the case - it's the weird corner cases of device
drivers that worry me. E.g. I know i915 has a "hidden" mmap behind an ioctl
(see i915_gem_mmap_ioctl(), although this case is fine - it's MAP_SHARED).
Mali's kbase did something similar in the past.


I think this should be fine since it's not a MAP_ANONYMOUS (we do allow
MAP_SHARED to be tagged).




--
Thanks,

David / dhildenb
