Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-19 Thread Sean Christopherson
On Fri, Apr 19, 2024, Will Deacon wrote:
> > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t 
> > __kvm_handle_hva_range(struct kvm *kvm,
> > break;
> > }
> > r.ret |= range->handler(kvm, &gfn_range);
> > +
> > +  /*
> > +   * Use a precise gfn-based TLB flush when possible, as
> > +   * most mmu_notifier events affect a small-ish range.
> > +   * Fall back to a full TLB flush if the gfn-based flush
> > +   * fails, and don't bother trying the gfn-based flush
> > +   * if a full flush is already pending.
> > +   */
> > +  if (range->flush_on_ret && !need_flush && r.ret &&
> > +  kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > +   gfn_range.end - gfn_range.start + 1))
> 
> What's that '+ 1' needed for here?

 (a) To see if you're paying attention.
 (b) Because more is always better.
 (c) Because math is hard.
 (d) Because I haven't tested this.
 (e) Both (c) and (d).
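
[Note: gfn_range.end is exclusive in __kvm_handle_hva_range(), so the page
count for the range is gfn_range.end - gfn_range.start and the '+ 1' reads as
an off-by-one. A minimal sketch of the call without it -- my reading of the
exchange above, not a tested or posted patch:

	if (range->flush_on_ret && !need_flush && r.ret &&
	    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
					     gfn_range.end - gfn_range.start))
		need_flush = true;
]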


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-19 Thread Will Deacon
On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, Will Deacon wrote:
> > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > > >  wrote:
> > > > > 
> > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > > wrote:
> > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > > Also, if you're in the business of hacking the MMU notifier code, 
> > > > > > > it
> > > > > > > would be really great to change the .clear_flush_young() callback 
> > > > > > > so
> > > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > > moment,
> > > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > > 'flush_on_ret'
> > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > > lighter-weight and targeted TLBI in the architecture page-table 
> > > > > > > code
> > > > > > > when we actually update the ptes for small ranges.
> > > > > > 
> > > > > > Indeed, and I was looking at this earlier this week as it has a 
> > > > > > pretty
> > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, 
> > > > > > with
> > > > > > costly consequences).
> > > > > > 
> > > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > > code that deals with the page tables, as it has a pretty good idea 
> > > > > > of
> > > > > > what needs to be invalidated and how -- especially on architectures
> > > > > > that have a HW-broadcast facility like arm64.
> > > > > 
> > > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > > simpler, more
> > > > > straightforward solution would be to let architectures override 
> > > > > flush_on_ret,
> > > > > but I would prefer something like the below as x86 can also utilize a 
> > > > > range-based
> > > > > flush when running as a nested hypervisor.
> > > 
> > > ...
> > > 
> > > > I think this works for us on HW that has range invalidation, which
> > > > would already be a positive move.
> > > > 
> > > > For the lesser HW that isn't range capable, it also gives the
> > > > opportunity to perform the iteration ourselves or go for the nuclear
> > > > option if the range is larger than some arbitrary constant (though
> > > > this is additional work).
> > > > 
> > > > But this still considers the whole range as being affected by
> > > > range->handler(). It'd be interesting to try and see whether more
> > > > precise tracking is (or isn't) generally beneficial.
> > > 
> > > I assume the idea would be to let arch code do single-page invalidations 
> > > of
> > > stage-2 entries for each gfn?
> > 
> > Right, as it's the only code which knows which ptes actually ended up
> > being aged.
> > 
> > > Unless I'm having a brain fart, x86 can't make use of that functionality. 
> > >  Intel
> > > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > > AMD
> > > provides an instruction to do broadcast invalidations, but it takes a 
> > > virtual
> > > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > > address or
> > > a guest virtual address, but it's a moot point because KVM doesn't have 
> > > the guest
> > > virtual address, and if it's a host virtual address, there would need to 
> > > be valid
> > > mappings in the host page tables for it to work, which KVM can't 
> > > guarantee.
> > 
> > Ah, so it sounds like it would need to be an arch opt-in then.
> 
> Even if x86 (or some other arch code) could use the precise tracking, I think 
> it
> would make sense to have the behavior be arch specific.  Adding infrastructure
> to get information from arch code, only to turn around and give it back to 
> arch
> code would be odd.

Sorry, yes, that's what I had in mind. Basically, a way for the arch code
to say "I've handled the TLBI, don't worry about it."

> Unless arm64 can't do the invalidation immediately after aging the stage-2 
> PTE,
> the best/easiest solution would be to let arm64 opt out of the common TLB 
> flush
> when a SPTE is made young.
> 
> With the range-based flushing bundled in, this?
> 
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c  | 40 +---
>  2 files changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index afbc99264ffa..8fe5f5e16919 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header 
> kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> +int kvm_arch_flush_tlb_if_young(void);
> +
>  static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned 

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targeted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- especially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > simpler, more
> > > > straightforward solution would be to let architectures override 
> > > > flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a 
> > > > range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  
> > Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > AMD
> > provides an instruction to do broadcast invalidations, but it takes a 
> > virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > address or
> > a guest virtual address, but it's a moot point because KVM doesn't have 
> > guest
> > virtual address, and if it's a host virtual address, there would need to be 
> > valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header 
kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+   return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define 
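
[Note: a minimal sketch of how an architecture might use the
kvm_arch_flush_tlb_if_young() hook Sean introduces above -- purely
illustrative, assuming the __weak default stays as posted and that returning
false is what lets the arch skip the common flush after aging:

	/* e.g. in the arch's KVM code -- hypothetical, not part of the posted diff */
	int kvm_arch_flush_tlb_if_young(void)
	{
		/*
		 * The stage-2 aging path does (or deliberately skips) its own
		 * TLB invalidation, so tell the generic mmu_notifier code not
		 * to issue a VM-wide kvm_flush_remote_tlbs() on its behalf.
		 */
		return false;
	}
]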

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Will Deacon
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> > wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- especially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > simpler, more
> > > straightforward solution would be to let architectures override 
> > > flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a 
> > > range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  
> Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address 
> or
> a guest virtual address, but it's a moot point because KVM doesn't have the 
> guest
> virtual address, and if it's a host virtual address, there would need to be 
> valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.

Will


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-15 Thread Sean Christopherson
On Sat, Apr 13, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> wrote:
> > 
> > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > would be really great to change the .clear_flush_young() callback so
> > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > when we actually update the ptes for small ranges.
> > > 
> > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > costly consequences).
> > > 
> > > In general, it feels like the TLB invalidation should stay with the
> > > code that deals with the page tables, as it has a pretty good idea of
> > > what needs to be invalidated and how -- especially on architectures
> > > that have a HW-broadcast facility like arm64.
> > 
> > Would this be roughly on par with an in-line flush on arm64?  The simpler, 
> > more
> > straightforward solution would be to let architectures override 
> > flush_on_ret,
> > but I would prefer something like the below as x86 can also utilize a 
> > range-based
> > flush when running as a nested hypervisor.

...

> I think this works for us on HW that has range invalidation, which
> would already be a positive move.
> 
> For the lesser HW that isn't range capable, it also gives the
> opportunity to perform the iteration ourselves or go for the nuclear
> option if the range is larger than some arbitrary constant (though
> this is additional work).
> 
> But this still considers the whole range as being affected by
> range->handler(). It'd be interesting to try and see whether more
> precise tracking is (or isn't) generally beneficial.

I assume the idea would be to let arch code do single-page invalidations of
stage-2 entries for each gfn?

Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
provides an instruction to do broadcast invalidations, but it takes a virtual
address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
a guest virtual address, but it's a moot point because KVM doesn't have the guest
virtual address, and if it's a host virtual address, there would need to be valid
mappings in the host page tables for it to work, which KVM can't guarantee.


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-13 Thread Marc Zyngier
On Fri, 12 Apr 2024 15:54:22 +0100,
Sean Christopherson  wrote:
> 
> On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > Also, if you're in the business of hacking the MMU notifier code, it
> > > would be really great to change the .clear_flush_young() callback so
> > > that the architecture could handle the TLB invalidation. At the moment,
> > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > lighter-weight and targeted TLBI in the architecture page-table code
> > > when we actually update the ptes for small ranges.
> > 
> > Indeed, and I was looking at this earlier this week as it has a pretty
> > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > costly consequences).
> > 
> > In general, it feels like the TLB invalidation should stay with the
> > code that deals with the page tables, as it has a pretty good idea of
> > what needs to be invalidated and how -- especially on architectures
> > that have a HW-broadcast facility like arm64.
> 
> Would this be roughly on par with an in-line flush on arm64?  The simpler, 
> more
> straightforward solution would be to let architectures override flush_on_ret,
> but I would prefer something like the below as x86 can also utilize a 
> range-based
> flush when running as a nested hypervisor.
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff0a20565f90..b65116294efe 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t 
> __kvm_handle_hva_range(struct kvm *kvm,
> struct kvm_gfn_range gfn_range;
> struct kvm_memory_slot *slot;
> struct kvm_memslots *slots;
> +   bool need_flush = false;
> int i, idx;
>  
> if (WARN_ON_ONCE(range->end <= range->start))
> @@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t 
> __kvm_handle_hva_range(struct kvm *kvm,
> break;
> }
> r.ret |= range->handler(kvm, &gfn_range);
> +
> +   /*
> +* Use a precise gfn-based TLB flush when possible, as
> +* most mmu_notifier events affect a small-ish range.
> +* Fall back to a full TLB flush if the gfn-based 
> flush
> +* fails, and don't bother trying the gfn-based flush
> +* if a full flush is already pending.
> +*/
> +   if (range->flush_on_ret && !need_flush && r.ret &&
> +   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> +gfn_range.end - gfn_range.start + 1))
> +   need_flush = true;
> }
> }
>  
> -   if (range->flush_on_ret && r.ret)
> +   if (need_flush)
> kvm_flush_remote_tlbs(kvm);
>  
> if (r.found_memslot)

I think this works for us on HW that has range invalidation, which
would already be a positive move.

For the lesser HW that isn't range capable, it also gives the
opportunity to perform the iteration ourselves or go for the nuclear
option if the range is larger than some arbitrary constant (though
this is additional work).

But this still considers the whole range as being affected by
range->handler(). It'd be interesting to try and see whether more
precise tracking is (or isn't) generally beneficial.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
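
[Note: to make the "iterate or go nuclear" idea concrete, a rough sketch of
the kind of arch-side policy Marc describes -- entirely hypothetical (the
helper name and the threshold are made up), shaped around the
kvm_arch_flush_remote_tlbs_range() call in the diff quoted above, where a
non-zero return makes the generic code fall back to a full flush:

	int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
	{
		u64 i;

		/*
		 * No range-capable TLBI on this HW and the range is large:
		 * punt to the full VMID flush done by the caller.
		 */
		if (nr_pages > 64)
			return -EOPNOTSUPP;

		/* Otherwise iterate single-page invalidations ourselves. */
		for (i = 0; i < nr_pages; i++)
			arch_flush_stage2_page(kvm, gfn + i);	/* hypothetical per-IPA TLBI */

		return 0;
	}
]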


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread David Hildenbrand

On 11.04.24 18:55, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> > Paolo,
> >
> > I may miss a bunch of details here (as I still remember some change_pte
> > patches previously on the list..), however not sure whether we considered
> > enabling it?  Asked because I remember Andrea used to have a custom tree
> > maintaining that part:
> >
> > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
> > Maybe it can't be enabled for some reason that I overlooked in the current
> > tree, or we just decided not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one

I assume Andrea used it on his tree where he also has a version of 
"randprotect" (even included in that commit subject) to mitigate a KSM 
security issue that was reported by some security researchers [1] a 
while ago. From what I recall, the industry did not end up caring about 
that security issue that much.

IIUC, with "randprotect" we get a lot more R/O protection even when not 
de-duplicating a page -- thus the name. Likely, the reporter mentioned 
in the commit is a researcher that played with Andrea's fix for the 
security issue. But I'm just speculating at this point :)

> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Yes. Can always be re-added in a possibly cleaner fashion (like you note 
above), when deemed necessary and we are willing to support it.

[1] https://gruss.cc/files/remote_dedup.pdf

--
Cheers,

David / dhildenb



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Sean Christopherson
On Fri, Apr 12, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > Also, if you're in the business of hacking the MMU notifier code, it
> > would be really great to change the .clear_flush_young() callback so
> > that the architecture could handle the TLB invalidation. At the moment,
> > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > being set by kvm_handle_hva_range(), whereas we could do a much
> > lighter-weight and targeted TLBI in the architecture page-table code
> > when we actually update the ptes for small ranges.
> 
> Indeed, and I was looking at this earlier this week as it has a pretty
> devastating effect with NV (it blows the shadow S2 for that VMID, with
> costly consequences).
> 
> In general, it feels like the TLB invalidation should stay with the
> code that deals with the page tables, as it has a pretty good idea of
> what needs to be invalidated and how -- especially on architectures
> that have a HW-broadcast facility like arm64.

Would this be roughly on par with an in-line flush on arm64?  The simpler, more
straightforward solution would be to let architectures override flush_on_ret,
but I would prefer something like the below as x86 can also utilize a 
range-based
flush when running as a nested hypervisor.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..b65116294efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
+   bool need_flush = false;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
@@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
break;
}
r.ret |= range->handler(kvm, &gfn_range);
+
+   /*
+* Use a precise gfn-based TLB flush when possible, as
+* most mmu_notifier events affect a small-ish range.
+* Fall back to a full TLB flush if the gfn-based flush
+* fails, and don't bother trying the gfn-based flush
+* if a full flush is already pending.
+*/
+   if (range->flush_on_ret && !need_flush && r.ret &&
+   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+gfn_range.end - gfn_range.start + 1))
+   need_flush = true;
}
}
 
-   if (range->flush_on_ret && r.ret)
+   if (need_flush)
kvm_flush_remote_tlbs(kvm);
 
if (r.found_memslot)



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Marc Zyngier
On Fri, 12 Apr 2024 11:44:09 +0100,
Will Deacon  wrote:
> 
> On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index dc04bc767865..ff17849be9f4 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> > return false;
> >  }
> >  
> > -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > -{
> > -   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> > -
> > -   if (!kvm->arch.mmu.pgt)
> > -   return false;
> > -
> > -   WARN_ON(range->end - range->start != 1);
> > -
> > -   /*
> > -* If the page isn't tagged, defer to user_mem_abort() for sanitising
> > -* the MTE tags. The S2 pte should have been unmapped by
> > -* mmu_notifier_invalidate_range_end().
> > -*/
> > -   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> > -   return false;
> > -
> > -   /*
> > -* We've moved a page around, probably through CoW, so let's treat
> > -* it just like a translation fault and the map handler will clean
> > -* the cache to the PoC.
> > -*
> > -* The MMU notifiers will have unmapped a huge PMD before calling
> > -* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> > -* therefore we never need to clear out a huge PMD through this
> > -* calling path and a memcache is not required.
> > -*/
> > -   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> > -  PAGE_SIZE, __pfn_to_phys(pfn),
> > -  KVM_PGTABLE_PROT_R, NULL, 0);
> > -
> > -   return false;
> > -}
> > -
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> > u64 size = (range->end - range->start) << PAGE_SHIFT;
> 
> Thanks. It's nice to see this code retire:
> 
> Acked-by: Will Deacon 
> 
> Also, if you're in the business of hacking the MMU notifier code, it
> would be really great to change the .clear_flush_young() callback so
> that the architecture could handle the TLB invalidation. At the moment,
> the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> being set by kvm_handle_hva_range(), whereas we could do a much
> lighter-weight and targeted TLBI in the architecture page-table code
> when we actually update the ptes for small ranges.

Indeed, and I was looking at this earlier this week as it has a pretty
devastating effect with NV (it blows the shadow S2 for that VMID, with
costly consequences).

In general, it feels like the TLB invalidation should stay with the
code that deals with the page tables, as it has a pretty good idea of
what needs to be invalidated and how -- especially on architectures
that have a HW-broadcast facility like arm64.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Will Deacon
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
> kvm_gfn_range *range)
>   return false;
>  }
>  
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> - kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> - if (!kvm->arch.mmu.pgt)
> - return false;
> -
> - WARN_ON(range->end - range->start != 1);
> -
> - /*
> -  * If the page isn't tagged, defer to user_mem_abort() for sanitising
> -  * the MTE tags. The S2 pte should have been unmapped by
> -  * mmu_notifier_invalidate_range_end().
> -  */
> - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> - return false;
> -
> - /*
> -  * We've moved a page around, probably through CoW, so let's treat
> -  * it just like a translation fault and the map handler will clean
> -  * the cache to the PoC.
> -  *
> -  * The MMU notifiers will have unmapped a huge PMD before calling
> -  * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -  * therefore we never need to clear out a huge PMD through this
> -  * calling path and a memcache is not required.
> -  */
> - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -PAGE_SIZE, __pfn_to_phys(pfn),
> -KVM_PGTABLE_PROT_R, NULL, 0);
> -
> - return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>   u64 size = (range->end - range->start) << PAGE_SHIFT;

Thanks. It's nice to see this code retire:

Acked-by: Will Deacon 

Also, if you're in the business of hacking the MMU notifier code, it
would be really great to change the .clear_flush_young() callback so
that the architecture could handle the TLB invalidation. At the moment,
the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
being set by kvm_handle_hva_range(), whereas we could do a much
lighter-weight and targeted TLBI in the architecture page-table code
when we actually update the ptes for small ranges.

Will
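
[Note: the whole-VMID flush Will refers to is the epilogue of the generic
handler -- in current code, roughly the lines that Sean's follow-up diff
(earlier in this digest) replaces:

	if (range->flush_on_ret && r.ret)
		kvm_flush_remote_tlbs(kvm);

i.e. a single young pte anywhere in the range is enough to trigger a flush of
the entire VMID on arm64, rather than a targeted TLBI for just the aged
entries.]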


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Peter Xu
On Thu, Apr 11, 2024 at 06:55:44PM +0200, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> > Paolo,
> >
> > I may miss a bunch of details here (as I still remember some change_pte
> > patches previously on the list..), however not sure whether we considered
> > enabling it?  Asked because I remember Andrea used to have a custom tree
> > maintaining that part:
> >
> > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
> > Maybe it can't be enabled for some reason that I overlooked in the current
> > tree, or we just decided not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one
> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Right, it was pretty old and I have no idea if that was discussed or
published before..  It would be better to have discussed this earlier.

As long as the decision is made with that in mind, either way looks fine to
me, and I agree either way is better than keeping the status quo.

I also copied Andrea when I replied, so I guess he should be aware of this
and can chime in anytime.

Thanks!

-- 
Peter Xu



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Paolo Bonzini
On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> Paolo,
>
> I may miss a bunch of details here (as I still remember some change_pte
> patches previously on the list..), however not sure whether we considered
> enabling it?  Asked because I remember Andrea used to have a custom tree
> maintaining that part:
>
> https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.

> Maybe it can't be enabled for some reason that I overlooked in the current
> tree, or we just decided not to?

I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one
has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.

Paolo

> Thanks,
>
> --
> Peter Xu
>



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-08 Thread Peter Xu
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
> 
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
> 
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
> 
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.

Paolo,

I may miss a bunch of details here (as I still remember some change_pte
patches previously on the list..), however not sure whether we considered
enabling it?  Asked because I remember Andrea used to have a custom tree
maintaining that part:

https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

Maybe it can't be enabled for some reason that I overlooked in the current
tree, or we just decided not to?

Thanks,

-- 
Peter Xu
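
[Note: the mechanics behind "no hope of finding a sPTE to change" in the
commit message quoted above -- a simplified sketch of the KSM-style update
path after commit 6bdb913f0a70 (illustrative, not an exact code excerpt):

	mmu_notifier_invalidate_range_start(&range);	/* KVM unmaps its sPTEs here */
	set_pte_at_notify(mm, addr, ptep, newpte);	/* fires .change_pte() ->
							   kvm_set_spte_gfn(), which no
							   longer finds an sPTE to update */
	mmu_notifier_invalidate_range_end(&range);
]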



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-08 Thread Michael Ellerman
Paolo Bonzini  writes:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
>
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
>
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
>
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.
>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/arm64/kvm/mmu.c  | 34 -
>  arch/loongarch/include/asm/kvm_host.h |  1 -
>  arch/loongarch/kvm/mmu.c  | 32 
>  arch/mips/kvm/mmu.c   | 30 ---
>  arch/powerpc/include/asm/kvm_ppc.h|  1 -
>  arch/powerpc/kvm/book3s.c |  5 ---
>  arch/powerpc/kvm/book3s.h |  1 -
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
>  arch/powerpc/kvm/book3s_hv.c  |  1 -
>  arch/powerpc/kvm/book3s_pr.c  |  7 
>  arch/powerpc/kvm/e500_mmu_host.c  |  6 ---

LGTM.

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-08 Thread maobibo




On 2024/4/5 7:58 PM, Paolo Bonzini wrote:

The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini 
---
  arch/arm64/kvm/mmu.c  | 34 -
  arch/loongarch/include/asm/kvm_host.h |  1 -
  arch/loongarch/kvm/mmu.c  | 32 
  arch/mips/kvm/mmu.c   | 30 ---
  arch/powerpc/include/asm/kvm_ppc.h|  1 -
  arch/powerpc/kvm/book3s.c |  5 ---
  arch/powerpc/kvm/book3s.h |  1 -
  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
  arch/powerpc/kvm/book3s_hv.c  |  1 -
  arch/powerpc/kvm/book3s_pr.c  |  7 
  arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
  arch/riscv/kvm/mmu.c  | 20 --
  arch/x86/kvm/mmu/mmu.c| 54 +--
  arch/x86/kvm/mmu/spte.c   | 16 
  arch/x86/kvm/mmu/spte.h   |  2 -
  arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
  arch/x86/kvm/mmu/tdp_mmu.h|  1 -
  include/linux/kvm_host.h  |  2 -
  include/trace/events/kvm.h| 15 
  virt/kvm/kvm_main.c   | 43 -
  20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
return false;
  }
  
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-   if (!kvm->arch.mmu.pgt)
-   return false;
-
-   WARN_ON(range->end - range->start != 1);
-
-   /*
-* If the page isn't tagged, defer to user_mem_abort() for sanitising
-* the MTE tags. The S2 pte should have been unmapped by
-* mmu_notifier_invalidate_range_end().
-*/
-   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-   return false;
-
-   /*
-* We've moved a page around, probably through CoW, so let's treat
-* it just like a translation fault and the map handler will clean
-* the cache to the PoC.
-*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-  PAGE_SIZE, __pfn_to_phys(pfn),
-  KVM_PGTABLE_PROT_R, NULL, 0);
-
-   return false;
-}
-
  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
  {
u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
  void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
  int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool 
write);
  
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
end, bool blockable);
  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct 

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-06 Thread Anup Patel
On Fri, Apr 5, 2024 at 5:28 PM Paolo Bonzini  wrote:
>
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
>
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
>
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
>
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.
>
> Signed-off-by: Paolo Bonzini 

For KVM RISC-V:
Acked-by: Anup Patel 

Regards,
Anup

> ---
>  arch/arm64/kvm/mmu.c  | 34 -
>  arch/loongarch/include/asm/kvm_host.h |  1 -
>  arch/loongarch/kvm/mmu.c  | 32 
>  arch/mips/kvm/mmu.c   | 30 ---
>  arch/powerpc/include/asm/kvm_ppc.h|  1 -
>  arch/powerpc/kvm/book3s.c |  5 ---
>  arch/powerpc/kvm/book3s.h |  1 -
>  arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
>  arch/powerpc/kvm/book3s_hv.c  |  1 -
>  arch/powerpc/kvm/book3s_pr.c  |  7 
>  arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
>  arch/riscv/kvm/mmu.c  | 20 --
>  arch/x86/kvm/mmu/mmu.c| 54 +--
>  arch/x86/kvm/mmu/spte.c   | 16 
>  arch/x86/kvm/mmu/spte.h   |  2 -
>  arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
>  arch/x86/kvm/mmu/tdp_mmu.h|  1 -
>  include/linux/kvm_host.h  |  2 -
>  include/trace/events/kvm.h| 15 
>  virt/kvm/kvm_main.c   | 43 -
>  20 files changed, 2 insertions(+), 327 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
> kvm_gfn_range *range)
> return false;
>  }
>
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> -   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> -   if (!kvm->arch.mmu.pgt)
> -   return false;
> -
> -   WARN_ON(range->end - range->start != 1);
> -
> -   /*
> -* If the page isn't tagged, defer to user_mem_abort() for sanitising
> -* the MTE tags. The S2 pte should have been unmapped by
> -* mmu_notifier_invalidate_range_end().
> -*/
> -   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> -   return false;
> -
> -   /*
> -* We've moved a page around, probably through CoW, so let's treat
> -* it just like a translation fault and the map handler will clean
> -* the cache to the PoC.
> -*
> -* The MMU notifiers will have unmapped a huge PMD before calling
> -* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -* therefore we never need to clear out a huge PMD through this
> -* calling path and a memcache is not required.
> -*/
> -   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -  PAGE_SIZE, __pfn_to_phys(pfn),
> -  KVM_PGTABLE_PROT_R, NULL, 0);
> -
> -   return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> u64 size = (range->end - range->start) << PAGE_SHIFT;
> diff --git a/arch/loongarch/include/asm/kvm_host.h 
> b/arch/loongarch/include/asm/kvm_host.h
> index 2d62f7b0d377..69305441f40d 100644
> --- a/arch/loongarch/include/asm/kvm_host.h
> +++ b/arch/loongarch/include/asm/kvm_host.h
> @@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
>  void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
>  int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool 
> write);
>
> -void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
> end, bool blockable);
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
>  int kvm_test_age_hva(struct kvm *kvm, 

[PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini 
---
 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/trace/events/kvm.h| 15 
 virt/kvm/kvm_main.c   | 43 -
 20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-   if (!kvm->arch.mmu.pgt)
-   return false;
-
-   WARN_ON(range->end - range->start != 1);
-
-   /*
-* If the page isn't tagged, defer to user_mem_abort() for sanitising
-* the MTE tags. The S2 pte should have been unmapped by
-* mmu_notifier_invalidate_range_end().
-*/
-   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-   return false;
-
-   /*
-* We've moved a page around, probably through CoW, so let's treat
-* it just like a translation fault and the map handler will clean
-* the cache to the PoC.
-*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-  PAGE_SIZE, __pfn_to_phys(pfn),
-  KVM_PGTABLE_PROT_R, NULL, 0);
-
-   return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
range->end <<