Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 19/04/21 17:09, Sean Christopherson wrote:
> > > - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> > > own interval-tree-based filter is also using a similar mechanism that is
> > > likewise not fair, so it should be okay.
> > 
> > The one concern I had with an unfair mechanism of this nature is that, in theory,
> > the memslot update could be blocked indefinitely.
> 
> Yep, that's why I mentioned it.
> 
> > > @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> > >   WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
> > >   slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> > > - down_write(&kvm->mmu_notifier_slots_lock);
> > > + /*
> > > +  * This cannot be an rwsem because the MMU notifier must not run
> > > +  * inside the critical section.  A sleeping rwsem cannot exclude
> > > +  * that.
> > 
> > How on earth did you decipher that from the splat?  I stared at it for a good
> > five minutes and was completely befuddled.
> 
> Just scratch that, it makes no sense.  It's much simpler, but you have
> to look at include/linux/mmu_notifier.h to figure it out:

LOL, glad you could figure it out, I wasn't getting anywhere, mmu_notifier.h or
not.

> invalidate_range_start
>   take pseudo lock
>   down_read()   (*)
>   release pseudo lock
> invalidate_range_end
>   take pseudo lock  (**)
>   up_read()
>   release pseudo lock
> 
> At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
> at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
> 
> This could cause a deadlock (ignoring for a second that the pseudo lock
> is not a lock):
> 
> - invalidate_range_start waits on down_read(), because the rwsem is
> held by install_new_memslots
> 
> - install_new_memslots waits on down_write(), because the rwsem is
> held till (another) invalidate_range_end finishes
> 
> - invalidate_range_end waits on the pseudo lock, held by
> invalidate_range_start.
> 
> Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
> it would change the *shared* rwsem readers into *shared recursive*
> readers).  This also means that there's no need for a raw spinlock.

Ahh, thanks, this finally made things click.
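
For posterity: the "pseudo lock" is the lockdep-only map that the mmu_notifier
wrappers hold around *both* callbacks.  Condensed and paraphrased from memory
(the real code splits this between mmu_notifier.h and mmu_notifier.c, so don't
quote me on the exact shape):

    static inline void
    mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
    {
            might_sleep();
            /* "take pseudo lock" */
            lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
            __mmu_notifier_invalidate_range_start(range);   /* KVM's down_read() happens here  (*) */
            /* "release pseudo lock" */
            lock_map_release(&__mmu_notifier_invalidate_range_start_map);
    }

    static inline void
    mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
    {
            /* "take pseudo lock"  (**) */
            lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
            __mmu_notifier_invalidate_range_end(range);     /* KVM's up_read() happens here */
            /* "release pseudo lock" */
            lock_map_release(&__mmu_notifier_invalidate_range_start_map);
    }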

> Given this simple explanation, I think it's okay to include this

LOL, "simple".

> patch in the merge window pull request, with the fix after my
> signature squashed in.  The fix actually undoes a lot of the
> changes to __kvm_handle_hva_range that this patch made, so the
> result is relatively simple.  You can already find the result
> in kvm/queue.

...

>  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> const struct kvm_hva_range *range)
>  {
> @@ -515,10 +495,6 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   idx = srcu_read_lock(&kvm->srcu);
> - if (range->must_lock &&
> - kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> - goto out_unlock;
> -
>   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>   slots = __kvm_memslots(kvm, i);
>   kvm_for_each_memslot(slot, slots) {
> @@ -547,8 +523,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>   gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
>   gfn_range.slot = slot;
> - if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> - goto out_unlock;
> + if (!locked) {
> + locked = true;
> + KVM_MMU_LOCK(kvm);
> + if (!IS_KVM_NULL_FN(range->on_lock))
> + range->on_lock(kvm, range->start, range->end);
> + if (IS_KVM_NULL_FN(range->handler))
> + break;

This can/should be "goto out_unlock"; "break" only takes us out of the memslots
walk, whereas we want to get out of the address space loop entirely.  Not a
functional problem, but we might walk all SMM memslots unnecessarily.
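
I.e. something like this (untested sketch, with the out_unlock label re-added):

    if (!locked) {
            locked = true;
            KVM_MMU_LOCK(kvm);
            if (!IS_KVM_NULL_FN(range->on_lock))
                    range->on_lock(kvm, range->start, range->end);
            if (IS_KVM_NULL_FN(range->handler))
                    goto out_unlock;        /* instead of break */
    }

    /* ... rest of the memslots walk and the TLB flush ... */

out_unlock:
    if (locked)
            KVM_MMU_UNLOCK(kvm);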

> + }
>   ret |= range->handler(kvm, &gfn_range);
>   }
> @@ -557,7 +539,6 @@ static __always_inline int __kvm_handle_hva_range(struct 
> kvm *kvm,
>   if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
>   kvm_flush_remote_tlbs(kvm);
> -out_unlock:
>   if (locked)
>   KVM_MMU_UNLOCK(kvm);
> @@ -580,7 +561,6 @@ static __always_inline int kvm_handle_hva_range(struct 
> mmu_notifier *mn,
>   .pte= pte,
>   .handler= handler,
>   .on_lock= (void *)kvm_null_fn,
> - .must_lock  = false,
>   .flush_on_ret   = true,
>   

Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Paolo Bonzini

On 19/04/21 17:09, Sean Christopherson wrote:

- this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
own interval-tree-based filter is also using a similar mechanism that is
likewise not fair, so it should be okay.


The one concern I had with an unfair mechanism of this nature is that, in theory,
the memslot update could be blocked indefinitely.


Yep, that's why I mentioned it.


@@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
-   down_write(&kvm->mmu_notifier_slots_lock);
+   /*
+* This cannot be an rwsem because the MMU notifier must not run
+* inside the critical section.  A sleeping rwsem cannot exclude
+* that.


How on earth did you decipher that from the splat?  I stared at it for a good
five minutes and was completely befuddled.


Just scratch that, it makes no sense.  It's much simpler, but you have
to look at include/linux/mmu_notifier.h to figure it out:

invalidate_range_start
  take pseudo lock
  down_read()   (*)
  release pseudo lock
invalidate_range_end
  take pseudo lock  (**)
  up_read()
  release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end waits on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers).  This also means that there's no need for a raw spinlock.
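
(In lock_acquire() terms, and purely to illustrate the remark above rather
than as a real annotation patch: a rwsem reader today is a fair, shared
reader, while the open-coded count behaves like a shared *recursive* reader,
which never waits behind a queued "writer" and therefore cannot take part in
the cycle.)

    /* down_read() today: fair, shared reader (read=1). */
    lock_acquire(&sem->dep_map, 0, 0, 1, 1, NULL, _RET_IP_);

    /*
     * What the open-coded count effectively behaves like: a shared
     * recursive reader (read=2), never blocked by a pending "writer"
     * such as install_new_memslots.
     */
    lock_acquire(&sem->dep_map, 0, 0, 2, 1, NULL, _RET_IP_);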

Given this simple explanation, I think it's okay to include this
patch in the merge window pull request, with the fix after my
signature squashed in.  The fix actually undoes a lot of the
changes to __kvm_handle_hva_range that this patch made, so the
result is relatively simple.  You can already find the result
in kvm/queue.

Paolo

From daefeeb229ba8be5bd819a51875bc1fd5e74fc85 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini 
Date: Mon, 19 Apr 2021 09:01:46 -0400
Subject: [PATCH] KVM: avoid "deadlock" between install_new_memslots and MMU
 notifier

Wanpeng Li is reporting this splat:

 ==
 WARNING: possible circular locking dependency detected
 5.12.0-rc3+ #6 Tainted: G   OE
 --
 qemu-system-x86/3069 is trying to acquire lock:
 9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: 
__mmu_notifier_invalidate_range_end+0x5/0x190

 but task is already holding lock:
 aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: 
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 which lock already depends on the new lock.

This corresponds to the following MMU notifier logic:

invalidate_range_start
  take pseudo lock
  down_read()   (*)
  release pseudo lock
invalidate_range_end
  take pseudo lock  (**)
  up_read()
  release pseudo lock

At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is
held by install_new_memslots

- install_new_memslots waits on down_write(), because the rwsem is
held till (another) invalidate_range_end finishes

- invalidate_range_end waits on the pseudo lock, held by
invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock.  This also allows handling blockable and non-blockable
critical section in the same way.

Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.

Signed-off-by: Paolo Bonzini 
---
 Documentation/virt/kvm/locking.rst |   9 +--
 include/linux/kvm_host.h   |   8 +-
 virt/kvm/kvm_main.c| 119 ++---
 3 files changed, 67 insertions(+), 69 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.rst
+++ 

Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Paolo Bonzini wrote:
> On 19/04/21 10:49, Wanpeng Li wrote:
> > I saw this splatting:
> > 
> >   ==
> >   WARNING: possible circular locking dependency detected
> >   5.12.0-rc3+ #6 Tainted: G   OE
> >   --
> >   qemu-system-x86/3069 is trying to acquire lock:
> >   9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
> > at: __mmu_notifier_invalidate_range_end+0x5/0x190
> > 
> >   but task is already holding lock:
> >   aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
> > kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
> 
> I guess it is possible to open-code the wait using a readers count and a
> spinlock (see patch after signature).  This allows including the
> rcu_assign_pointer in the same critical section that checks the number
> of readers.  Also on the plus side, the init_rwsem() is replaced by
> slightly nicer code.

Ugh, the count approach is nearly identical to Ben's original code.  Using a
rwsem seemed so clever :-/

> IIUC this could be extended to non-sleeping invalidations too, but I
> am not really sure about that.

Yes, that should be fine.

> There are some issues with the patch though:
> 
> - I am not sure if this should be a raw spin lock to avoid the same issue
> on PREEMPT_RT kernel.  That said the critical section is so tiny that using
> a raw spin lock may make sense anyway

If using spinlock_t is problematic, wouldn't mmu_lock already be an issue?  Or
am I misunderstanding your concern?

> - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> own interval-tree-based filter is also using a similar mechanism that is
> likewise not fair, so it should be okay.

The one concern I had with an unfair mechanism of this nature is that, in theory,
the memslot update could be blocked indefinitely.

> Any opinions?  For now I placed the change below in kvm/queue, but I'm
> leaning towards delaying this optimization to the next merge window.

I think delaying it makes sense.

> @@ -1333,9 +1351,22 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>   WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>   slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> - down_write(&kvm->mmu_notifier_slots_lock);
> + /*
> +  * This cannot be an rwsem because the MMU notifier must not run
> +  * inside the critical section.  A sleeping rwsem cannot exclude
> +  * that.

How on earth did you decipher that from the splat?  I stared at it for a good
five minutes and was completely befuddled.

> +  */
> + spin_lock(&kvm->mn_invalidate_lock);
> + prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> + while (kvm->mn_active_invalidate_count) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + spin_unlock(&kvm->mn_invalidate_lock);
> + schedule();
> + spin_lock(&kvm->mn_invalidate_lock);
> + }
> + finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>   rcu_assign_pointer(kvm->memslots[as_id], slots);
> - up_write(&kvm->mmu_notifier_slots_lock);
> + spin_unlock(&kvm->mn_invalidate_lock);
>   synchronize_srcu_expedited(&kvm->srcu);
> 
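
For my own notes, I'm assuming the notifier side pairs with that wait roughly
as follows; this is my sketch of the idea, not copied from kvm/queue (the
"wake" local is mine):

    /* kvm_mmu_notifier_invalidate_range_start(), blockable or not: */
    spin_lock(&kvm->mn_invalidate_lock);
    kvm->mn_active_invalidate_count++;
    spin_unlock(&kvm->mn_invalidate_lock);

    /* ... and the pairing decrement in kvm_mmu_notifier_invalidate_range_end(): */
    bool wake;

    spin_lock(&kvm->mn_invalidate_lock);
    wake = (--kvm->mn_active_invalidate_count == 0);
    spin_unlock(&kvm->mn_invalidate_lock);

    /*
     * install_new_memslots() waits under slots_lock, so there is at most
     * one waiter to wake.
     */
    if (wake)
            rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);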


Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Paolo Bonzini

On 19/04/21 10:49, Wanpeng Li wrote:

I saw this splatting:

  ==
  WARNING: possible circular locking dependency detected
  5.12.0-rc3+ #6 Tainted: G   OE
  --
  qemu-system-x86/3069 is trying to acquire lock:
  9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
at: __mmu_notifier_invalidate_range_end+0x5/0x190

  but task is already holding lock:
  aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]


I guess it is possible to open-code the wait using a readers count and a
spinlock (see patch after signature).  This allows including the
rcu_assign_pointer in the same critical section that checks the number
of readers.  Also on the plus side, the init_rwsem() is replaced by
slightly nicer code.

IIUC this could be extended to non-sleeping invalidations too, but I
am not really sure about that.

There are some issues with the patch though:

- I am not sure if this should be a raw spin lock to avoid the same issue
on PREEMPT_RT kernel.  That said the critical section is so tiny that using
a raw spin lock may make sense anyway

- this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
own interval-tree-based filter is also using a similar mechanism that is
likewise not fair, so it should be okay.

Any opinions?  For now I placed the change below in kvm/queue, but I'm
leaning towards delaying this optimization to the next merge window.

Paolo

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index 8f5d5bcf5689..e628f48dfdda 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,12 +16,11 @@ The acquisition orders for mutexes are as follows:
 - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
   them together is quite rare.
 
-- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of

+- kvm->mn_active_invalidate_count ensures that pairs of
   invalidate_range_start() and invalidate_range_end() callbacks
-  use the same memslots array.  kvm->slots_lock is taken outside the
-  write-side critical section of kvm->mmu_notifier_slots_lock, so
-  MMU notifiers must not take kvm->slots_lock.  No other write-side
-  critical sections should be added.
+  use the same memslots array.  kvm->slots_lock is taken on the
+  waiting side in install_new_memslots, so MMU notifiers must not
+  take kvm->slots_lock.
 
 On x86:
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h

index 76b340dd6981..44a4a0c5148a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,11 +472,15 @@ struct kvm {
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
 	struct mutex slots_lock;

-   struct rw_semaphore mmu_notifier_slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 
+	/* Used to wait for completion of MMU notifiers.  */

+   spinlock_t mn_invalidate_lock;
+   unsigned long mn_active_invalidate_count;
+   struct rcuwait mn_memslots_update_rcuwait;
+
/*
 * created_vcpus is protected by kvm->lock, and is incremented
 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -662,7 +666,7 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
  lockdep_is_held(&kvm->slots_lock) ||
- lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+ READ_ONCE(kvm->mn_active_invalidate_count) ||
  !refcount_read(&kvm->users_count));
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c

index ff9e95eb6960..cdaa1841e725 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -624,7 +624,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 * otherwise, mmu_notifier_count is incremented unconditionally.
 */
if (!kvm->mmu_notifier_count) {
-   lockdep_assert_held(&kvm->mmu_notifier_slots_lock);
+   WARN_ON(!READ_ONCE(kvm->mn_active_invalidate_count));
return;
}
 
@@ -689,10 +689,13 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,

 * The complexity required to handle conditional locking for this case
 * is not worth the marginal benefits, the VM is likely doomed anyways.
 *
-* Pairs with the up_read in range_end().
+* Pairs with the decrement in range_end().
 */
-   if (blockable)
-   down_read(&kvm->mmu_notifier_slots_lock);
+   if (blockable) {
+   

Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Wanpeng Li
On Fri, 2 Apr 2021 at 08:59, Sean Christopherson  wrote:
>
> Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> notifications.  Because mmu_notifier_count must be modified while holding
> mmu_lock for write, and must always be paired across start->end to stay
> balanced, lock elision must happen in both or none.  To meet that
> requirement, add a rwsem to prevent memslot updates across range_start()
> and range_end().
>
> Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
> and the lock will be held across the entire start() ... end() sequence.
> If anything in the sequence sleeps, including the caller or a different
> notifier, holding the spinlock would be disastrous.
>
> For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
> the slow path of unconditionally acquiring mmu_lock.  The sane
> alternative would be to try to acquire the lock and force the notifier
> to retry on failure.  But since OOM is currently the _only_ scenario
> where blocking is disallowed, attempting to optimize a guest that has been
> marked for death is pointless.
>
> Unconditionally define and use mmu_notifier_slots_lock in the memslots
> code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
> is negligible when the lock is uncontested, which will always be the case
> when the MMU notifiers are not used.
>
> Note, technically flag-only memslot updates could be allowed in parallel,
> but stalling a memslot update for a relatively short amount of time is
> not a scalability issue, and this is all more than complex enough.
>
> Based heavily on code from Ben Gardon.
>
> Suggested-by: Ben Gardon 
> Signed-off-by: Sean Christopherson 

I saw this splatting:

 ==
 WARNING: possible circular locking dependency detected
 5.12.0-rc3+ #6 Tainted: G   OE
 --
 qemu-system-x86/3069 is trying to acquire lock:
 9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
at: __mmu_notifier_invalidate_range_end+0x5/0x190

 but task is already holding lock:
 aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 which lock already depends on the new lock.


 the existing dependency chain (in reverse order) is:

 -> #1 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}:
down_read+0x48/0x250
kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
__mmu_notifier_invalidate_range_start+0xe8/0x260
wp_page_copy+0x82b/0xa30
do_wp_page+0xde/0x420
__handle_mm_fault+0x935/0x1230
handle_mm_fault+0x179/0x420
do_user_addr_fault+0x1b3/0x690
exc_page_fault+0x82/0x2b0
asm_exc_page_fault+0x1e/0x30

 -> #0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
__lock_acquire+0x110f/0x1980
lock_acquire+0x1bc/0x400
__mmu_notifier_invalidate_range_end+0x47/0x190
wp_page_copy+0x796/0xa30
do_wp_page+0xde/0x420
__handle_mm_fault+0x935/0x1230
handle_mm_fault+0x179/0x420
do_user_addr_fault+0x1b3/0x690
exc_page_fault+0x82/0x2b0
asm_exc_page_fault+0x1e/0x30

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1

   lock(&kvm->mmu_notifier_slots_lock);
                                lock(mmu_notifier_invalidate_range_start);
                                lock(&kvm->mmu_notifier_slots_lock);
   lock(mmu_notifier_invalidate_range_start);

  *** DEADLOCK ***

 2 locks held by qemu-system-x86/3069:
  #0: 9e4269f8a9e0 (&mm->mmap_lock#2){}-{3:3}, at:
do_user_addr_fault+0x10e/0x690
  #1: aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3},
at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

 stack backtrace:
 CPU: 0 PID: 3069 Comm: qemu-system-x86 Tainted: G   OE
5.12.0-rc3+ #6
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS
FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x87/0xb7
  print_circular_bug.isra.39+0x1b4/0x210
  check_noncircular+0x103/0x150
  __lock_acquire+0x110f/0x1980
  ? __lock_acquire+0x110f/0x1980
  lock_acquire+0x1bc/0x400
  ? __mmu_notifier_invalidate_range_end+0x5/0x190
  ? find_held_lock+0x40/0xb0
  __mmu_notifier_invalidate_range_end+0x47/0x190
  ? __mmu_notifier_invalidate_range_end+0x5/0x190
  wp_page_copy+0x796/0xa30
  do_wp_page+0xde/0x420
  __handle_mm_fault+0x935/0x1230
  handle_mm_fault+0x179/0x420
  do_user_addr_fault+0x1b3/0x690
  ? rcu_read_lock_sched_held+0x4f/0x80
  exc_page_fault+0x82/0x2b0
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x55f5bef2560f


Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-02 Thread Sean Christopherson
On Fri, Apr 02, 2021, Paolo Bonzini wrote:
> On 02/04/21 02:56, Sean Christopherson wrote:
> > Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
> > notifications.  Because mmu_notifier_count must be modified while holding
> > mmu_lock for write, and must always be paired across start->end to stay
> > balanced, lock elision must happen in both or none.  To meet that
> > requirement, add a rwsem to prevent memslot updates across range_start()
> > and range_end().
> > 
> > Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
> > and the lock will be held across the entire start() ... end() sequence.
> > If anything in the sequence sleeps, including the caller or a different
> > notifier, holding the spinlock would be disastrous.
> > 
> > For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
> > the slow path of unconditionally acquiring mmu_lock.  The sane
> > alternative would be to try to acquire the lock and force the notifier
> > to retry on failure.  But since OOM is currently the _only_ scenario
> > where blocking is disallowed, attempting to optimize a guest that has been
> > marked for death is pointless.
> > 
> > Unconditionally define and use mmu_notifier_slots_lock in the memslots
> > code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
> > is negligible when the lock is uncontested, which will always be the case
> > when the MMU notifiers are not used.
> > 
> > Note, technically flag-only memslot updates could be allowed in parallel,
> > but stalling a memslot update for a relatively short amount of time is
> > not a scalability issue, and this is all more than complex enough.
> 
> Proposal for the locking documentation:

Argh, sorry!  Looks great, I owe you.

> diff --git a/Documentation/virt/kvm/locking.rst 
> b/Documentation/virt/kvm/locking.rst
> index b21a34c34a21..3e4ad7de36cb 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -16,6 +16,13 @@ The acquisition orders for mutexes are as follows:
>  - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
>them together is quite rare.
> +- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of
> +  invalidate_range_start() and invalidate_range_end() callbacks
> +  use the same memslots array.  kvm->slots_lock is taken outside the
> +  write-side critical section of kvm->mmu_notifier_slots_lock, so
> +  MMU notifiers must not take kvm->slots_lock.  No other write-side
> +  critical sections should be added.
> +
>  On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
>  Everything else is a leaf: no other lock is taken inside the critical
> 
> Paolo
> 


Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-02 Thread Paolo Bonzini

On 02/04/21 02:56, Sean Christopherson wrote:

Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
notifications.  Because mmu_notifier_count must be modified while holding
mmu_lock for write, and must always be paired across start->end to stay
balanced, lock elision must happen in both or none.  To meet that
requirement, add a rwsem to prevent memslot updates across range_start()
and range_end().

Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
and the lock will be held across the entire start() ... end() sequence.
If anything in the sequence sleeps, including the caller or a different
notifier, holding the spinlock would be disastrous.

For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
the slow path of unconditionally acquiring mmu_lock.  The sane
alternative would be to try to acquire the lock and force the notifier
to retry on failure.  But since OOM is currently the _only_ scenario
where blocking is disallowed, attempting to optimize a guest that has been
marked for death is pointless.

Unconditionally define and use mmu_notifier_slots_lock in the memslots
code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
is negligible when the lock is uncontested, which will always be the case
when the MMU notifiers are not used.

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.


Proposal for the locking documentation:

diff --git a/Documentation/virt/kvm/locking.rst 
b/Documentation/virt/kvm/locking.rst
index b21a34c34a21..3e4ad7de36cb 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,6 +16,13 @@ The acquisition orders for mutexes are as follows:
 - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
   them together is quite rare.
 
+- The kvm->mmu_notifier_slots_lock rwsem ensures that pairs of

+  invalidate_range_start() and invalidate_range_end() callbacks
+  use the same memslots array.  kvm->slots_lock is taken outside the
+  write-side critical section of kvm->mmu_notifier_slots_lock, so
+  MMU notifiers must not take kvm->slots_lock.  No other write-side
+  critical sections should be added.
+
 On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
 
 Everything else is a leaf: no other lock is taken inside the critical


Paolo