Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time

2018-02-09 Thread Christoffer Dall
On Thu, Feb 08, 2018 at 05:19:22PM +0000, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
> >>On arm/arm64 we pre-allocate the entry level page tables when
> >>a VM is created and they are freed when either all the mm users are
> >>gone or the KVM is about to get destroyed. i.e., kvm_free_stage2_pgd
> >>is triggered via kvm_arch_flush_shadow_all() which can be invoked
> >>from two different paths:
> >>
> >>  1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
> >> OR
> >>  2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
> >>
> >>This has created a lot of race conditions in the past as some of
> >>the VCPUs could be active when we free the stage2 via path (1).
> >
> >How??  mmu_notifier->release() is called via __mmput->exit_mmap(), which
> >is only called if mm_users == 0, which means there are no more threads
> >left other than the one currently doing exit().
> 
> IIRC, if the process is sent a fatal signal, that could cause all the threads
> to exit, leaving the "last" thread to do the clean up. The files could still
> be open, implying that the KVM fds are still active, without a stage2, even
> though we are not going to run anything. (The race was fixed by moving the
> stage2 teardown to mmu_notifier->release()).
> 
> 

Hmm, if the last thread is in do_exit(), by definition there can't be any
other VCPU thread (because then it wouldn't be the last one), and
therefore only this last exiting thread can have the fd open, and since
it's in the middle of do_exit(), it will close the fds before anything
will have a chance to run.

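For context on where these two paths meet: both end up in the same arch
hook via the generic KVM code. A minimal sketch, paraphrased from
virt/kvm/kvm_main.c of that era rather than quoted verbatim:

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	kvm_arch_flush_shadow_all(kvm);
	srcu_read_unlock(&kvm->srcu, idx);
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
	/* invalidate/age callbacks elided */
	.release	= kvm_mmu_notifier_release,
};

Path (2) reaches the same callback because mmu_notifier_unregister()
invokes ->release() if it has not already run.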
> >
> >>
> >>On a closer look, all we need to do with kvm_arch_flush_shadow_all() is
> >>to ensure that the stage2 mappings are cleared. This doesn't mean we
> >>have to free up the stage2 entry level page tables yet, which could
> >>be delayed until the kvm is destroyed. This would avoid issues
> >>of use-after-free,
> >
> >do we have any of those left?
> 
> None that we know of.
> 

Then I think this commit text is misleading and pretty confusing.  If
we have a correct implementation, but we want to clean something up,
then this commit message shouldn't talk about races or use-after-free,
it should just say that we rearrange code to change the flow, and
describe why/how.

> >
> >>This will be later used for delaying
> >>the allocation of the stage2 entry level page tables until we really
> >>need to do something with them.
> >
> >Fine, but you don't actually explain why this change of flow is
> >necessary for what you're trying to do later?
> 
> This patch is not mandatory for the series. But, since we are delaying
> the "allocation" of stage2 tables anyway later, I thought it would be
> good to clean up the "free" path.
> 

Hmm, I'm not really sure it is a cleanup.  In any case, the motivation
for this change should be clear.  I do like the idea of getting rid of
the kvm->arch.pgd checks in the various stage2 manipulation functions.

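To make the proposed flow concrete, here is a condensed sketch of the pgd
lifetime after this patch (abbreviated from the diff quoted below, not
the literal patch):

/* Both notifier paths now only tear down mappings; the pgd survives. */
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
	kvm_flush_stage2_all(kvm);	/* unmap everything under mmu_lock */
}

/* The entry level tables are freed only once, at VM destruction. */
void kvm_arch_destroy_vm(struct kvm *kvm)
{
	/* ... vcpu teardown elided ... */
	kvm_free_stage2_pgd(kvm);	/* no users of the VM remain here */
}

With that split, kvm->arch.pgd stays valid from VM creation until
kvm_arch_destroy_vm(), which is what would let the NULL checks go away.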
> >>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 78253fe00fc4..c94c61ac38b9 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> >>pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> >>do {
> >>/*
> >>-* Make sure the page table is still active, as another thread
> >>-* could have possibly freed the page table, while we released
> >>-* the lock.
> >>+* The page table shouldn't be free'd as we still hold a reference
> >>+* to the KVM.
> >
> >To avoid confusion about references to the kernel module KVM and a
> >specific KVM VM instance, please s/KVM/VM/.
> 
> ok.
> 
> >
> >> */
> >>-   if (!READ_ONCE(kvm->arch.pgd))
> >>+   if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
> >
> >This reads a lot like a defensive implementation now, and I think for
> >this patch to make sense, we shouldn't try to handle a buggy super-racy
> >implementation gracefully, but rather have VM_BUG_ON() (if we care about
> >performance of the check) or simply BUG_ON().
> >
> >The rationale being that if we've gotten this flow incorrect and freed
> >the pgd at the wrong time, we don't want to leave a ticking bomb to blow
> >up somewhere else randomly (which it will!), but instead crash and burn.
> 
> Ok.
> 
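For readers weighing the alternatives, the three macros differ in what
happens when the condition fires; roughly, per include/asm-generic/bug.h
and include/linux/mmdebug.h:

/* WARN_ON(): stack trace plus kernel taint, then keep running. It
 * returns the condition, so it can guard a graceful bail-out: */
if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
	break;

/* BUG_ON(): oops right here, killing the offending thread, so a
 * corrupted flow cannot propagate any further: */
BUG_ON(!READ_ONCE(kvm->arch.pgd));

/* VM_BUG_ON(): compiles to BUG_ON() only with CONFIG_DEBUG_VM=y and
 * costs nothing at runtime otherwise: */
VM_BUG_ON(!READ_ONCE(kvm->arch.pgd));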
> >
> >>break;
> >>next = stage2_pgd_addr_end(addr, end);
> >>if (!stage2_pgd_none(*pgd))
> >>@@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
> >>up_read(&current->mm->mmap_sem);
> >>srcu_read_unlock(&kvm->srcu, idx);
> >>  }
> >>-
> >>-/**
> >>- * kvm_free_stage2_pgd - free all stage-2 tables
> >>- * @kvm:   The KVM struct pointer for the VM.
> >>- *
> >>- * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
> >>- * underlying level-2 and level-3 tables before freeing the actual level-1 table
> >>- * and setting the struct pointer to NULL.

Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time

2018-02-08 Thread Suzuki K Poulose

On 08/02/18 11:00, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
>> On arm/arm64 we pre-allocate the entry level page tables when
>> a VM is created and they are freed when either all the mm users are
>> gone or the KVM is about to get destroyed. i.e., kvm_free_stage2_pgd
>> is triggered via kvm_arch_flush_shadow_all() which can be invoked
>> from two different paths:
>>
>>   1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
>>  OR
>>   2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
>>
>> This has created a lot of race conditions in the past as some of
>> the VCPUs could be active when we free the stage2 via path (1).
>
> How??  mmu_notifier->release() is called via __mmput->exit_mmap(), which
> is only called if mm_users == 0, which means there are no more threads
> left other than the one currently doing exit().

IIRC, if the process is sent a fatal signal, that could cause all the threads
to exit, leaving the "last" thread to do the clean up. The files could still
be open, implying that the KVM fds are still active, without a stage2, even
though we are not going to run anything. (The race was fixed by moving the
stage2 teardown to mmu_notifier->release()).

>> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is
>> to ensure that the stage2 mappings are cleared. This doesn't mean we
>> have to free up the stage2 entry level page tables yet, which could
>> be delayed until the kvm is destroyed. This would avoid issues
>> of use-after-free,
>
> do we have any of those left?

None that we know of.

>> This will be later used for delaying
>> the allocation of the stage2 entry level page tables until we really
>> need to do something with them.
>
> Fine, but you don't actually explain why this change of flow is
> necessary for what you're trying to do later?

This patch is not mandatory for the series. But, since we are delaying
the "allocation" of stage2 tables anyway later, I thought it would be
good to clean up the "free" path.

>>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 78253fe00fc4..c94c61ac38b9 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>> do {
>> /*
>> -* Make sure the page table is still active, as another thread
>> -* could have possibly freed the page table, while we released
>> -* the lock.
>> +* The page table shouldn't be free'd as we still hold a reference
>> +* to the KVM.
>
> To avoid confusion about references to the kernel module KVM and a
> specific KVM VM instance, please s/KVM/VM/.

ok.

>>  */
>> -   if (!READ_ONCE(kvm->arch.pgd))
>> +   if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
>
> This reads a lot like a defensive implementation now, and I think for
> this patch to make sense, we shouldn't try to handle a buggy super-racy
> implementation gracefully, but rather have VM_BUG_ON() (if we care about
> performance of the check) or simply BUG_ON().
>
> The rationale being that if we've gotten this flow incorrect and freed
> the pgd at the wrong time, we don't want to leave a ticking bomb to blow
> up somewhere else randomly (which it will!), but instead crash and burn.

Ok.

>> break;
>> next = stage2_pgd_addr_end(addr, end);
>> if (!stage2_pgd_none(*pgd))
>> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
>> up_read(&current->mm->mmap_sem);
>> srcu_read_unlock(&kvm->srcu, idx);
>>   }
>> -
>> -/**
>> - * kvm_free_stage2_pgd - free all stage-2 tables
>> - * @kvm:   The KVM struct pointer for the VM.
>> - *
>> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
>> - * underlying level-2 and level-3 tables before freeing the actual level-1 table
>> - * and setting the struct pointer to NULL.
>> +/*
>> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
>> + * device and regular RAM backing memory.
>>    */
>> -void kvm_free_stage2_pgd(struct kvm *kvm)
>> +static void kvm_flush_stage2_all(struct kvm *kvm)
>>   {
>> -   void *pgd = NULL;
>> -
>> spin_lock(&kvm->mmu_lock);
>> -   if (kvm->arch.pgd) {
>> +   if (kvm->arch.pgd)
>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>> -   pgd = READ_ONCE(kvm->arch.pgd);
>> -   kvm->arch.pgd = NULL;
>> -   }
>> spin_unlock(&kvm->mmu_lock);
>> +}
>>
>> -	/* Free the HW pgd, one page at a time */
>> -   if (pgd)
>> -   free_pages_exact(pgd, S2_PGD_SIZE);
>> +/**
>> + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
>
> nit: you should put the parameter description here and leave a blank
> line before the lengthy discussion.
>
>> + * This is called when all reference to the KVM has gone away and we
>> + * really don't need any protection in resetting the PGD. This also
>
> I don't 

Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time

2018-02-08 Thread Christoffer Dall
On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
> On arm/arm64 we pre-allocate the entry level page tables when
> a VM is created and they are freed when either all the mm users are
> gone or the KVM is about to get destroyed. i.e., kvm_free_stage2_pgd
> is triggered via kvm_arch_flush_shadow_all() which can be invoked
> from two different paths:
> 
>  1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
> OR
>  2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
> 
> This has created a lot of race conditions in the past as some of
> the VCPUs could be active when we free the stage2 via path (1).

How??  mmu_notifier->release() is called via __mmput->exit_mmap(), which
is only called if mm_users == 0, which means there are no more threads
left other than the one currently doing exit().

> 
> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is
> to ensure that the stage2 mappings are cleared. This doesn't mean we
> have to free up the stage2 entry level page tables yet, which could
> be delayed until the kvm is destroyed. This would avoid issues
> of use-after-free,

do we have any of those left?

> This will be later used for delaying
> the allocation of the stage2 entry level page tables until we really
> need to do something with them.

Fine, but you don't actually explain why this change of flow is
necessary for what you're trying to do later?

> 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Signed-off-by: Suzuki K Poulose 
> ---
>  virt/kvm/arm/arm.c |  1 +
>  virt/kvm/arm/mmu.c | 56 --
>  2 files changed, 30 insertions(+), 27 deletions(-)
> 
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index c8d49879307f..19b720ddedce 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -189,6 +189,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>   }
>   }
>   atomic_set(&kvm->online_vcpus, 0);
> + kvm_free_stage2_pgd(kvm);
>  }
>  
>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 78253fe00fc4..c94c61ac38b9 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>   pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>   do {
>   /*
> -  * Make sure the page table is still active, as another thread
> -  * could have possibly freed the page table, while we released
> -  * the lock.
> +  * The page table shouldn't be free'd as we still hold a reference
> +  * to the KVM.

To avoid confusion about references to the kernel module KVM and a
specific KVM VM instance, please s/KVM/VM/.

>*/
> - if (!READ_ONCE(kvm->arch.pgd))
> + if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))

This reads a lot like a defensive implementation now, and I think for
this patch to make sense, we shouldn't try to handle a buggy super-racy
implementation gracefully, but rather have VM_BUG_ON() (if we care about
performance of the check) or simply BUG_ON().

The rationale being that if we've gotten this flow incorrect and freed
the pgd at the wrong time, we don't want to leave a ticking bomb to blow
up somewhere else randomly (which it will!), but instead crash and burn.

>   break;
>   next = stage2_pgd_addr_end(addr, end);
>   if (!stage2_pgd_none(*pgd))
> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
>   up_read(&current->mm->mmap_sem);
>   srcu_read_unlock(&kvm->srcu, idx);
>  }
> -
> -/**
> - * kvm_free_stage2_pgd - free all stage-2 tables
> - * @kvm: The KVM struct pointer for the VM.
> - *
> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
> - * underlying level-2 and level-3 tables before freeing the actual level-1 
> table
> - * and setting the struct pointer to NULL.
> +/*
> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
> + * device and regular RAM backing memory.
>   */
> -void kvm_free_stage2_pgd(struct kvm *kvm)
> +static void kvm_flush_stage2_all(struct kvm *kvm)
>  {
> - void *pgd = NULL;
> -
>   spin_lock(&kvm->mmu_lock);
> - if (kvm->arch.pgd) {
> + if (kvm->arch.pgd)
>   unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> - pgd = READ_ONCE(kvm->arch.pgd);
> - kvm->arch.pgd = NULL;
> - }
>   spin_unlock(&kvm->mmu_lock);
> +}
>  
> - /* Free the HW pgd, one page at a time */
> - if (pgd)
> - free_pages_exact(pgd, S2_PGD_SIZE);
> +/**
> + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.

nit: you should put the parameter description here and leave a blank
line before the lengthy discussion.

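For reference, the layout kernel-doc expects (see
Documentation/doc-guide/kernel-doc.rst), sketched with this function's
own names:

/**
 * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
 * @kvm:	The KVM struct pointer for the VM.
 *
 * The lengthy discussion goes here, after the blank line, so that the
 * parameter description stays attached to the one-line summary.
 */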
> + * This is called when all reference to the KVM has gone away and we
> + * really don't need any protection in resetting the PGD. This also

I don't 