Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-18 Thread Christoffer Dall
On Mon, Dec 16, 2019 at 10:31:19AM +, Marc Zyngier wrote:
> On 2019-12-13 11:14, Christoffer Dall wrote:
> > On Fri, Dec 13, 2019 at 09:28:59AM +, Marc Zyngier wrote:
> > > Hi Christoffer,
> > > 
> > > On 2019-12-13 08:29, Christoffer Dall wrote:
> > > > Hi Marc,
> > > >
> > > > On Wed, Dec 11, 2019 at 04:56:48PM +, Marc Zyngier wrote:
> > > > > A device mapping is normally always mapped at Stage-2, since there
> > > > > is very little gain in having it faulted in.
> > > >
> > > > It is actually becoming less clear to me what the real benefits of
> > > > pre-populating the stage 2 page table are, especially given that we can
> > > > provoke a situation where they're faulted in anyhow.  Do you recall if
> > > > we had any specific case that motivated us to pre-fault in the pages?
> > > 
> > > It's only a minor performance optimization that was introduced by Ard in
> > > 8eef91239e57d. Which makes sense for platform devices that have a single
> > > fixed location in memory. It makes slightly less sense for PCI, where
> > > you can move things around.
> > 
> > User space could still decide to move things around in its VA map even
> > if the device is fixed.
> > 
> > Anyway, I was thinking more if there was some sort of device, like a
> > framebuffer, which for example crosses page boundaries and where it would
> > be visible to the user that there's a sudden performance drop while
> > operating the device over page boundaries.  Anything like that?
> > 
> > > 
> > > > > Nonetheless, it is possible to end-up in a situation where the device
> > > > > mapping has been removed from Stage-2 (userspace munmaped the VFIO
> > > > > region, and the MMU notifier did its job), but present in a userspace
> > > > > mapping (userspace has mapped it back at the same address). In such
> > > > > a situation, the device mapping will be demand-paged as the guest
> > > > > performs memory accesses.
> > > > >
> > > > > This requires to be careful when dealing with mapping size, cache
> > > > > management, and to handle potential execution of a device mapping.
> > > > >
> > > > > Cc: sta...@vger.kernel.org
> > > > > Reported-by: Alexandru Elisei 
> > > > > Signed-off-by: Marc Zyngier 
> > > > > ---
> > > > >  virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
> > > > >  1 file changed, 17 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > > > > index a48994af70b8..0b32a904a1bb 100644
> > > > > --- a/virt/kvm/arm/mmu.c
> > > > > +++ b/virt/kvm/arm/mmu.c
> > > > > @@ -38,6 +38,11 @@ static unsigned long io_map_base;
> > > > >  #define KVM_S2PTE_FLAG_IS_IOMAP  (1UL << 0)
> > > > >  #define KVM_S2_FLAG_LOGGING_ACTIVE   (1UL << 1)
> > > > >
> > > > > +static bool is_iomap(unsigned long flags)
> > > > > +{
> > > > > + return flags & KVM_S2PTE_FLAG_IS_IOMAP;
> > > > > +}
> > > > > +
> > > >
> > > > nit: I'm not really sure this indirection makes the code more readable,
> > > > but I guess that's a matter of taste.
> > > >
> > > > >  static bool memslot_is_logging(struct kvm_memory_slot *memslot)
> > > > >  {
> > > > >   return memslot->dirty_bitmap && !(memslot->flags &
> > > > > KVM_MEM_READONLY);
> > > > > @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu
> > > > > *vcpu, phys_addr_t fault_ipa,
> > > > >
> > > > >   vma_pagesize = vma_kernel_pagesize(vma);
> > > > >   if (logging_active ||
> > > > > + (vma->vm_flags & VM_PFNMAP) ||
> > > >
> > > > What is actually the rationale for this?
> > > >
> > > > Why is a huge mapping not permitted to device memory?
> > > >
> > > > Are we guaranteed that VM_PFNMAP on the vma results in device mappings?
> > > > I'm not convinced this is the case, and it would be better if we can
> > > > stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to
> > > > detect device mappings.
> > > 
> > > For now, I've tried to keep the two paths that deal with mapping devices
> > > (or rather, things that we interpret as devices) as close as possible.
> > > If we drop the "eager" mapping, then we're at liberty to restructure
> > > this in creative ways.
> > > 
> > > This includes potential huge mappings, but I'm not sure the rest of the
> > > kernel uses them for devices anyway (I need to find out).
> > > 
> > > > As a subsequent patch, I'd like to make sure that at the very least our
> > > > memslot prepare function follows the exact same logic for mapping device
> > > > memory as a fault-in approach does, or that we simply always fault pages
> > > > in.
> > > 
> > > As far as I can see, the two approaches are now identical. Am I missing
> > > something?
> > > And yes, getting rid of the eager mapping works for me.
> > > 
> > 
> > As far as I can tell, our user_mem_abort() uses gfn_to_pfn_prot() which
> > goes down a long trail which 

Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-16 Thread Marc Zyngier

On 2019-12-13 11:14, Christoffer Dall wrote:

On Fri, Dec 13, 2019 at 09:28:59AM +, Marc Zyngier wrote:

Hi Christoffer,

On 2019-12-13 08:29, Christoffer Dall wrote:
> Hi Marc,
>
> On Wed, Dec 11, 2019 at 04:56:48PM +, Marc Zyngier wrote:
> > A device mapping is normally always mapped at Stage-2, since there
> > is very little gain in having it faulted in.
>
> It is actually becoming less clear to me what the real benefits of
> pre-populating the stage 2 page table are, especially given that we can
> provoke a situation where they're faulted in anyhow.  Do you recall if
> we had any specific case that motivated us to pre-fault in the pages?


It's only a minor performance optimization that was introduced by Ard in
8eef91239e57d. Which makes sense for platform devices that have a single
fixed location in memory. It makes slightly less sense for PCI, where
you can move things around.


User space could still decide to move things around in its VA map even
if the device is fixed.

Anyway, I was thinking more if there was some sort of device, like a
framebuffer, which for example crosses page boundaries and where it would
be visible to the user that there's a sudden performance drop while
operating the device over page boundaries.  Anything like that?



> > Nonetheless, it is possible to end-up in a situation where the device
> > mapping has been removed from Stage-2 (userspace munmaped the VFIO
> > region, and the MMU notifier did its job), but present in a userspace
> > mapping (userspace has mapped it back at the same address). In such
> > a situation, the device mapping will be demand-paged as the guest
> > performs memory accesses.
> >
> > This requires to be careful when dealing with mapping size, cache
> > management, and to handle potential execution of a device mapping.

> >
> > Cc: sta...@vger.kernel.org
> > Reported-by: Alexandru Elisei 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index a48994af70b8..0b32a904a1bb 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -38,6 +38,11 @@ static unsigned long io_map_base;
> >  #define KVM_S2PTE_FLAG_IS_IOMAP  (1UL << 0)
> >  #define KVM_S2_FLAG_LOGGING_ACTIVE   (1UL << 1)
> >
> > +static bool is_iomap(unsigned long flags)
> > +{
> > + return flags & KVM_S2PTE_FLAG_IS_IOMAP;
> > +}
> > +
>
> nit: I'm not really sure this indirection makes the code more readable,
> but I guess that's a matter of taste.
>
> >  static bool memslot_is_logging(struct kvm_memory_slot *memslot)
> >  {
> >   return memslot->dirty_bitmap && !(memslot->flags &
> > KVM_MEM_READONLY);
> > @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu
> > *vcpu, phys_addr_t fault_ipa,
> >
> >   vma_pagesize = vma_kernel_pagesize(vma);
> >   if (logging_active ||
> > + (vma->vm_flags & VM_PFNMAP) ||
>
> What is actually the rationale for this?
>
> Why is a huge mapping not permitted to device memory?
>
> Are we guaranteed that VM_PFNMAP on the vma results in device mappings?
> I'm not convinced this is the case, and it would be better if we can
> stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to
> detect device mappings.

For now, I've tried to keep the two paths that deal with mapping devices
(or rather, things that we interpret as devices) as close as possible.
If we drop the "eager" mapping, then we're at liberty to restructure
this in creative ways.

This includes potential huge mappings, but I'm not sure the rest of the
kernel uses them for devices anyway (I need to find out).

> As a subsequent patch, I'd like to make sure that at the very least our
> memslot prepare function follows the exact same logic for mapping device
> memory as a fault-in approach does, or that we simply always fault pages
> in.

As far as I can see, the two approaches are now identical. Am I missing
something?
And yes, getting rid of the eager mapping works for me.



As far as I can tell, our user_mem_abort() uses gfn_to_pfn_prot() which
goes down a long trail which ends up at hva_to_pfn_remapped(), which
might result in doing the same offset calculation that we do in
kvm_arch_prepare_memory_region(), but it also considers other scenarios.

Even if we analyze all that and convince ourselves it's always all the
same on arm64, the two code paths could change, leading to really hard
to debug differing behavior, and nobody will actively keep the two paths
in sync.  I'd be fine with keeping the performance optimization if we
have good grounds for that though, and using the same translation
mechanism for VM_PFNMAP as user_mem_abort.

Am I missing something?


I'm not disputing any of the above. I'm only trying to keep this patch
minimal so that we can easily backport it (although it is arguable that
deleting 

Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-13 Thread Christoffer Dall
On Fri, Dec 13, 2019 at 09:28:59AM +, Marc Zyngier wrote:
> Hi Christoffer,
> 
> On 2019-12-13 08:29, Christoffer Dall wrote:
> > Hi Marc,
> > 
> > On Wed, Dec 11, 2019 at 04:56:48PM +, Marc Zyngier wrote:
> > > A device mapping is normally always mapped at Stage-2, since there
> > > is very little gain in having it faulted in.
> > 
> > It is actually becoming less clear to me what the real benefits of
> > pre-populating the stage 2 page table are, especially given that we can
> > provoke a situation where they're faulted in anyhow.  Do you recall if
> > we had any specific case that motivated us to pre-fault in the pages?
> 
> It's only a minor performance optimization that was introduced by Ard in
> 8eef91239e57d. Which makes sense for platform devices that have a single
> fixed location in memory. It makes slightly less sense for PCI, where
> you can move things around.

User space could still decide to move things around in its VA map even
if the device is fixed.

Anyway, I was thinking more if there was some sort of device, like a
framebuffer, which for example crosses page boundaries and where it would
be visible to the user that there's a sudden performance drop while
operating the device over page boundaries.  Anything like that?

> 
> > > Nonetheless, it is possible to end-up in a situation where the
> > > device
> > > mapping has been removed from Stage-2 (userspace munmaped the VFIO
> > > region, and the MMU notifier did its job), but present in a
> > > userspace
> > > mapping (userspace has mapped it back at the same address). In such
> > > a situation, the device mapping will be demand-paged as the guest
> > > performs memory accesses.
> > > 
> > > This requires to be careful when dealing with mapping size, cache
> > > management, and to handle potential execution of a device mapping.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Alexandru Elisei 
> > > Signed-off-by: Marc Zyngier 
> > > ---
> > >  virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
> > >  1 file changed, 17 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > > index a48994af70b8..0b32a904a1bb 100644
> > > --- a/virt/kvm/arm/mmu.c
> > > +++ b/virt/kvm/arm/mmu.c
> > > @@ -38,6 +38,11 @@ static unsigned long io_map_base;
> > >  #define KVM_S2PTE_FLAG_IS_IOMAP  (1UL << 0)
> > >  #define KVM_S2_FLAG_LOGGING_ACTIVE   (1UL << 1)
> > > 
> > > +static bool is_iomap(unsigned long flags)
> > > +{
> > > + return flags & KVM_S2PTE_FLAG_IS_IOMAP;
> > > +}
> > > +
> > 
> > nit: I'm not really sure this indirection makes the code more readable,
> > but I guess that's a matter of taste.
> > 
> > >  static bool memslot_is_logging(struct kvm_memory_slot *memslot)
> > >  {
> > >   return memslot->dirty_bitmap && !(memslot->flags &
> > > KVM_MEM_READONLY);
> > > @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu
> > > *vcpu, phys_addr_t fault_ipa,
> > > 
> > >   vma_pagesize = vma_kernel_pagesize(vma);
> > >   if (logging_active ||
> > > + (vma->vm_flags & VM_PFNMAP) ||
> > 
> > What is actually the rationale for this?
> > 
> > Why is a huge mapping not permitted to device memory?
> > 
> > Are we guaranteed that VM_PFNMAP on the vma results in device mappings?
> > I'm not convinced this is the case, and it would be better if we can
> > stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to
> > detect device mappings.
> 
> For now, I've tried to keep the two paths that deal with mapping devices
> (or rather, things that we interpret as devices) as close as possible.
> If we drop the "eager" mapping, then we're at liberty to restructure
> this in creative ways.
> 
> This includes potential huge mappings, but I'm not sure the rest of the
> kernel uses them for devices anyway (I need to find out).
> 
> > As a subsequent patch, I'd like to make sure that at the very least our
> > memslot prepare function follows the exact same logic for mapping device
> > memory as a fault-in approach does, or that we simply always fault pages
> > in.
> 
> As far as I can see, the two approaches are now identical. Am I missing
> something?
> And yes, getting rid of the eager mapping works for me.
> 

As far as I can tell, our user_mem_abort() uses gfn_to_pfn_prot() which
goes down a long trail which ends up at hva_to_pfn_remapped(), which
might result in doing the same offset calculation that we do in
kvm_arch_prepare_memory_region(), but it also considers other scenarios.

Even if we analyze all that and convince ourselves it's always all the
same on arm64, the two code paths could change, leading to really hard
to debug differing behavior, and nobody will actively keep the two paths
in sync.  I'd be fine with keeping the performance optimization if we
have good grounds for that though, and using the same translation
mechanism for VM_PFNMAP as user_mem_abort.

Am I missing something?
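
For reference, the two translations being compared look roughly like this
(a simplified sketch from memory of the code paths named above, not
verbatim kernel source):

    /* Eager path: kvm_arch_prepare_memory_region() derives the PA of a
     * VM_PFNMAP memslot directly from the VMA's vm_pgoff. */
    phys_addr_t pa = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
    pa += hva - vma->vm_start;   /* offset of the address within the VMA */

    /* Fault path: user_mem_abort() -> gfn_to_pfn_prot() -> hva_to_pfn(),
     * which for a VM_PFNMAP VMA falls through to hva_to_pfn_remapped()
     * and asks the MM layer (follow_pfn() and friends) for the PFN
     * behind the HVA, including handling not-yet-populated mappings. */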

> > 
> > >   !fault_supports_stage2_huge_mapping(memslot, hva,
> > > 

Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-13 Thread Marc Zyngier

Hi Christoffer,

On 2019-12-13 08:29, Christoffer Dall wrote:

Hi Marc,

On Wed, Dec 11, 2019 at 04:56:48PM +, Marc Zyngier wrote:

A device mapping is normally always mapped at Stage-2, since there
is very little gain in having it faulted in.


It is actually becoming less clear to me what the real benefits of
pre-populating the stage 2 page table are, especially given that we can
provoke a situation where they're faulted in anyhow.  Do you recall if
we had any specific case that motivated us to pre-fault in the pages?


It's only a minor performance optimization that was introduced by Ard in
8eef91239e57d. Which makes sense for platform devices that have a single
fixed location in memory. It makes slightly less sense for PCI, where
you can move things around.

Nonetheless, it is possible to end-up in a situation where the device
mapping has been removed from Stage-2 (userspace munmaped the VFIO
region, and the MMU notifier did its job), but present in a userspace
mapping (userspace has mapped it back at the same address). In such
a situation, the device mapping will be demand-paged as the guest
performs memory accesses.

This requires to be careful when dealing with mapping size, cache
management, and to handle potential execution of a device mapping.

Cc: sta...@vger.kernel.org
Reported-by: Alexandru Elisei 
Signed-off-by: Marc Zyngier 
---
 virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a48994af70b8..0b32a904a1bb 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -38,6 +38,11 @@ static unsigned long io_map_base;
 #define KVM_S2PTE_FLAG_IS_IOMAP	(1UL << 0)
 #define KVM_S2_FLAG_LOGGING_ACTIVE	(1UL << 1)

+static bool is_iomap(unsigned long flags)
+{
+   return flags & KVM_S2PTE_FLAG_IS_IOMAP;
+}
+


nit: I'm not really sure this indirection makes the code more readable,
but I guess that's a matter of taste.


 static bool memslot_is_logging(struct kvm_memory_slot *memslot)
 {
 	return memslot->dirty_bitmap && !(memslot->flags & 
KVM_MEM_READONLY);
@@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,


vma_pagesize = vma_kernel_pagesize(vma);
if (logging_active ||
+   (vma->vm_flags & VM_PFNMAP) ||


What is actually the rationale for this?

Why is a huge mapping not permitted to device memory?

Are we guaranteed that VM_PFNMAP on the vma results in device mappings?
I'm not convinced this is the case, and it would be better if we can
stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to
detect device mappings.


For now, I've tried to keep the two paths that deal with mapping devices
(or rather, things that we interpret as devices) as close as possible.
If we drop the "eager" mapping, then we're at liberty to restructure
this in creative ways.

This includes potential huge mappings, but I'm not sure the rest of the
kernel uses them for devices anyway (I need to find out).

As a subsequent patch, I'd like to make sure that at the very least our
memslot prepare function follows the exact same logic for mapping device
memory as a fault-in approach does, or that we simply always fault pages
in.


As far as I can see, the two approaches are now identical. Am I missing 
something?

And yes, getting rid of the eager mapping works for me.



 	!fault_supports_stage2_huge_mapping(memslot, hva, 
vma_pagesize)) {

force_pte = true;
vma_pagesize = PAGE_SIZE;
@@ -1760,6 +1766,9 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,

writable = false;
}

+   if (exec_fault && is_iomap(flags))
+   return -ENOEXEC;
+


nit: why don't you just do this when checking kvm_is_device_pfn() and
avoid having logic in two places to deal with this case?


Good point. I've already sent the PR, but that could be a further 
cleanup.





spin_lock(&kvm->mmu_lock);
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
@@ -1781,7 +1790,7 @@ static int user_mem_abort(struct kvm_vcpu 
*vcpu, phys_addr_t fault_ipa,

if (writable)
kvm_set_pfn_dirty(pfn);

-   if (fault_status != FSC_PERM)
+   if (fault_status != FSC_PERM && !is_iomap(flags))
clean_dcache_guest_page(pfn, vma_pagesize);

if (exec_fault)
@@ -1948,9 +1957,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu 
*vcpu, struct kvm_run *run)

if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
if (is_iabt) {
/* Prefetch Abort on I/O address */
-   kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
-   ret = 1;
-   goto out_unlock;
+   ret = -ENOEXEC;
+   goto out;
}

/*
@@ 

Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-13 Thread Christoffer Dall
Hi Marc,

On Wed, Dec 11, 2019 at 04:56:48PM +, Marc Zyngier wrote:
> A device mapping is normally always mapped at Stage-2, since there
> is very little gain in having it faulted in.

It is actually becoming less clear to me what the real benefits of
pre-populating the stage 2 page table are, especially given that we can
provoke a situation where they're faulted in anyhow.  Do you recall if
we had any specific case that motivated us to pre-fault in the pages?

> 
> Nonetheless, it is possible to end-up in a situation where the device
> mapping has been removed from Stage-2 (userspace munmaped the VFIO
> region, and the MMU notifier did its job), but present in a userspace
> mapping (userspace has mapped it back at the same address). In such
> a situation, the device mapping will be demand-paged as the guest
> performs memory accesses.
> 
> This requires to be careful when dealing with mapping size, cache
> management, and to handle potential execution of a device mapping.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Alexandru Elisei 
> Signed-off-by: Marc Zyngier 
> ---
>  virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index a48994af70b8..0b32a904a1bb 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -38,6 +38,11 @@ static unsigned long io_map_base;
>  #define KVM_S2PTE_FLAG_IS_IOMAP  (1UL << 0)
>  #define KVM_S2_FLAG_LOGGING_ACTIVE   (1UL << 1)
>  
> +static bool is_iomap(unsigned long flags)
> +{
> + return flags & KVM_S2PTE_FLAG_IS_IOMAP;
> +}
> +

nit: I'm not really sure this indirection makes the code more readable,
but I guess that's a matter of taste.

>  static bool memslot_is_logging(struct kvm_memory_slot *memslot)
>  {
>   return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
> @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>  
>   vma_pagesize = vma_kernel_pagesize(vma);
>   if (logging_active ||
> + (vma->vm_flags & VM_PFNMAP) ||

What is actually the rationale for this?

Why is a huge mapping not permitted to device memory?

Are we guaranteed that VM_PFNMAP on the vma results in device mappings?
I'm not convinced this is the case, and it would be better if we can
stick to a single primitive (either kvm_is_device_pfn, or VM_PFNMAP) to
detect device mappings.
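
For reference, the two predicates in question are roughly the following
(a simplified sketch of the code at the time, plus a hypothetical
vma_is_device() helper to name the VMA check; not verbatim kernel source):

    /* Per-pfn test used by user_mem_abort(): anything without a
     * struct page behind it is treated as a device. */
    static bool kvm_is_device_pfn(unsigned long pfn)
    {
            return !pfn_valid(pfn);
    }

    /* Per-VMA test added by this patch: a PFNMAP mapping is assumed to
     * be a device.  The two do not have to agree, e.g. a VM_PFNMAP VMA
     * can in principle cover ordinary RAM that has a struct page. */
    static bool vma_is_device(struct vm_area_struct *vma)  /* hypothetical */
    {
            return vma->vm_flags & VM_PFNMAP;
    }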

As a subsequent patch, I'd like to make sure that at the very least our
memslot prepare function follows the exact same logic for mapping device
memory as a fault-in approach does, or that we simply always fault pages
in.

>   !fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) {
>   force_pte = true;
>   vma_pagesize = PAGE_SIZE;
> @@ -1760,6 +1766,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   writable = false;
>   }
>  
> + if (exec_fault && is_iomap(flags))
> + return -ENOEXEC;
> +

nit: why don't you just do this when checking kvm_is_device_pfn() and
avoid having logic in two places to deal with this case?
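
A sketch of what folding the check into that branch might look like
(hypothetical and untested, assuming the existing kvm_is_device_pfn()
branch in user_mem_abort(); shown only to illustrate the suggestion):

    if (kvm_is_device_pfn(pfn)) {
            /* refuse to execute from a device mapping up front */
            if (exec_fault)
                    return -ENOEXEC;
            mem_type = PAGE_S2_DEVICE;
            flags |= KVM_S2PTE_FLAG_IS_IOMAP;
    } else if (logging_active) {
            ...
    }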

>   spin_lock(&kvm->mmu_lock);
>   if (mmu_notifier_retry(kvm, mmu_seq))
>   goto out_unlock;
> @@ -1781,7 +1790,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (writable)
>   kvm_set_pfn_dirty(pfn);
>  
> - if (fault_status != FSC_PERM)
> + if (fault_status != FSC_PERM && !is_iomap(flags))
>   clean_dcache_guest_page(pfn, vma_pagesize);
>  
>   if (exec_fault)
> @@ -1948,9 +1957,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>   if (is_iabt) {
>   /* Prefetch Abort on I/O address */
> - kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
> - ret = 1;
> - goto out_unlock;
> + ret = -ENOEXEC;
> + goto out;
>   }
>  
>   /*
> @@ -1992,6 +2000,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
>   if (ret == 0)
>   ret = 1;
> +out:
> + if (ret == -ENOEXEC) {
> + kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
> + ret = 1;
> + }
>  out_unlock:
>   srcu_read_unlock(&vcpu->kvm->srcu, idx);
>   return ret;
> -- 
> 2.20.1
> 

I can't seem to decide for myself if I think there's a semantic
difference between trying to execute from somewhere the VMM has
explicitly told us is device memory and from somewhere which we happen
to have mapped with VM_PFNMAP from user space.  But I also can't seem to
really fault it (pun intended).  Thoughts?


Thanks,

Christoffer

Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-12 Thread James Morse
Hi Marc,

On 11/12/2019 16:56, Marc Zyngier wrote:
> A device mapping is normally always mapped at Stage-2, since there
> is very little gain in having it faulted in.
> 
> Nonetheless, it is possible to end-up in a situation where the device
> mapping has been removed from Stage-2 (userspace munmaped the VFIO
> region, and the MMU notifier did its job), but present in a userspace
> mapping (userpace has mapped it back at the same address). In such
> a situation, the device mapping will be demand-paged as the guest
> performs memory accesses.
> 
> This requires to be careful when dealing with mapping size, cache
> management, and to handle potential execution of a device mapping.

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-11 Thread Marc Zyngier

Hi Alex,

On 2019-12-11 17:53, Alexandru Elisei wrote:

Hi,

On 12/11/19 4:56 PM, Marc Zyngier wrote:

A device mapping is normally always mapped at Stage-2, since there
is very little gain in having it faulted in.

Nonetheless, it is possible to end-up in a situation where the device
mapping has been removed from Stage-2 (userspace munmaped the VFIO
region, and the MMU notifier did its job), but present in a userspace
mapping (userspace has mapped it back at the same address). In such
a situation, the device mapping will be demand-paged as the guest
performs memory accesses.

This requires to be careful when dealing with mapping size, cache
management, and to handle potential execution of a device mapping.

Cc: sta...@vger.kernel.org
Reported-by: Alexandru Elisei 
Signed-off-by: Marc Zyngier 
---
 virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)


[...]

I've tested this patch using the scenario that you described in the commit
message. I've also tested it with the following scenarios:

- The region is mmap'ed as backed by a VFIO device fd and the memslot is
  created, it is munmap'ed, then mmap'ed as anonymous.
- The region is mmap'ed as anonymous and the memslot is created, it is
  munmap'ed, then mmap'ed as backed by a VFIO device fd.

Everything worked as expected, so:

Tested-by: Alexandru Elisei 


Thanks for that!

Just a nitpick, but stage2_set_pte has a local variable iomap which can be
replaced with a call to is_iomap.


Yeah, I noticed. I'm just trying to keep the patch relatively small so
that it can be easily backported. The cleanup can come later! ;-)

Cheers,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-11 Thread Alexandru Elisei
Hi,

On 12/11/19 4:56 PM, Marc Zyngier wrote:
> A device mapping is normally always mapped at Stage-2, since there
> is very little gain in having it faulted in.
>
> Nonetheless, it is possible to end-up in a situation where the device
> mapping has been removed from Stage-2 (userspace munmaped the VFIO
> region, and the MMU notifier did its job), but present in a userspace
> mapping (userspace has mapped it back at the same address). In such
> a situation, the device mapping will be demand-paged as the guest
> performs memory accesses.
>
> This requires to be careful when dealing with mapping size, cache
> management, and to handle potential execution of a device mapping.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Alexandru Elisei 
> Signed-off-by: Marc Zyngier 
> ---
>  virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index a48994af70b8..0b32a904a1bb 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -38,6 +38,11 @@ static unsigned long io_map_base;
>  #define KVM_S2PTE_FLAG_IS_IOMAP  (1UL << 0)
>  #define KVM_S2_FLAG_LOGGING_ACTIVE   (1UL << 1)
>  
> +static bool is_iomap(unsigned long flags)
> +{
> + return flags & KVM_S2PTE_FLAG_IS_IOMAP;
> +}
> +
>  static bool memslot_is_logging(struct kvm_memory_slot *memslot)
>  {
>   return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
> @@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>  
>   vma_pagesize = vma_kernel_pagesize(vma);
>   if (logging_active ||
> + (vma->vm_flags & VM_PFNMAP) ||
>   !fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) {
>   force_pte = true;
>   vma_pagesize = PAGE_SIZE;
> @@ -1760,6 +1766,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   writable = false;
>   }
>  
> + if (exec_fault && is_iomap(flags))
> + return -ENOEXEC;
> +
>   spin_lock(&kvm->mmu_lock);
>   if (mmu_notifier_retry(kvm, mmu_seq))
>   goto out_unlock;
> @@ -1781,7 +1790,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (writable)
>   kvm_set_pfn_dirty(pfn);
>  
> - if (fault_status != FSC_PERM)
> + if (fault_status != FSC_PERM && !is_iomap(flags))
>   clean_dcache_guest_page(pfn, vma_pagesize);
>  
>   if (exec_fault)
> @@ -1948,9 +1957,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>   if (is_iabt) {
>   /* Prefetch Abort on I/O address */
> - kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
> - ret = 1;
> - goto out_unlock;
> + ret = -ENOEXEC;
> + goto out;
>   }
>  
>   /*
> @@ -1992,6 +2000,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
>   if (ret == 0)
>   ret = 1;
> +out:
> + if (ret == -ENOEXEC) {
> + kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
> + ret = 1;
> + }
>  out_unlock:
>   srcu_read_unlock(&vcpu->kvm->srcu, idx);
>   return ret;

I've tested this patch using the scenario that you described in the commit
message. I've also tested it with the following scenarios:

- The region is mmap'ed as backed by a VFIO device fd and the memslot is
  created, it is munmap'ed, then mmap'ed as anonymous.
- The region is mmap'ed as anonymous and the memslot is created, it is
  munmap'ed, then mmap'ed as backed by a VFIO device fd.

Everything worked as expected, so:

Tested-by: Alexandru Elisei 
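
For reference, the first scenario above boils down to roughly this
userspace sequence (an outline only, with made-up fds, sizes and slot
numbers, and assuming the usual <sys/mman.h>/<linux/kvm.h> setup; not the
actual test code):

    /* VFIO BAR mapped into userspace, then exposed to the guest. */
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                      vfio_region_fd, 0);

    struct kvm_userspace_memory_region region = {
            .slot            = 1,
            .guest_phys_addr = gpa,
            .memory_size     = size,
            .userspace_addr  = (unsigned long)addr,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

    /* Tear the mapping down: the MMU notifier unmaps it from Stage-2. */
    munmap(addr, size);

    /* Map something else back at the same address; the guest now
     * demand-faults it in through user_mem_abort(). */
    mmap(addr, size, PROT_READ | PROT_WRITE,
         MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);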

Just a nitpick, but stage2_set_pte has a local variable iomap which can be
replaced with a call to is_iomap.

Thanks,
Alex


[PATCH 1/3] KVM: arm/arm64: Properly handle faulting of device mappings

2019-12-11 Thread Marc Zyngier
A device mapping is normally always mapped at Stage-2, since there
is very little gain in having it faulted in.

Nonetheless, it is possible to end-up in a situation where the device
mapping has been removed from Stage-2 (userspace munmaped the VFIO
region, and the MMU notifier did its job), but present in a userspace
mapping (userspace has mapped it back at the same address). In such
a situation, the device mapping will be demand-paged as the guest
performs memory accesses.

This requires to be careful when dealing with mapping size, cache
management, and to handle potential execution of a device mapping.

Cc: sta...@vger.kernel.org
Reported-by: Alexandru Elisei 
Signed-off-by: Marc Zyngier 
---
 virt/kvm/arm/mmu.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a48994af70b8..0b32a904a1bb 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -38,6 +38,11 @@ static unsigned long io_map_base;
 #define KVM_S2PTE_FLAG_IS_IOMAP	(1UL << 0)
 #define KVM_S2_FLAG_LOGGING_ACTIVE	(1UL << 1)
 
+static bool is_iomap(unsigned long flags)
+{
+   return flags & KVM_S2PTE_FLAG_IS_IOMAP;
+}
+
 static bool memslot_is_logging(struct kvm_memory_slot *memslot)
 {
return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
@@ -1698,6 +1703,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
vma_pagesize = vma_kernel_pagesize(vma);
if (logging_active ||
+   (vma->vm_flags & VM_PFNMAP) ||
!fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) {
force_pte = true;
vma_pagesize = PAGE_SIZE;
@@ -1760,6 +1766,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
writable = false;
}
 
+   if (exec_fault && is_iomap(flags))
+   return -ENOEXEC;
+
	spin_lock(&kvm->mmu_lock);
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
@@ -1781,7 +1790,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (writable)
kvm_set_pfn_dirty(pfn);
 
-   if (fault_status != FSC_PERM)
+   if (fault_status != FSC_PERM && !is_iomap(flags))
clean_dcache_guest_page(pfn, vma_pagesize);
 
if (exec_fault)
@@ -1948,9 +1957,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
if (is_iabt) {
/* Prefetch Abort on I/O address */
-   kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
-   ret = 1;
-   goto out_unlock;
+   ret = -ENOEXEC;
+   goto out;
}
 
/*
@@ -1992,6 +2000,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);
if (ret == 0)
ret = 1;
+out:
+   if (ret == -ENOEXEC) {
+   kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu));
+   ret = 1;
+   }
 out_unlock:
	srcu_read_unlock(&vcpu->kvm->srcu, idx);
return ret;
-- 
2.20.1
