Re: [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-10-24 Thread Robin Murphy

On 2018-10-24 7:44 pm, Auger Eric wrote:
> Hi Robin,
>
> On 10/24/18 8:02 PM, Robin Murphy wrote:
>> Hi Eric,
>>
>> On 2018-09-18 3:24 pm, Eric Auger wrote:
>>> Up to now, when the type was UNMANAGED, we used to
>>> allocate IOVA pages within a range provided by the user.
>>> This does not work in nested mode.
>>>
>>> If both the host and the guest are exposed with SMMUs, each
>>> would allocate an IOVA. The guest allocates an IOVA (gIOVA)
>>> to map onto the guest MSI doorbell (gDB). The host allocates
>>> another IOVA (hIOVA) to map onto the physical doorbell (hDB).
>>>
>>> So we end up with 2 unrelated mappings, at S1 and S2:
>>>             S1             S2
>>> gIOVA   ->  gDB
>>>                 hIOVA  ->  hDB
>>>
>>> The PCI device would be programmed with hIOVA.
>>>
>>> iommu_dma_bind_doorbell allows gIOVA/gDB to be passed to the host
>>> so that gIOVA can be used by the host instead of re-allocating
>>> a new IOVA. That way the host can create the following nested
>>> mapping:
>>>
>>>             S1             S2
>>> gIOVA   ->  gDB    ->      hDB
>>>
>>> This time, the PCI device will be programmed with the gIOVA MSI
>>> doorbell, which is correctly mapped through the 2 stages.
>>
>> If I'm understanding things correctly, this plus a couple of the
>> preceding patches all add up to a rather involved way of coercing an
>> automatic allocator to only "allocate" predetermined addresses in an
>> entirely known-ahead-of-time manner.
>
> agreed
>
>> Given that the guy calling
>> iommu_dma_bind_doorbell() could seemingly just as easily call
>> iommu_map() at that point and not bother with an allocator cookie and
>> all this machinery at all, what am I missing?
>
> Well, iommu_dma_map_msi_msg() gets called and is part of this existing
> MSI mapping machinery. If we do not do anything, this function allocates
> an hIOVA that is not involved in any nested setup. So either we coerce
> the allocator in place (which is what this series does), or we unplug the
> allocator and replace it with a simple S2 mapping, as you
> suggest, i.e. iommu_map(gDB, hDB). Assuming we unplug the allocator, the
> guy who actually calls iommu_dma_bind_doorbell() knows gDB but does not
> know hDB. So I don't really get how we can simplify things.


OK, there's what I was missing :D

But that then seems to reveal a somewhat bigger problem - if the callers 
are simply registering IPAs, and relying on the ITS driver to grab an 
entry and fill in a PA later, then how does either one know *which* PA 
is supposed to belong to a given IPA in the case where you have multiple 
devices with different ITS targets assigned to the same guest? (and if 
it's possible to assume a guest will use per-device stage 1 mappings and 
present it with a single vITS backed by multiple pITSes, I think things 
start breaking even harder.)


Other than allowing arbitrary disjoint IOVA pages, I'm not sure this 
really works any differently from the existing MSI cookie now that I 
look more closely :/


Robin.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

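A minimal sketch of the iommu_map(gDB, hDB) alternative mentioned above, assuming some host-side caller knows both the guest doorbell IPA and the physical doorbell address (which, as Eric notes, the binding caller does not); the helper name is invented and this is not code from the series:

/*
 * Stage 1 (owned by the guest) already maps gIOVA -> gDB. Only the
 * stage-2 leg, gDB -> hDB, needs to be installed so that the
 * device-visible address gIOVA resolves through both stages.
 */
static int nested_msi_map_s2(struct iommu_domain *s2_domain,
			     dma_addr_t gdb,   /* guest doorbell IPA */
			     phys_addr_t hdb,  /* physical doorbell address */
			     size_t granule)
{
	return iommu_map(s2_domain, gdb, hdb, granule,
			 IOMMU_WRITE | IOMMU_MMIO);
}
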

Re: [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-10-24 Thread Auger Eric
Hi Robin,

On 10/24/18 8:02 PM, Robin Murphy wrote:
> Hi Eric,
> 
> On 2018-09-18 3:24 pm, Eric Auger wrote:
>> Up to now, when the type was UNMANAGED, we used to
>> allocate IOVA pages within a range provided by the user.
>> This does not work in nested mode.
>>
>> If both the host and the guest are exposed with SMMUs, each
>> would allocate an IOVA. The guest allocates an IOVA (gIOVA)
>> to map onto the guest MSI doorbell (gDB). The host allocates
>> another IOVA (hIOVA) to map onto the physical doorbell (hDB).
>>
>> So we end up with 2 unrelated mappings, at S1 and S2:
>>             S1             S2
>> gIOVA   ->  gDB
>>                 hIOVA  ->  hDB
>>
>> The PCI device would be programmed with hIOVA.
>>
>> iommu_dma_bind_doorbell allows gIOVA/gDB to be passed to the host
>> so that gIOVA can be used by the host instead of re-allocating
>> a new IOVA. That way the host can create the following nested
>> mapping:
>>
>>             S1             S2
>> gIOVA   ->  gDB    ->      hDB
>>
>> This time, the PCI device will be programmed with the gIOVA MSI
>> doorbell, which is correctly mapped through the 2 stages.
> 
> If I'm understanding things correctly, this plus a couple of the
> preceding patches all add up to a rather involved way of coercing an
> automatic allocator to only "allocate" predetermined addresses in an
> entirely known-ahead-of-time manner.
agreed
> Given that the guy calling
> iommu_dma_bind_doorbell() could seemingly just as easily call
> iommu_map() at that point and not bother with an allocator cookie and
> all this machinery at all, what am I missing?
Well, iommu_dma_map_msi_msg() gets called and is part of this existing
MSI mapping machinery. If we do not do anything, this function allocates
an hIOVA that is not involved in any nested setup. So either we coerce
the allocator in place (which is what this series does), or we unplug the
allocator and replace it with a simple S2 mapping, as you
suggest, i.e. iommu_map(gDB, hDB). Assuming we unplug the allocator, the
guy who actually calls iommu_dma_bind_doorbell() knows gDB but does not
know hDB. So I don't really get how we can simplify things.

Thanks

Eric

> 
> Robin.
> 
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> v1 -> v2:
>> - unmap stage2 on put()
>> ---
>>   drivers/iommu/dma-iommu.c | 97 +--
>>   include/linux/dma-iommu.h | 11 +
>>   2 files changed, 105 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 511ff9a1d6d9..53444c3e8f2f 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -37,12 +37,14 @@
>>   struct iommu_dma_msi_page {
>>   struct list_head    list;
>>   dma_addr_t    iova;
>> +    dma_addr_t    ipa;
>>   phys_addr_t    phys;
>>   };
>>     enum iommu_dma_cookie_type {
>>   IOMMU_DMA_IOVA_COOKIE,
>>   IOMMU_DMA_MSI_COOKIE,
>> +    IOMMU_DMA_NESTED_MSI_COOKIE,
>>   };
>>     struct iommu_dma_cookie {
>> @@ -109,14 +111,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
>>    *
>>    * Users who manage their own IOVA allocation and do not want DMA
>> API support,
>>    * but would still like to take advantage of automatic MSI
>> remapping, can use
>> - * this to initialise their own domain appropriately. Users should
>> reserve a
>> + * this to initialise their own domain appropriately. Users may
>> reserve a
>>    * contiguous IOVA region, starting at @base, large enough to
>> accommodate the
>>    * number of PAGE_SIZE mappings necessary to cover every MSI
>> doorbell address
>> - * used by the devices attached to @domain.
>> + * used by the devices attached to @domain. The other way round is to
>> provide
>> + * usable iova pages through the iommu_dma_bind_doorbell API (nested
>> stages
>> + * use case)
>>    */
>>   int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
>>   {
>>   struct iommu_dma_cookie *cookie;
>> +    int nesting, ret;
>>     if (domain->type != IOMMU_DOMAIN_UNMANAGED)
>>   return -EINVAL;
>> @@ -124,7 +129,12 @@ int iommu_get_msi_cookie(struct iommu_domain
>> *domain, dma_addr_t base)
>>   if (domain->iova_cookie)
>>   return -EEXIST;
>>   -    cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
>> +    ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
>> +    if (!ret && nesting)
>> +    cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
>> +    else
>> +    cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
>> +
>>   if (!cookie)
>>   return -ENOMEM;
>>   @@ -145,6 +155,7 @@ void iommu_put_dma_cookie(struct iommu_domain
>> *domain)
>>   {
>>   struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>   struct iommu_dma_msi_page *msi, *tmp;
>> +    bool s2_unmap = false;
>>     if (!cookie)
>>   return;
>> @@ -152,7 +163,15 @@ void iommu_put_dma_cookie(struct iommu_domain
>> *domain)
>>   if (cookie->type == IOMMU_DMA_IOVA_COOKIE && 

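The iommu_get_msi_cookie() hunk quoted above only selects the NESTED_MSI cookie when the nesting attribute is already set on the domain. A rough sketch of the caller-side ordering that implies, with the VFIO plumbing and error handling omitted (illustrative only, not code from the series):

/* The nesting attribute must be set before the cookie is allocated. */
static int setup_nested_msi_cookie(struct iommu_domain *domain,
				   dma_addr_t msi_base)
{
	int nesting = 1;
	int ret;

	ret = iommu_domain_set_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
	if (ret)
		return ret;

	/* Now picks IOMMU_DMA_NESTED_MSI_COOKIE per the patch above. */
	return iommu_get_msi_cookie(domain, msi_base);
}
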
Re: [RFC v2 12/20] dma-iommu: Implement NESTED_MSI cookie

2018-10-24 Thread Robin Murphy

Hi Eric,

On 2018-09-18 3:24 pm, Eric Auger wrote:

Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a range provided by the user.
This does not work in nested mode.

If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The host allocates
another IOVA (hIOVA) to map onto the physical doorbell (hDB).

So we end up with 2 unrelated mappings, at S1 and S2:
            S1             S2
gIOVA   ->  gDB
                hIOVA  ->  hDB

The PCI device would be programmed with hIOVA.

iommu_dma_bind_doorbell allows gIOVA/gDB to be passed to the host
so that gIOVA can be used by the host instead of re-allocating
a new IOVA. That way the host can create the following nested
mapping:

            S1             S2
gIOVA   ->  gDB    ->      hDB

This time, the PCI device will be programmed with the gIOVA MSI
doorbell, which is correctly mapped through the 2 stages.


If I'm understanding things correctly, this plus a couple of the 
preceding patches all add up to a rather involved way of coercing an 
automatic allocator to only "allocate" predetermined addresses in an 
entirely known-ahead-of-time manner. Given that the guy calling 
iommu_dma_bind_doorbell() could seemingly just as easily call 
iommu_map() at that point and not bother with an allocator cookie and 
all this machinery at all, what am I missing?


Robin.



Signed-off-by: Eric Auger 

---

v1 -> v2:
- unmap stage2 on put()
---
  drivers/iommu/dma-iommu.c | 97 +--
  include/linux/dma-iommu.h | 11 +
  2 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 511ff9a1d6d9..53444c3e8f2f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -37,12 +37,14 @@
  struct iommu_dma_msi_page {
	struct list_head	list;
dma_addr_t  iova;
+   dma_addr_t  ipa;
phys_addr_t phys;
  };
  
  enum iommu_dma_cookie_type {

IOMMU_DMA_IOVA_COOKIE,
IOMMU_DMA_MSI_COOKIE,
+   IOMMU_DMA_NESTED_MSI_COOKIE,
  };
  
  struct iommu_dma_cookie {

@@ -109,14 +111,17 @@ EXPORT_SYMBOL(iommu_get_dma_cookie);
   *
   * Users who manage their own IOVA allocation and do not want DMA API support,
   * but would still like to take advantage of automatic MSI remapping, can use
- * this to initialise their own domain appropriately. Users should reserve a
+ * this to initialise their own domain appropriately. Users may reserve a
   * contiguous IOVA region, starting at @base, large enough to accommodate the
   * number of PAGE_SIZE mappings necessary to cover every MSI doorbell address
- * used by the devices attached to @domain.
+ * used by the devices attached to @domain. The other way round is to provide
+ * usable iova pages through the iommu_dma_bind_doorbell API (nested stages
+ * use case)
   */
  int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
  {
struct iommu_dma_cookie *cookie;
+   int nesting, ret;
  
  	if (domain->type != IOMMU_DOMAIN_UNMANAGED)

return -EINVAL;
@@ -124,7 +129,12 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, 
dma_addr_t base)
if (domain->iova_cookie)
return -EEXIST;
  
-	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);

+   ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
+   if (!ret && nesting)
+   cookie = cookie_alloc(IOMMU_DMA_NESTED_MSI_COOKIE);
+   else
+   cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
+
if (!cookie)
return -ENOMEM;
  
@@ -145,6 +155,7 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)

  {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+   bool s2_unmap = false;
  
  	if (!cookie)

return;
@@ -152,7 +163,15 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
		put_iova_domain(&cookie->iovad);
  
+	if (cookie->type == IOMMU_DMA_NESTED_MSI_COOKIE)

+   s2_unmap = true;
+
	list_for_each_entry_safe(msi, tmp, &cookie->msi_page_list, list) {
+   if (s2_unmap && msi->phys) {
+   size_t size = cookie_msi_granule(cookie);
+
+   WARN_ON(iommu_unmap(domain, msi->ipa, size) != size);
+   }
		list_del(&msi->list);
kfree(msi);
}
@@ -161,6 +180,50 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
  }
  EXPORT_SYMBOL(iommu_put_dma_cookie);
  
+/**

+ * iommu_dma_bind_doorbell - Allows to provide a usable IOVA page
+ * @domain: domain handle
+ * @binding: IOVA/IPA binding
+ *
+ * In nested stage use case, the user can provide IOVA/IPA bindings
+ * corresponding to a guest MSI 

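Building on the struct changes quoted above (the new ipa field alongside iova and phys), here is a speculative sketch of how the MSI composition path could consume a bound gIOVA/gDB page for a NESTED_MSI cookie, reusing the guest-chosen addresses instead of allocating a fresh hIOVA. The helper name is invented, locking is omitted, and this is not code from the series:

static struct iommu_dma_msi_page *
nested_msi_get_page(struct iommu_domain *domain, phys_addr_t msi_addr)
{
	struct iommu_dma_cookie *cookie = domain->iova_cookie;
	struct iommu_dma_msi_page *msi;
	size_t size = cookie_msi_granule(cookie);

	list_for_each_entry(msi, &cookie->msi_page_list, list) {
		if (msi->phys)	/* already wired to a physical doorbell */
			continue;
		/* Stage 2: gDB (msi->ipa) -> hDB (msi_addr) */
		if (iommu_map(domain, msi->ipa, msi_addr, size,
			      IOMMU_WRITE | IOMMU_MMIO))
			return NULL;
		msi->phys = msi_addr;
		/* The caller programs the device with msi->iova (the gIOVA). */
		return msi;
	}
	return NULL;
}
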
Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 9e68a02a52b1..2fd163cff406 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > > pmd_t *old_pmd,
> > >   drop_rmap_locks(vma);
> > >  }
> > >  
> > > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > > old_addr,
> > > +   unsigned long new_addr, unsigned long old_end,
> > > +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > +
> > > + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > > + || old_end - old_addr < PMD_SIZE)
> > > + return false;
> > > +
> > > + /*
> > > +  * The destination pmd shouldn't be established, free_pgtables()
> > > +  * should have release it.
> > > +  */
> > > + if (WARN_ON(!pmd_none(*new_pmd)))
> > > + return false;
> > > +
> > > + /*
> > > +  * We don't have to worry about the ordering of src and dst
> > > +  * ptlocks because exclusive mmap_sem prevents deadlock.
> > > +  */
> > > + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > > + if (old_ptl) {
> > 
> > How can it ever be false?
> > 
> > > + pmd_t pmd;
> > > +
> > > + new_ptl = pmd_lockptr(mm, new_pmd);
> 
> 
> Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
> the code applies, why not just reuse as much as possible? The same comments
> w.r.t. mmap_sem helping protect against lock order issues apply as well.

pmd_lock() cannot fail, but __pmd_trans_huge_lock() can. We should not
copy the code blindly.

-- 
 Kirill A. Shutemov
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
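To make the point above concrete: pmd_lock() always returns a valid spinlock, so the "if (old_ptl)" guard in the posted patch can never take its false branch and the body can be written straight-line. A sketch of that shape, assuming the same preconditions move_normal_pmd() already checks (illustrative, not the patch itself):

static void move_pmd_entry(struct mm_struct *mm, unsigned long new_addr,
			   pmd_t *old_pmd, pmd_t *new_pmd)
{
	spinlock_t *old_ptl, *new_ptl;
	pmd_t pmd;

	old_ptl = pmd_lock(mm, old_pmd);	/* cannot fail */
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	/* Clear the source entry and install it at the destination. */
	pmd = *old_pmd;
	pmd_clear(old_pmd);
	VM_BUG_ON(!pmd_none(*new_pmd));
	set_pmd_at(mm, new_addr, new_pmd, pmd);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);
}
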


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Balbir Singh
On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 9e68a02a52b1..2fd163cff406 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, 
> > pmd_t *old_pmd,
> > drop_rmap_locks(vma);
> >  }
> >  
> > +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> > old_addr,
> > + unsigned long new_addr, unsigned long old_end,
> > + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> > +{
> > +   spinlock_t *old_ptl, *new_ptl;
> > +   struct mm_struct *mm = vma->vm_mm;
> > +
> > +   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> > +   || old_end - old_addr < PMD_SIZE)
> > +   return false;
> > +
> > +   /*
> > +* The destination pmd shouldn't be established, free_pgtables()
> > +* should have release it.
> > +*/
> > +   if (WARN_ON(!pmd_none(*new_pmd)))
> > +   return false;
> > +
> > +   /*
> > +* We don't have to worry about the ordering of src and dst
> > +* ptlocks because exclusive mmap_sem prevents deadlock.
> > +*/
> > +   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> > +   if (old_ptl) {
> 
> How can it ever be false?
> 
> > +   pmd_t pmd;
> > +
> > +   new_ptl = pmd_lockptr(mm, new_pmd);


Looks like this is largely inspired by move_huge_pmd(), I guess a lot of
the code applies, why not just reuse as much as possible? The same comments
w.r.t. mmap_sem helping protect against lock order issues apply as well.

> > +   if (new_ptl != old_ptl)
> > +   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > +   /* Clear the pmd */
> > +   pmd = *old_pmd;
> > +   pmd_clear(old_pmd);
> > +
> > +   VM_BUG_ON(!pmd_none(*new_pmd));
> > +
> > +   /* Set the new pmd */
> > +   set_pmd_at(mm, new_addr, new_pmd, pmd);
> > +   if (new_ptl != old_ptl)
> > +   spin_unlock(new_ptl);
> > +   spin_unlock(old_ptl);
> > +
> > +   *need_flush = true;
> > +   return true;
> > +   }
> > +   return false;
> > +}
> > +
> -- 
>  Kirill A. Shutemov
> 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-24 Thread Kirill A. Shutemov
On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 9e68a02a52b1..2fd163cff406 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
> *old_pmd,
>   drop_rmap_locks(vma);
>  }
>  
> +static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long 
> old_addr,
> +   unsigned long new_addr, unsigned long old_end,
> +   pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
> + || old_end - old_addr < PMD_SIZE)
> + return false;
> +
> + /*
> +  * The destination pmd shouldn't be established, free_pgtables()
> +  * should have release it.
> +  */
> + if (WARN_ON(!pmd_none(*new_pmd)))
> + return false;
> +
> + /*
> +  * We don't have to worry about the ordering of src and dst
> +  * ptlocks because exclusive mmap_sem prevents deadlock.
> +  */
> + old_ptl = pmd_lock(vma->vm_mm, old_pmd);
> + if (old_ptl) {

How can it ever be false?

> + pmd_t pmd;
> +
> + new_ptl = pmd_lockptr(mm, new_pmd);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pmd */
> + pmd = *old_pmd;
> + pmd_clear(old_pmd);
> +
> + VM_BUG_ON(!pmd_none(*new_pmd));
> +
> + /* Set the new pmd */
> + set_pmd_at(mm, new_addr, new_pmd, pmd);
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + *need_flush = true;
> + return true;
> + }
> + return false;
> +}
> +
-- 
 Kirill A. Shutemov
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v5 11/17] arm64: docs: document pointer authentication

2018-10-24 Thread Ramana Radhakrishnan




On 19/10/2018 12:35, Catalin Marinas wrote:

On Tue, Oct 16, 2018 at 05:14:39PM +0100, Kristina Martsenko wrote:

On 05/10/2018 10:04, Ramana Radhakrishnan wrote:

On 05/10/2018 09:47, Kristina Martsenko wrote:

+Virtualization
+--
+
+Pointer authentication is not currently supported in KVM guests. KVM
+will mask the feature bits from ID_AA64ISAR1_EL1, and attempted use of
+the feature will result in an UNDEFINED exception being injected into
+the guest.


However applications using instructions from the hint space will
continue to work albeit without any protection (as they would just be
nops) ?


Mostly, yes. If the guest leaves SCTLR_EL1.EnIA unset (and
EnIB/EnDA/EnDB), then PAC* and AUT* instructions in the HINT space will
execute as NOPs. If the guest sets EnIA, then PAC*/AUT* instructions
will trap and KVM will inject an "Unknown reason" exception into the
guest (which will cause a Linux guest to send a SIGILL to the application).


I think that part is fine. If KVM (a fairly recent version with CPUID
sanitisation) does not enable ptr auth, the CPUID should not advertise
this feature either so the guest kernel should not enable it. For the
above instructions in the HINT space, they will just be NOPs. If the
guest kernel enables the feature regardless of the CPUID information, it
deserves to get an "Unknown reason" exception.


In the latter case we could instead pretend the instruction was a NOP
and not inject an exception, but trapping twice per every function would
probably be terrible for performance. The guest shouldn't be setting
EnIA anyway if ID_AA64ISAR1_EL1 reports that pointer authentication is
not present (because KVM has hidden it).


I don't think we should. The SCTLR_EL1 bits are RES0 unless you know
that the feature is present via CPUID.


The other special case is the XPACLRI instruction, which is also in the
HINT space. Currently it will trap and KVM will inject an exception into
the guest. We should probably change this to NOP instead, as that's what
applications will expect. Unfortunately there is no EnIA-like control to
make it NOP.


Very good catch. Basically if EL2 doesn't know about ptr auth (older
distro), EL1 may or may not know but leaves SCTLR_EL1 disabled (based on
CPUID), the default HCR_EL2 is to trap (I'm ignoring EL3 as that's likely
to have ptr auth enabled, being built for the specific HW). So a user
app considering XPACLRI a NOP (or inoffensive) will get a SIGILL
(injected by the guest kernel following the injection of "Unknown
reason" exception by KVM).

Ramana, is XPACLRI commonly generated by gcc and expected to be a NOP?
Could we restrict it to only being used at run-time if the corresponding
HWCAP is set? This means redefining this instruction as no longer in the
NOP space.


Sorry to have missed this - I'm still catching up on email.

XPACLRI is used in the unwinder in exactly 2 places, not for the
unwinding itself but for storing the actual return address in the data
structures; it's not something I expect to be used very commonly, so a
check there seems reasonable. XPACLRI is considered a NOP in the
architecture as it is defined today, and I don't like the idea of
redefining instructions as no longer being in the NOP space after they
have been defined as such. We could investigate guarding the XPACLRI
with a check on the HWCAP. How many unwinders would you like us to fix?







One option is for KVM to pretend the instruction was a NOP and return to
the guest. But if XPACLRI gets executed frequently, then the constant
trapping might hurt performance. I don't know how frequently it might
get used, as I don't know of any applications currently using it. From
what I understand, it may be used by userspace stack unwinders.


Yep. Probably one instruction per frame being unwound during exception 
unwinding. And no trapping will be expensive even though it's *only* in 
the exception unwind case.




(Also worth noting - as far as I can tell there is no easy way for KVM
to know which pointer authentication instruction caused the trap, so we
may have to do something unusual like use "at s12e1r" to read guest
memory and check for XPACLRI.)


Indeed, it's not an easy fix. As discussed (in the office), we can't
even guarantee that the guest stage 1 translation is stable and points
to the actual XPACLRI instruction.


The other option is to turn off trapping entirely. However then on a
big.LITTLE system with mismatched pointer authentication support
instructions will work intermittently on some CPUs but not others.


That's another case but let's assume we never see such configurations ;).


That's a broken system by design :) !

Ramana



___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
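A userspace-side sketch of the HWCAP guard Ramana suggests above: only execute XPACLRI (HINT #7) when the kernel advertises pointer authentication, so an unwinder never hits the trap-and-SIGILL path on a host or guest that hides the feature. The HWCAP_PACA name and value below are taken from the proposed series and should be treated as assumptions:

#include <stdint.h>
#include <sys/auxv.h>

#ifndef HWCAP_PACA
#define HWCAP_PACA	(1UL << 30)	/* assumed value, per the series */
#endif

/* Strip the PAC from a return address, only if pointer auth is present. */
static uintptr_t strip_pac(uintptr_t ret_addr)
{
	if (!(getauxval(AT_HWCAP) & HWCAP_PACA))
		return ret_addr;	/* no pointer auth: nothing to strip */

	/* XPACLRI strips the PAC from the link register (x30). */
	register uintptr_t lr __asm__("x30") = ret_addr;
	__asm__ volatile("hint #7" : "+r"(lr));	/* XPACLRI */
	return lr;
}
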


Re: [PATCH 1/4] treewide: remove unused address argument from pte_alloc functions (v2)

2018-10-24 Thread Peter Zijlstra
On Fri, Oct 12, 2018 at 06:31:57PM -0700, Joel Fernandes (Google) wrote:
> This series speeds up mremap(2) syscall by copying page tables at the
> PMD level even for non-THP systems. There is concern that the extra
> 'address' argument that mremap passes to pte_alloc may do something
> subtle architecture related in the future that may make the scheme not
> work.  Also we find that there is no point in passing the 'address' to
> pte_alloc since it's unused. So this patch therefore removes this
> argument tree-wide resulting in a nice negative diff as well. Also
> ensuring along the way that the enabled architectures do not do anything
> funky with 'address' argument that goes unnoticed by the optimization.

Did you happen to look at the history of where that address argument
came from? -- just being curious here. ISTR something vague about
architectures having different paging structure for different memory
ranges.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
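
For reference, the shape of the tree-wide cleanup discussed in the last message, shown for __pte_alloc only (illustrative; the exact prototypes vary across the touched architectures and the real patch also updates every caller):

/* Before: callers pass an address that no implementation uses. */
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);

/* After: the unused parameter is dropped tree-wide. */
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);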