Re: [Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit MMIO mapping order to 4k

2018-10-25 Thread Paul Durrant
> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-boun...@lists.xenproject.org] On Behalf
> Of Jan Beulich
> Sent: 25 October 2018 15:28
> To: Paul Durrant
> Cc: Andrew Cooper; Wei Liu; xen-devel
> Subject: Re: [Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit
> MMIO mapping order to 4k
> 
> >>> On 17.10.18 at 16:24, Paul Durrant wrote:
> > --- a/xen/arch/x86/mm/p2m.c
> > +++ b/xen/arch/x86/mm/p2m.c
> > @@ -2081,14 +2081,11 @@ static unsigned int mmio_order(const struct domain *d,
> > unsigned long start_fn, unsigned long nr)
> >  {
> >  /*
> > - * Note that the !iommu_use_hap_pt() here has three effects:
> > - * - cover iommu_{,un}map_page() not having an "order" input yet,
> > - * - exclude shadow mode (which doesn't support large MMIO mappings),
> > - * - exclude PV guests, should execution reach this code for such.
> > - * So be careful when altering this.
> > + * PV guests or shadow-mode HVM guests must be restricted to 4k
> > + * mappings.
> 
> Since you've already posted a patch to add order parameters to
> IOMMU map/unmap, I'd prefer the respective part of the comment
> to go away only when the order value really can be passed through.
> This then requires permitting non-zero values only when the IOMMUs
> also allow for the respective page sizes.
> 

Ok. I can do that. In that case I'll have to provide some way of querying the
IOMMU implementation for the set of orders it supports.
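Something along these lines, perhaps (a rough sketch only; the
'supported_orders' field and 'iommu_supports_order()' helper are made-up
names, not existing Xen API):

/*
 * Sketch: let each IOMMU implementation advertise the page orders it
 * can map as a bitmask in its ops structure, so callers can query the
 * mask before picking a mapping order. Illustrative names throughout.
 */
#include <stdbool.h>

struct iommu_ops {
    unsigned int supported_orders;  /* bit n set => order n supported */
    /* map/unmap and other hooks would live here */
};

static bool iommu_supports_order(const struct iommu_ops *ops,
                                 unsigned int order)
{
    return ops->supported_orders & (1u << order);
}

mmio_order() could then refuse to return any order whose bit is clear,
falling back to the next smaller supported one.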

> I am in particular not convinced that the time needed to carry out
> the hypercall is going to be low enough even for 512 4k pages: You
> need to take into account flushes, including those potentially
> needed for ATS. You don't provide any proof that flushes are
> always delayed and batched, nor do I think this is uniformly the
> case.
> 

In testing, doing the 512-iteration 4k loop does not seem to cause issues with
either VT-d or AMD IOMMU. The AMD IOMMU case is still significantly slower
than VT-d, though, so I'm currently experimenting with eliding the flushes, as
well as removing the page-merging code, to speed it up.
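For reference, the two behaviours being compared look roughly like this (a
sketch only; map_one() and flush_range() are stand-ins for the real
per-vendor map and IOTLB/ATS flush hooks, not actual Xen functions):

/* Per-page flushing vs. a single batched flush for a 512-page (2M) run. */
#define NR_4K_IN_2M 512

void map_one(unsigned long gfn, unsigned long mfn);    /* stand-in */
void flush_range(unsigned long gfn, unsigned int nr);  /* stand-in */

static void map_unbatched(unsigned long gfn, unsigned long mfn)
{
    unsigned int i;

    for ( i = 0; i < NR_4K_IN_2M; i++ )
    {
        map_one(gfn + i, mfn + i);
        flush_range(gfn + i, 1);   /* IOTLB (and ATS) flush per page */
    }
}

static void map_batched(unsigned long gfn, unsigned long mfn)
{
    unsigned int i;

    for ( i = 0; i < NR_4K_IN_2M; i++ )
        map_one(gfn + i, mfn + i);

    flush_range(gfn, NR_4K_IN_2M); /* one flush for the whole run */
}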

> > @@ -2096,8 +2093,12 @@ static unsigned int mmio_order(const struct domain *d,
> >  * set_typed_p2m_entry() when it needs to zap M2P entries
> >  * for a RAM range.
> >  */ &&
> > - !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) && (nr >> PAGE_ORDER_1G) &&
> > - hap_has_1gb )
> > + !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) &&
> > + (nr >> PAGE_ORDER_1G) &&
> > + hap_has_1gb &&
> > + /* disable 1G mappings if we need to keep the IOMMU in sync */
> > + !need_iommu_pt_sync(d)
> 
> I have to admit that I don't understand this v2 addition. How is
> keeping-in-sync relevant for 1G but not 2M pages?
> 

I was attempting to make it obvious that, if the IOMMU mappings need syncing
(i.e. they are not shared), we would not handle a 1G mapping and would thus
fall back to iterating over the 1G range in 4k chunks.
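I.e. the caller ends up doing something like the following (simplified from
the map_mmio_regions() pattern; set_entry() is a stand-in for the real p2m
setter, and the declarations are only here to keep the sketch
self-contained):

struct domain;

unsigned int mmio_order(const struct domain *d, unsigned long start_fn,
                        unsigned long nr);
void set_entry(struct domain *d, unsigned long gfn, unsigned long mfn,
               unsigned int order);                    /* stand-in */

/* When mmio_order() can never return more than PAGE_ORDER_4K, a 1G BAR
 * is mapped as 262144 individual 4k entries by this loop. */
static void map_range(struct domain *d, unsigned long start_fn,
                      unsigned long mfn, unsigned long nr)
{
    unsigned long i;
    unsigned int order;

    for ( i = 0; i < nr; i += 1UL << order )
    {
        order = mmio_order(d, start_fn + i, nr - i);
        set_entry(d, start_fn + i, mfn + i, order);
    }
}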

> > +)
> 
> Misplaced paren.

True.

  Paul

> 
> Jan
> 

Re: [Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit MMIO mapping order to 4k

2018-10-25 Thread Jan Beulich
>>> On 25.10.18 at 16:36, Andrew Cooper wrote:
> On 25/10/18 15:28, Jan Beulich wrote:
> On 17.10.18 at 16:24, Paul Durrant wrote:
>>> --- a/xen/arch/x86/mm/p2m.c
>>> +++ b/xen/arch/x86/mm/p2m.c
>>> @@ -2081,14 +2081,11 @@ static unsigned int mmio_order(const struct domain *d,
>>> unsigned long start_fn, unsigned long nr)
>>>  {
>>>  /*
>>> - * Note that the !iommu_use_hap_pt() here has three effects:
>>> - * - cover iommu_{,un}map_page() not having an "order" input yet,
>>> - * - exclude shadow mode (which doesn't support large MMIO mappings),
>>> - * - exclude PV guests, should execution reach this code for such.
>>> - * So be careful when altering this.
>>> + * PV guests or shadow-mode HVM guests must be restricted to 4k
>>> + * mappings.
>> Since you've already posted a patch to add order parameters to
>> IOMMU map/unmap, I'd prefer the respective part of the comment
>> to go away only when the order value really can be passed through.
>> This then requires permitting non-zero values only when the IOMMUs
>> also allow for the respective page sizes.
>>
>> I am in particular not convinced that the time needed to carry out
>> the hypercall is going to be low enough even for 512 4k pages: You
>> need to take into account flushes, including those potentially
>> needed for ATS. You don't provide any proof that flushes are
>> always delayed and batched, nor do I think this is uniformly the
>> case.
> 
> I haven't had time to pick this back up since v1.
> 
> The long and the short of it is that we allow order 1G loops for regular
> RAM, even in !shared_pt mode.

Do we? CONFIG_DOMU_MAX_ORDER is 9 for both ARM and x86.
That still allows 2M loops which, as said, I'm worried about given
the time non-batched TLB flushes take.
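For reference, the arithmetic (my numbers, assuming a 4k base page):

/* order 9:  1UL << 9  ==    512 frames == 2 MiB  (the CONFIG_DOMU_MAX_ORDER cap)
 * order 18: 1UL << 18 == 262144 frames == 1 GiB  (a full 1G mapping) */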

> From an "interaction with the IOMMU" point of view, mappings over
> regular RAM are no different to mappings over MMIO, so they should
> behave consistently.

I agree in general.

Jan




Re: [Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit MMIO mapping order to 4k

2018-10-25 Thread Andrew Cooper
On 25/10/18 15:28, Jan Beulich wrote:
>>> On 17.10.18 at 16:24, Paul Durrant wrote:
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -2081,14 +2081,11 @@ static unsigned int mmio_order(const struct domain *d,
>> unsigned long start_fn, unsigned long nr)
>>  {
>>  /*
>> - * Note that the !iommu_use_hap_pt() here has three effects:
>> - * - cover iommu_{,un}map_page() not having an "order" input yet,
>> - * - exclude shadow mode (which doesn't support large MMIO mappings),
>> - * - exclude PV guests, should execution reach this code for such.
>> - * So be careful when altering this.
>> + * PV guests or shadow-mode HVM guests must be restricted to 4k
>> + * mappings.
> Since you've already posted a patch to add order parameters to
> IOMMU map/unmap, I'd prefer the respective part of the comment
> to go away only when the order value really can be passed through.
> This then requires permitting non-zero values only when the IOMMUs
> also allow for the respective page sizes.
>
> I am in particular not convinced that the time needed to carry out
> the hypercall is going to be low enough even for 512 4k pages: You
> need to take into account flushes, including those potentially
> needed for ATS. You don't provide any proof that flushes are
> always delayed and batched, nor do I think this is uniformly the
> case.

I haven't had time to pick this back up since v1.

The long and the short of it is that we allow order 1G loops for regular
RAM, even in !shared_pt mode.

From an "interaction with the IOMMU" point of view, mappings over
regular RAM are no different to mappings over MMIO, so they should
behave consistently.

~Andrew


Re: [Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit MMIO mapping order to 4k

2018-10-25 Thread Jan Beulich
>>> On 17.10.18 at 16:24, Paul Durrant wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -2081,14 +2081,11 @@ static unsigned int mmio_order(const struct domain *d,
> unsigned long start_fn, unsigned long nr)
>  {
>  /*
> - * Note that the !iommu_use_hap_pt() here has three effects:
> - * - cover iommu_{,un}map_page() not having an "order" input yet,
> - * - exclude shadow mode (which doesn't support large MMIO mappings),
> - * - exclude PV guests, should execution reach this code for such.
> - * So be careful when altering this.
> + * PV guests or shadow-mode HVM guests must be restricted to 4k
> + * mappings.

Since you've already posted a patch to add order parameters to
IOMMU map/unmap, I'd prefer the respective part of the comment
to go away only when the order value really can be passed through.
This then requires permitting non-zero values only when the IOMMUs
also allow for the respective page sizes.

I am in particular not convinced that the time needed to carry out
the hypercall is going to be low enough even for 512 4k pages: You
need to take into account flushes, including those potentially
needed for ATS. You don't provide any proof that flushes are
always delayed and batched, nor do I think this is uniformly the
case.

> @@ -2096,8 +2093,12 @@ static unsigned int mmio_order(const struct domain *d,
>  * set_typed_p2m_entry() when it needs to zap M2P entries
>  * for a RAM range.
>  */ &&
> - !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) && (nr >> PAGE_ORDER_1G) &&
> - hap_has_1gb )
> + !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) &&
> + (nr >> PAGE_ORDER_1G) &&
> + hap_has_1gb &&
> + /* disable 1G mappings if we need to keep the IOMMU in sync */
> + !need_iommu_pt_sync(d)

I have to admit that I don't understand this v2 addition. How is
keeping-in-sync relevant for 1G but not 2M pages?

> +)

Misplaced paren.

Jan




[Xen-devel] [PATCH v2] x86/mm/p2m: don't needlessly limit MMIO mapping order to 4k

2018-10-17 Thread Paul Durrant
The P2M common code currently restricts the MMIO mapping order of any
domain that has IOMMU mappings, but is not using shared tables, to 4k.
This has been shown to have a huge performance cost when passing through
a PCI device with a very large BAR (e.g. NVIDIA P40), increasing the guest
boot time from ~20s to several minutes when iommu=no-sharept is specified
on the Xen command line.

The limitation was added by commit c3c756bd "x86/p2m: use large pages
for MMIO mappings". However, the underlying implementations of p2m->set_entry
for Intel and AMD are coded to cope with mapping orders above 4k, even
though the IOMMU mapping function is itself currently limited to 4k, so
there is no real need to limit the order passed into the method, other
than to avoid a potential DoS caused by a long-running hypercall.

In practice, mmio_order() already strictly disallows 1G mappings since the
if clause in question starts with:

if ( 0 /*
* Don't use 1Gb pages, to limit the iteration count in
* set_typed_p2m_entry() when it needs to zap M2P entries
* for a RAM range.
*/ &&

With this patch applied (and hence 2M mappings in use) the VM boot time is
restored to something reasonable. Not as fast as without iommu=no-sharept,
but within a few seconds of it.

NOTE: This patch takes the opportunity to shorten a couple of > 80
  character lines in the code.

Signed-off-by: Paul Durrant 
Acked-by: George Dunlap 
---
Cc: Jan Beulich 
Cc: Andrew Cooper 
Cc: Wei Liu 

v2:
 - Add an extra condition to the if clause disallowing 1G mappings, to
   make sure they are not used if need_iommu_pt_sync() is true, even
   though the check is currently moot (see main comment).
---
 xen/arch/x86/mm/p2m.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index a00a3c1bff..f972b4819d 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -2081,14 +2081,11 @@ static unsigned int mmio_order(const struct domain *d,
unsigned long start_fn, unsigned long nr)
 {
 /*
- * Note that the !iommu_use_hap_pt() here has three effects:
- * - cover iommu_{,un}map_page() not having an "order" input yet,
- * - exclude shadow mode (which doesn't support large MMIO mappings),
- * - exclude PV guests, should execution reach this code for such.
- * So be careful when altering this.
+ * PV guests or shadow-mode HVM guests must be restricted to 4k
+ * mappings.
  */
-if ( !iommu_use_hap_pt(d) ||
- (start_fn & ((1UL << PAGE_ORDER_2M) - 1)) || !(nr >> PAGE_ORDER_2M) )
+if ( !hap_enabled(d) || (start_fn & ((1UL << PAGE_ORDER_2M) - 1)) ||
+ !(nr >> PAGE_ORDER_2M) )
 return PAGE_ORDER_4K;
 
 if ( 0 /*
@@ -2096,8 +2093,12 @@ static unsigned int mmio_order(const struct domain *d,
 * set_typed_p2m_entry() when it needs to zap M2P entries
 * for a RAM range.
 */ &&
- !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) && (nr >> PAGE_ORDER_1G) &&
- hap_has_1gb )
+ !(start_fn & ((1UL << PAGE_ORDER_1G) - 1)) &&
+ (nr >> PAGE_ORDER_1G) &&
+ hap_has_1gb &&
+ /* disable 1G mappings if we need to keep the IOMMU in sync */
+ !need_iommu_pt_sync(d)
+)
 return PAGE_ORDER_1G;
 
 if ( hap_has_2mb )
-- 
2.11.0

