Re: [Xen-devel] [PATCH v5 10/10] AMD/IOMMU: restrict interrupt remapping table sizes

2019-09-18 Thread Jan Beulich
On 18.09.2019 12:45, Andrew Cooper wrote:
> On 18/09/2019 09:11, Jan Beulich wrote:
>> On 17.09.2019 17:30, Andrew Cooper wrote:
>>> On 06/08/2019 14:11, Jan Beulich wrote:
>>>> There's no point setting up tables with more space than a PCI device can
>>>> use. For both MSI and MSI-X we can determine how many interrupts could
>>>> be set up at most. Tables allocated during ACPI table parsing, however,
>>>> will (for now at least) continue to be set up to have maximum size.
>>>>
>>>> Note that until we would want to use sub-page allocations here there's
>>>> no point checking whether MSI is supported by a device - 1 or up to 32
>>>> (or actually 128, due to the change effectively using a reserved
>>>> encoding) IRTEs always mean an order-0 allocation anyway.
>>> Devices which are not MSI-capable don't need an interrupt remapping
>>> table at all.
>> Oh, good point - pin based interrupts use the respective IO-APIC's
>> IRTE.
> 
> A lot of these devices have no interrupt capabilities at all.

Right, but this would be harder to tell than "is neither MSI-X nor
MSI capable".

>>>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>>>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>>>> @@ -1315,11 +1317,8 @@ static int __init amd_iommu_setup_device
>>>>  }
>>>>  
>>>>  amd_iommu_set_intremap_table(
>>>> -    dte,
>>>> -    ivrs_mappings[bdf].intremap_table
>>>> -    ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
>>>> -    : 0,
>>>> -    iommu_intremap);
>>>> +    dte, ivrs_mappings[bdf].intremap_table,
>>>> +    ivrs_mappings[bdf].iommu, iommu_intremap);
>>> Ah - half of this looks like it wants to be in patch 6, rather than here.
>> Hmm, which half?
> 
> The dropping of the ternary expression.
> 
>> I don't see anything misplaced here. The signature
>> of amd_iommu_set_intremap_table() changes only in this patch, not in
>> patch 6.
> 
> If the code isn't misplaced, I can't spot why it is necessary before
> this patch.

As explained in the context of the patch introducing it - there
is a way for ivrs_mappings[bdf].intremap_table to be NULL here.
The handling of this case merely gets moved into
amd_iommu_set_intremap_table() by the patch here.
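
In code terms, the shape after this patch is roughly the following (a sketch
only; the field names follow my reading of the earlier patches in the series,
not the literal code):

/* Sketch: the possibly-NULL virtual pointer is handed to the callee,
 * which decides what to program into the DTE. */
void amd_iommu_set_intremap_table(
    struct amd_iommu_dte *dte, const void *ptr,
    const struct amd_iommu *iommu, bool valid)
{
    if ( ptr )
        dte->it_root = virt_to_maddr(ptr) >> 6;  /* table present */
    else
        dte->it_root = 0;                        /* no table for this device */
    /* ... int_tab_len, int_ctl, and the valid bit set up accordingly ... */
}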

>>>> @@ -80,17 +81,13 @@ unsigned int nr_ioapic_sbdf;
>>>>  
>>>>  static void dump_intremap_tables(unsigned char key);
>>>>  
>>>> -static unsigned int __init intremap_table_order(const struct
>>>> amd_iommu *iommu)
>>>> -{
>>>> -    return iommu->ctrl.ga_en
>>>> -   ? get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>>>> irte128))
>>>> -   : get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>>>> irte32));
>>>> -}
>>>> +#define intremap_page_order(irt) PFN_ORDER(virt_to_page(irt))
>>> What makes the frameable order field safe to use?  It reaches into
>>> (pg)->v.free.order which fairly obviously isn't safe for allocated pages.
>> The same argument that allows xmalloc_whole_pages() and xfree() to
>> use this field: It is the owner of a page who controls how the
>> individual sub-fields of a union get used. As long as v.inuse and
>> v.sh don't get used, (ab)using v.free for an allocated page is
>> quite fine.
> 
> In which case I think v.free needs renaming and/or the comment for
> PFN_ORDER() needs rewriting.
> 
> The current code/documentation does not give the impression that the
> current uses of PFN_ORDER() are safe.
> 
> Judging by the other users (particularly the IOMMU code), it would be
> best in a field called opaque or similar.

Well, perhaps. But I don't fancy doing this in this series.

>>> virt_to_page() is a non-trivial calculation, which is now used in a
>>> large number of circumstances.  I don't have an easy judgement of
>>> whether they are hotpaths, but surely it would be easier to just store
>>> another unsigned int per device.
>> Except this would be a vendor specific field in a supposedly
>> vendor agnostic structure. I'm not very keen to add such a field.
>> Also I don't think interrupt setup/teardown paths would normally
>> be considered "hot".
>>
>> What I could be talked into (to limit code size in case the
>> compiler doesn't expand things inline overly aggressively) is to
>> make this a static function instead of a macro.
> 
> I considered asking for that, but I expected not to get very far, given
> that the use of PFN_ORDER() as an lvalue seems to be the prevailing idiom.

Oh, of course - that's why it's a macro in the first place. It's
been a long while since I put together this change ...

>>> Furthermore, it would work around a preexisting issue where we can
>>> allocate beyond the number of interrupts for the device, up to the next
>>> order boundary.
>> Is this really an "issue", rather than just an irrelevant side
>> effect (which is never going to be hit as long as other layers
>> work correctly, in that they bound requests appropriately)?
> 
> It's not major, certainly (and I wouldn't object to the 

Re: [Xen-devel] [PATCH v5 10/10] AMD/IOMMU: restrict interrupt remapping table sizes

2019-09-18 Thread Andrew Cooper
On 18/09/2019 09:11, Jan Beulich wrote:
> On 17.09.2019 17:30, Andrew Cooper wrote:
>> On 06/08/2019 14:11, Jan Beulich wrote:
>>> There's no point setting up tables with more space than a PCI device can
>>> use. For both MSI and MSI-X we can determine how many interrupts could
>>> be set up at most. Tables allocated during ACPI table parsing, however,
>>> will (for now at least) continue to be set up to have maximum size.
>>>
>>> Note that until we would want to use sub-page allocations here there's
>>> no point checking whether MSI is supported by a device - 1 or up to 32
>>> (or actually 128, due to the change effectively using a reserved
>>> encoding) IRTEs always mean an order-0 allocation anyway.
>> Devices which are not MSI-capable don't need an interrupt remapping
>> table at all.
> Oh, good point - pin based interrupts use the respective IO-APIC's
> IRTE.

A lot of these devices have no interrupt capabilities at all.

>
>> Per my calculations, the Rome SDP has 62 devices with MSI/MSI-X support,
>> and 98 devices which are CPU-internals that have no interrupt support at
>> all.
>>
>> In comparison, for a production Cascade Lake system I have to hand, the
>> stats are 92 non-MSI devices and 18 MSI-capable devices (which isn't a
>> valid direct comparison due to how VT-d's remapping tables work, but is
>> a datapoint on "similar looking systems").
>>
>> I'm happy to leave "no IRTs for non-capable devices" for future work,
>> but at the very least, I don't think the commit message wants phrasing
>> in exactly this way.
> I think it would be better to correct this right away, before it
> goes in. I don't think it'll be overly much of a change to add an
> MSI capability check next to the MSI-X one.

Ok.  Even better.

>
>>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>>> @@ -1315,11 +1317,8 @@ static int __init amd_iommu_setup_device
>>>  }
>>>  
>>>  amd_iommu_set_intremap_table(
>>> -    dte,
>>> -    ivrs_mappings[bdf].intremap_table
>>> -    ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
>>> -    : 0,
>>> -    iommu_intremap);
>>> +    dte, ivrs_mappings[bdf].intremap_table,
>>> +    ivrs_mappings[bdf].iommu, iommu_intremap);
>> Ah - half of this looks like it wants to be in patch 6, rather than here.
> Hmm, which half?

The dropping of the ternary expression.

> I don't see anything misplaced here. The signature
> of amd_iommu_set_intremap_table() changes only in this patch, not in
> patch 6.

If the code isn't misplaced, I can't spot why it is necessary before
this patch.

>
>>> @@ -80,17 +81,13 @@ unsigned int nr_ioapic_sbdf;
>>>  
>>>  static void dump_intremap_tables(unsigned char key);
>>>  
>>> -static unsigned int __init intremap_table_order(const struct
>>> amd_iommu *iommu)
>>> -{
>>> -    return iommu->ctrl.ga_en
>>> -   ? get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>>> irte128))
>>> -   : get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>>> irte32));
>>> -}
>>> +#define intremap_page_order(irt) PFN_ORDER(virt_to_page(irt))
>> What makes the frameable order field safe to use?  It reaches into
>> (pg)->v.free.order which fairly obviously isn't safe for allocated pages.
> The same argument that allows xmalloc_whole_pages() and xfree() to
> use this field: It is the owner of a page who controls how the
> individual sub-fields of a union get used. As long as v.inuse and
> v.sh don't get used, (ab)using v.free for an allocated page is
> quite fine.

In which case I think v.free needs renaming and/or the comment for
PFN_ORDER() needs rewriting.

The current code/documentation does not give the impression that the
current uses of PFN_ORDER() are safe.

Judging by the other users (particularly the IOMMU code), it would be
best in a field called opaque or similar.

>
>> virt_to_page() is a non-trivial calculation, which is now used in a
>> large number of circumstances.  I don't have an easy judgement of
>> whether they are hotpaths, but surely it would be easier to just store
>> another unsigned int per device.
> Except this would be a vendor specific field in a supposedly
> vendor agnostic structure. I'm not very keen to add such a field.
> Also I don't think interrupt setup/teardown paths would normally
> be considered "hot".
>
> What I could be talked into (to limit code size in case the
> compiler doesn't expand things inline overly aggressively) is to
> make this a static function instead of a macro.

I considered asking for that, but I expected not to get very far, given
that the use of PFN_ORDER() as an lvalue seems to be the prevailing idiom.

>> Furthermore, it would work around a preexisting issue where we can
>> allocate beyond the number of interrupts for the device, up to the next
>> order boundary.
> Is this really an "issue", rather than just an irrelevant side
> effect (which is 

Re: [Xen-devel] [PATCH v5 10/10] AMD/IOMMU: restrict interrupt remapping table sizes

2019-09-18 Thread Jan Beulich
On 17.09.2019 17:30, Andrew Cooper wrote:
> On 06/08/2019 14:11, Jan Beulich wrote:
>> There's no point setting up tables with more space than a PCI device can
>> use. For both MSI and MSI-X we can determine how many interrupts could
>> be set up at most. Tables allocated during ACPI table parsing, however,
>> will (for now at least) continue to be set up to have maximum size.
>>
>> Note that until we would want to use sub-page allocations here there's
>> no point checking whether MSI is supported by a device - 1 or up to 32
>> (or actually 128, due to the change effectively using a reserved
>> encoding) IRTEs always mean an order-0 allocation anyway.
> 
> Devices which are not MSI-capable don't need an interrupt remapping
> table at all.

Oh, good point - pin based interrupts use the respective IO-APIC's
IRTE.

> Per my calculations, the Rome SDP has 62 devices with MSI/MSI-X support,
> and 98 devices which are CPU-internals that have no interrupt support at
> all.
> 
> In comparison, for a production Cascade Lake system I have to hand, the
> stats are 92 non-MSI devices and 18 MSI-capable devices (which isn't a
> valid direct comparison due to how VT-d's remapping tables work, but is
> a datapoint on "similar looking systems").
> 
> I'm happy to leave "no IRTs for non-capable devices" for future work,
> but at the very least, I don't think the commit message wants phrasing
> in exactly this way.

I think it would be better to correct this right away, before it
goes in. I don't think it'll be overly much of a change to add an
MSI capability check next to the MSI-X one.
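
The check itself ought to be pretty compact; something along these lines at
the allocation site (a rough sketch only - the capability lookup and field
names are taken from current code and are an assumption, not the actual
change):

/* Sketch: size the IRT from the device's MSI-X/MSI capabilities, and
 * allocate nothing for devices with neither. */
unsigned int nr = 0;

if ( pdev->msix )
    nr = pdev->msix->nr_entries;
else if ( pci_find_cap_offset(pdev->seg, pdev->bus, PCI_SLOT(devfn),
                              PCI_FUNC(devfn), PCI_CAP_ID_MSI) )
    nr = multi_msi_capable(~0u);

if ( nr )
    ivrs_mappings[bdf].intremap_table =
        amd_iommu_alloc_intremap_table(
            iommu, &ivrs_mappings[bdf].intremap_inuse, nr);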

>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -1315,11 +1317,8 @@ static int __init amd_iommu_setup_device
>>  }
>>  
>>  amd_iommu_set_intremap_table(
>> -    dte,
>> -    ivrs_mappings[bdf].intremap_table
>> -    ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
>> -    : 0,
>> -    iommu_intremap);
>> +    dte, ivrs_mappings[bdf].intremap_table,
>> +    ivrs_mappings[bdf].iommu, iommu_intremap);
> 
> Ah - half of this looks like it wants to be in patch 6, rather than here.

Hmm, which half? I don't see anything misplaced here. The signature
of amd_iommu_set_intremap_table() changes only in this patch, not in
patch 6.

>> @@ -80,17 +81,13 @@ unsigned int nr_ioapic_sbdf;
>>  
>>  static void dump_intremap_tables(unsigned char key);
>>  
>> -static unsigned int __init intremap_table_order(const struct
>> amd_iommu *iommu)
>> -{
>> -    return iommu->ctrl.ga_en
>> -   ? get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>> irte128))
>> -   : get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
>> irte32));
>> -}
>> +#define intremap_page_order(irt) PFN_ORDER(virt_to_page(irt))
> 
> What makes the frameable order field safe to use?  It reaches into
> (pg)->v.free.order which fairly obviously isn't safe for allocated pages.

The same argument that allows xmalloc_whole_pages() and xfree() to
use this field: It is the owner of a page who controls how the
individual sub-fields of a union get used. As long as v.inuse and
v.sh don't get used, (ab)using v.free for an allocated page is
quite fine.
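
To spell out the pattern (a minimal sketch, assuming the usual xenheap
interfaces):

/* The allocation's owner stashes the order in PFN_ORDER() of the first
 * page and recovers it at teardown, so no separate field is needed. */
void *tbl = alloc_xenheap_pages(order, 0);

if ( tbl )
    PFN_ORDER(virt_to_page(tbl)) = order;

/* ... later, at teardown ... */
free_xenheap_pages(tbl, PFN_ORDER(virt_to_page(tbl)));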

> virt_to_page() is a non-trivial calculation, which is now used in a
> large number of circumstances.  I don't have an easy judgement of
> whether they are hotpaths, but surely it would be easier to just store
> another unsigned int per device.

Except this would be a vendor specific field in a supposedly
vendor agnostic structure. I'm not very keen to add such a field.
Also I don't think interrupt setup/teardown paths would normally
be considered "hot".

What I could be talked into (to limit code size in case the
compiler doesn't expand things inline overly aggressively) is to
make this a static function instead of a macro.

> Furthermore, it would work around a preexisting issue where we can
> allocate beyond the number of interrupts for the device, up to the next
> order boundary.

Is this really an "issue", rather than just an irrelevant side
effect (which is never going to be hit as long as other layers
work correctly, in that they bound requests appropriately)?

Jan


Re: [Xen-devel] [PATCH v5 10/10] AMD/IOMMU: restrict interrupt remapping table sizes

2019-09-17 Thread Andrew Cooper
On 06/08/2019 14:11, Jan Beulich wrote:
> There's no point setting up tables with more space than a PCI device can
> use. For both MSI and MSI-X we can determine how many interrupts could
> be set up at most. Tables allocated during ACPI table parsing, however,
> will (for now at least) continue to be set up to have maximum size.
>
> Note that until we would want to use sub-page allocations here there's
> no point checking whether MSI is supported by a device - 1 or up to 32
> (or actually 128, due to the change effectively using a reserved
> encoding) IRTEs always mean an order-0 allocation anyway.

Devices which are not MSI-capable don't need an interrupt remapping
table at all.

Per my calculations, the Rome SDP has 62 devices with MSI/MSI-X support,
and 98 devices which are CPU-internals that have no interrupt support at
all.

In comparison, for a production Cascade Lake system I have to hand, the
stats are 92 non-MSI devices and 18 MSI-capable devices (which isn't a
valid direct comparison due to how VT-d's remapping tables work, but is
a datapoint on "similar looking systems").

I'm happy to leave "no IRTs for non-capable devices" for future work,
but at the very least, I don't think the commit message wants phrasing
in exactly this way.

>
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -1315,11 +1317,8 @@ static int __init amd_iommu_setup_device
>  }
>  
>  amd_iommu_set_intremap_table(
> -    dte,
> -    ivrs_mappings[bdf].intremap_table
> -    ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
> -    : 0,
> -    iommu_intremap);
> +    dte, ivrs_mappings[bdf].intremap_table,
> +    ivrs_mappings[bdf].iommu, iommu_intremap);

Ah - half of this looks like it wants to be in patch 6, rather than here.

>  }
>  }
>  
> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -69,7 +69,8 @@ union irte_cptr {
>  const union irte128 *ptr128;
>  } __transparent__;
>  
> -#define INTREMAP_MAX_ENTRIES (1 << IOMMU_INTREMAP_ORDER)
> +#define INTREMAP_MAX_ORDER   0xB
> +#define INTREMAP_MAX_ENTRIES (1 << INTREMAP_MAX_ORDER)
>  
>  struct ioapic_sbdf ioapic_sbdf[MAX_IO_APICS];
>  struct hpet_sbdf hpet_sbdf;
> @@ -80,17 +81,13 @@ unsigned int nr_ioapic_sbdf;
>  
>  static void dump_intremap_tables(unsigned char key);
>  
> -static unsigned int __init intremap_table_order(const struct
> amd_iommu *iommu)
> -{
> -    return iommu->ctrl.ga_en
> -   ? get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
> irte128))
> -   : get_order_from_bytes(INTREMAP_MAX_ENTRIES * sizeof(union
> irte32));
> -}
> +#define intremap_page_order(irt) PFN_ORDER(virt_to_page(irt))

What makes the frameable order field safe to use?  It reaches into
(pg)->v.free.order which fairly obviously isn't safe for allocated pages.

virt_to_page() is a non-trivial calculation, which is now used in a
large number of circumstances.  I don't have an easy judgement of
whether they are hotpaths, but surely it would be easier to just store
another unsigned int per device.

Furthermore, it would work around a preexisting issue where we can
allocate beyond the number of interrupts for the device, up to the next
order boundary.
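
To illustrate the rounding (worked numbers only, not code from the patch):
with GA mode enabled an IRTE is 16 bytes, so a device with e.g. 5 MSI-X
entries still gets a whole page:

unsigned int nr    = 5;
unsigned int order = get_order_from_bytes(nr * sizeof(union irte128)); /* 0   */
unsigned int slots = (PAGE_SIZE << order) / sizeof(union irte128);     /* 256 */

i.e. 251 more IRTEs than the device can legitimately use.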

> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -471,16 +471,15 @@ static int amd_iommu_add_device(u8 devfn
>  {
>  ivrs_mappings[bdf].intremap_table =
>  amd_iommu_alloc_intremap_table(
> -    iommu, &ivrs_mappings[bdf].intremap_inuse);
> +    iommu, &ivrs_mappings[bdf].intremap_inuse,
> +    pdev->msix ? pdev->msix->nr_entries
> +   : multi_msi_capable(~0u));
>  if ( !ivrs_mappings[bdf].intremap_table )
>  return -ENOMEM;
>  
>  amd_iommu_set_intremap_table(
>  iommu->dev_table.buffer + (bdf *
> IOMMU_DEV_TABLE_ENTRY_SIZE),
> -    ivrs_mappings[bdf].intremap_table
> -    ? virt_to_maddr(ivrs_mappings[bdf].intremap_table)
> -    : 0,
> -    iommu_intremap);
> +    ivrs_mappings[bdf].intremap_table, iommu, iommu_intremap);
>  

Similarly for patch 6 here.

~Andrew
