Re: [Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support

2019-07-29 Thread Peter Xu
On Mon, Jul 29, 2019 at 01:04:41PM -0600, Alex Williamson wrote:
> On Mon, 29 Jul 2019 16:26:46 +0800
> Peter Xu  wrote:
> 
> > On Fri, Jul 26, 2019 at 06:55:53PM -0600, Alex Williamson wrote:
> > > When we account for DMA aliases in the PCI address space, we can no
> > > longer use a single IVHD entry in the IVRS covering all devices.  We
> > > instead need to walk the PCI bus and create alias ranges when we find
> > > a conventional bus.  These alias ranges cannot overlap with a "Select
> > > All" range (as currently implemented), so we also need to enumerate
> > > each device with IVHD entries.
> > > 
> > > Importantly, the IVHD entries used here include a Device ID, which is
> > > simply the PCI BDF (Bus/Device/Function).  The guest firmware is
> > > responsible for programming bus numbers, so the final revision of this
> > > table depends on the update mechanism (acpi_build_update) to be called
> > > after guest PCI enumeration.  
> > 
> > Ouch... so the ACPI build procedure runs after the guest PCI code!
> > Could I ask how you found this? :) It certainly seems much easier
> > this way...
> 
> I believe this is what MST was referring to with the MCFG update,
> acpi_build() is called from both acpi_setup() and acpi_build_update(),
> the latter is set up in numerous places to be called via a mechanism
> that re-writes the table on certain guest actions.  For instance
> acpi_add_rom_blob() passes this function as a callback which turns into
> a select_cb in fw_cfg, such that (aiui) the tables are updated before
> the guest reads them.  I added some fprintfs in the PCI config write
> path to confirm that the update callback occurs after PCI enumeration
> for both SeaBIOS and OVMF.  Since we seem to have other dependencies on
> this ordering elsewhere, I don't think that the IVRS update is unique
> in this regard.

Agreed.

[...]

> > We've implemented similar logic multiple times:
> > 
> >   - When we want to do DMA (pci_requester_id)
> >   - When we want to fetch the DMA address space (the previous patch)
> >   - When we fill in the AMD ACPI table (this patch)
> > 
> > Do you think we can generalize them somehow?  I'm wondering whether
> > we could directly fetch the RID in the 2nd/3rd use cases using
> > pci_requester_id() (which already exists)?
> 
> For this patch, I think we could use pci_requester_id() for dev_id_b
> above, but we still need to walk the buses and devices, so it really
> only simplifies the 'if (pci_is_express...' code block above, and
> barely at that since we're at the point in the topology where such a
> decision is made.  For the previous patch, pci_requester_id() returns a
> BDF and that code is executed before bus numbers are programmed.  We
> might still make use of requester_id_cache, but a different interface
> would be necessary.  Also note how pci_req_id_cache_get() assumes we're
> looking for the requester ID as seen from the root bus while
> pci_device_iommu_address_space() is from the bus hosting iommu_fn.
> While these are generally the same in practice, it's not required.  I'd
> therefore tend to leave that to some future consolidation.  I can
> update the comment to include that justification in the previous patch.

Yes, we can build on top of this in the future if needed.  I see that
Michael already plans to merge this version, so it may not be worth a
repost just for the comment (if there is a repost anyway, we could add
a TODO).
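
To illustrate the distinction being deferred here, condensed from
hw/pci/pci.c rather than quoted verbatim (the helper name is
illustrative, and the field names assume QEMU's PCIBus internals):

static PCIBus *iommu_host_bus(PCIBus *bus)
{
    /*
     * pci_device_iommu_address_space() effectively stops here, at the
     * first bus in the hierarchy that registered an iommu_fn...
     */
    while (bus && !bus->iommu_fn && bus->parent_dev) {
        bus = pci_get_bus(bus->parent_dev);
    }
    /*
     * ...whereas the requester-ID walk resolves aliases relative to
     * the root bus (pci_bus_is_root()), so the two starting points can
     * diverge when the IOMMU is not registered on the root bus.
     */
    return bus;
}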

> 
> > > +    /*
> > > +     * A PCI bus walk, for each PCI host bridge, is necessary to create a
> > > +     * complete set of IVHD entries.  Do this into a separate blob so that we
> > > +     * can calculate the total IVRS table length here and then append the new
> > > +     * blob further below.  Fall back to an entry covering all devices, which
> > > +     * is sufficient when no aliases are present.
> > > +     */
> > > +    object_child_foreach_recursive(object_get_root(),
> > > +                                   ivrs_host_bridges, ivhd_blob);
> > > +
> > > +    if (!ivhd_blob->len) {
> > > +        /*
> > > +         *   Type 1 device entry reporting all devices
> > > +         *   These are 4-byte device entries currently reporting the range of
> > > +         *   Refer to Spec - Table 95:IVHD Device Entry Type Codes(4-byte)
> > > +         */
> > > +        build_append_int_noprefix(ivhd_blob, 0x001, 4);
> > > +    }
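
For reference, every 4-byte IVHD device entry in this patch packs the
entry type into byte 0 and the DeviceID (a BDF) into bytes 1-2, per
Table 95 of the AMD IOMMU spec; hence the "bdf << 8 | type" pattern
throughout.  A minimal sketch of the encoding (the helper name is ours,
not the patch's):

static uint32_t ivhd_entry4(uint8_t type, uint16_t devid, uint8_t data)
{
    /* Byte 0: type; bytes 1-2: DeviceID; byte 3: DTE setting */
    return type | (uint32_t)devid << 8 | (uint32_t)data << 24;
}

/* The select-all fallback above is then just type 1, DeviceID 0: */
build_append_int_noprefix(ivhd_blob, ivhd_entry4(0x01, 0, 0), 4);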
> > 
> > Is there a real use case for ivhd_blob->len==0?
> 
> It was mostly paranoia, but I believe it's really only an Intel
> convention that the PCI host bridge appears as a device on the bus.  It
> seems possible that we could have a host bridge with no devices, in
> which case we'd insert this select-all entry to make the IVRS valid.
> Perhaps in combination with AMD exposing their IOMMU as a device on the
> PCI bus this is not really an issue, but it's a trivial safety net.
> Thanks,

That question was only for curiosity.  This code path will only
trigger when 

Re: [Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support

2019-07-29 Thread Alex Williamson
On Mon, 29 Jul 2019 16:26:46 +0800
Peter Xu  wrote:

> On Fri, Jul 26, 2019 at 06:55:53PM -0600, Alex Williamson wrote:
> > When we account for DMA aliases in the PCI address space, we can no
> > longer use a single IVHD entry in the IVRS covering all devices.  We
> > instead need to walk the PCI bus and create alias ranges when we find
> > a conventional bus.  These alias ranges cannot overlap with a "Select
> > All" range (as currently implemented), so we also need to enumerate
> > each device with IVHD entries.
> > 
> > Importantly, the IVHD entries used here include a Device ID, which is
> > simply the PCI BDF (Bus/Device/Function).  The guest firmware is
> > responsible for programming bus numbers, so the final revision of this
> > table depends on the update mechanism (acpi_build_update) to be called
> > after guest PCI enumeration.  
> 
> Ouch... so the ACPI build procedure runs after the guest PCI code!
> Could I ask how you found this? :) It certainly seems much easier
> this way...

I believe this is what MST was referring to with the MCFG update,
acpi_build() is called from both acpi_setup() and acpi_build_update(),
the latter is set up in numerous places to be called via a mechanism
that re-writes the table on certain guest actions.  For instance
acpi_add_rom_blob() passes this function as a callback which turns into
a select_cb in fw_cfg, such that (aiui) the tables are updated before
the guest reads them.  I added some fprintfs in the PCI config write
path to confirm that the update callback occurs after PCI enumeration
for both SeaBIOS and OVMF.  Since we seem to have other dependencies on
this ordering elsewhere, I don't think that the IVRS update is unique
in this regard.
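
Roughly, the update path has this shape (condensed from
hw/i386/acpi-build.c; the linker/RSDP updates and cleanup are elided,
so treat this as a sketch rather than the exact code):

static void acpi_build_update(void *build_opaque)
{
    AcpiBuildState *build_state = build_opaque;
    AcpiBuildTables tables;

    /* Rebuild at most once per guest-triggered update cycle */
    if (!build_state || build_state->patched) {
        return;
    }
    build_state->patched = 1;

    acpi_build_tables_init(&tables);

    /*
     * By the time fw_cfg invokes this callback, guest firmware has
     * programmed bridge bus numbers, so BDF-based entries such as the
     * IVHD Device IDs come out correct.
     */
    acpi_build(&tables, MACHINE(qdev_get_machine()));

    /* Copy the rebuilt tables over the blob the guest is about to read */
    acpi_ram_update(build_state->table_mr, tables.table_data);
}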
 
> This looks very nice to me already, though I still have a few
> questions, please see below.
> 
> [...]
> 
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
> > +        PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(dev));
> > +        uint8_t sec = pci_bus_num(sec_bus);
> > +        uint8_t sub = dev->config[PCI_SUBORDINATE_BUS];
> > +
> > +        if (pci_bus_is_express(sec_bus)) {
> > +            /*
> > +             * Walk the bus if there are subordinates, otherwise use a range
> > +             * to cover an entire leaf bus.  We could potentially also use a
> > +             * range for traversed buses, but we'd need to take care not to
> > +             * create both Select and Range entries covering the same device.
> > +             * This is easier and potentially more compact.
> > +             *
> > +             * An example bare metal system seems to use Select entries for
> > +             * root ports without a slot (ie. built-ins) and Range entries
> > +             * when there is a slot.  The same system also only hard-codes
> > +             * the alias range for an onboard PCIe-to-PCI bridge, apparently
> > +             * making no effort to support nested bridges.  We attempt to
> > +             * be more thorough here.
> > +             */
> > +            if (sec == sub) { /* leaf bus */
> > +                /* "Start of Range" IVHD entry, type 0x3 */
> > +                entry = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0)) << 8 | 0x3;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +                /* "End of Range" IVHD entry, type 0x4 */
> > +                entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +            } else {
> > +                pci_for_each_device(sec_bus, sec, insert_ivhd, table_data);
> > +            }
> > +        } else {
> > +            /*
> > +             * If the secondary bus is conventional, then we need to create an
> > +             * Alias range for everything downstream.  The range covers the
> > +             * first devfn on the secondary bus to the last devfn on the
> > +             * subordinate bus.  The alias target depends on legacy versus
> > +             * express bridges, just as in pci_device_iommu_address_space().
> > +             * DeviceIDa vs DeviceIDb as per the AMD IOMMU spec.
> > +             */
> > +            uint16_t dev_id_a, dev_id_b;
> > +
> > +            dev_id_a = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0));
> > +
> > +            if (pci_is_express(dev) &&
> > +                pcie_cap_get_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                dev_id_b = dev_id_a;
> > +            } else {
> > +                dev_id_b = PCI_BUILD_BDF(pci_bus_num(bus), dev->devfn);
> > +            }
> > +
> > +            /* "Alias Start of Range" IVHD entry, type 0x43, 8 bytes */
> > +            build_append_int_noprefix(table_data, dev_id_a << 8 | 0x43, 4);
> > +            build_append_int_noprefix(table_data, dev_id_b << 8 | 0x0, 4);
> > +
> > +            /* "End of Range" IVHD entry, type 0x4 */
> > +            entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +            build_append_int_noprefix(table_data, entry, 4);
> > +        }

Re: [Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support

2019-07-29 Thread Peter Xu
On Fri, Jul 26, 2019 at 06:55:53PM -0600, Alex Williamson wrote:
> When we account for DMA aliases in the PCI address space, we can no
> longer use a single IVHD entry in the IVRS covering all devices.  We
> instead need to walk the PCI bus and create alias ranges when we find
> a conventional bus.  These alias ranges cannot overlap with a "Select
> All" range (as currently implemented), so we also need to enumerate
> each device with IVHD entries.
> 
> Importantly, the IVHD entries used here include a Device ID, which is
> simply the PCI BDF (Bus/Device/Function).  The guest firmware is
> responsible for programming bus numbers, so the final revision of this
> table depends on the update mechanism (acpi_build_update) to be called
> after guest PCI enumeration.

Ouch... so the ACPI build procedure runs after the guest PCI code!
Could I ask how you found this? :) It certainly seems much easier
this way...

This looks very nice to me already, though I still have a few
questions, please see below.

[...]

> +    if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
> +        PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(dev));
> +        uint8_t sec = pci_bus_num(sec_bus);
> +        uint8_t sub = dev->config[PCI_SUBORDINATE_BUS];
> +
> +        if (pci_bus_is_express(sec_bus)) {
> +            /*
> +             * Walk the bus if there are subordinates, otherwise use a range
> +             * to cover an entire leaf bus.  We could potentially also use a
> +             * range for traversed buses, but we'd need to take care not to
> +             * create both Select and Range entries covering the same device.
> +             * This is easier and potentially more compact.
> +             *
> +             * An example bare metal system seems to use Select entries for
> +             * root ports without a slot (ie. built-ins) and Range entries
> +             * when there is a slot.  The same system also only hard-codes
> +             * the alias range for an onboard PCIe-to-PCI bridge, apparently
> +             * making no effort to support nested bridges.  We attempt to
> +             * be more thorough here.
> +             */
> +            if (sec == sub) { /* leaf bus */
> +                /* "Start of Range" IVHD entry, type 0x3 */
> +                entry = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0)) << 8 | 0x3;
> +                build_append_int_noprefix(table_data, entry, 4);
> +                /* "End of Range" IVHD entry, type 0x4 */
> +                entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> +                build_append_int_noprefix(table_data, entry, 4);
> +            } else {
> +                pci_for_each_device(sec_bus, sec, insert_ivhd, table_data);
> +            }
> +        } else {
> +            /*
> +             * If the secondary bus is conventional, then we need to create an
> +             * Alias range for everything downstream.  The range covers the
> +             * first devfn on the secondary bus to the last devfn on the
> +             * subordinate bus.  The alias target depends on legacy versus
> +             * express bridges, just as in pci_device_iommu_address_space().
> +             * DeviceIDa vs DeviceIDb as per the AMD IOMMU spec.
> +             */
> +            uint16_t dev_id_a, dev_id_b;
> +
> +            dev_id_a = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0));
> +
> +            if (pci_is_express(dev) &&
> +                pcie_cap_get_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                dev_id_b = dev_id_a;
> +            } else {
> +                dev_id_b = PCI_BUILD_BDF(pci_bus_num(bus), dev->devfn);
> +            }
> +
> +            /* "Alias Start of Range" IVHD entry, type 0x43, 8 bytes */
> +            build_append_int_noprefix(table_data, dev_id_a << 8 | 0x43, 4);
> +            build_append_int_noprefix(table_data, dev_id_b << 8 | 0x0, 4);
> +
> +            /* "End of Range" IVHD entry, type 0x4 */
> +            entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> +            build_append_int_noprefix(table_data, entry, 4);
> +        }
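
As a side note on the encoding, the 8-byte "Alias Start of Range" entry
is really two 4-byte words: DeviceIDa (the range start) sits in bytes
1-2 of the first word above the 0x43 type byte, and DeviceIDb (the
alias target) sits in bytes 1-2 of the second word.  A sketch using a
helper of our own naming, not taken from the patch:

static void ivhd_append_alias_range(GArray *table, uint16_t dev_id_a,
                                    uint16_t dev_id_b, uint16_t end_bdf)
{
    /* Word 1: type 0x43 in byte 0, DeviceIDa (range start) above it */
    build_append_int_noprefix(table, (uint32_t)dev_id_a << 8 | 0x43, 4);
    /* Word 2: DeviceIDb (alias target), type byte left as zero */
    build_append_int_noprefix(table, (uint32_t)dev_id_b << 8, 4);
    /* The range is closed by a 4-byte "End of Range" entry, type 0x4 */
    build_append_int_noprefix(table, (uint32_t)end_bdf << 8 | 0x4, 4);
}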

We've implemented similar logic multiple times:

  - When we want to do DMA (pci_requester_id)
  - When we want to fetch the DMA address space (the previous patch)
  - When we fill in the AMD ACPI table (this patch)

Do you think we can generalize them somehow?  I'm wondering whether we
could directly fetch the RID in the 2nd/3rd use cases using
pci_requester_id() (which already exists)?
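
For concreteness, a shared helper might look something like the sketch
below.  This is purely hypothetical (the name pci_alias_bdf is ours),
and it glosses over the starting-point question, i.e. root bus versus
the bus hosting iommu_fn:

static uint16_t pci_alias_bdf(PCIDevice *dev)
{
    PCIBus *bus = pci_get_bus(dev);
    uint16_t bdf = PCI_BUILD_BDF(pci_bus_num(bus), dev->devfn);

    while (!pci_bus_is_root(bus)) {
        PCIDevice *bridge = bus->parent_dev;

        if (!pci_bus_is_express(bus)) {
            if (pci_is_express(bridge) &&
                pcie_cap_get_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE) {
                /* PCIe-to-PCI bridge: alias to devfn 0 on the secondary
                 * bus, matching dev_id_b == dev_id_a in the patch */
                bdf = PCI_BUILD_BDF(pci_bus_num(bus), PCI_DEVFN(0, 0));
            } else {
                /* Conventional bridge: alias to the bridge's own BDF */
                bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(bridge)),
                                    bridge->devfn);
            }
        }
        bus = pci_get_bus(bridge);
    }
    return bdf;
}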

[...]

> +    /*
> +     * A PCI bus walk, for each PCI host bridge, is necessary to create a
> +     * complete set of IVHD entries.  Do this into a separate blob so that we
> +     * can calculate the total IVRS table length here and then append the new
> +     * blob further below.  Fall back to an entry covering all devices, which
> +     * is sufficient when no aliases are present.
> +     */
> +    object_child_foreach_recursive(object_get_root(),
> +                                   ivrs_host_bridges, ivhd_blob);

[Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support

2019-07-26 Thread Alex Williamson
When we account for DMA aliases in the PCI address space, we can no
longer use a single IVHD entry in the IVRS covering all devices.  We
instead need to walk the PCI bus and create alias ranges when we find
a conventional bus.  These alias ranges cannot overlap with a "Select
All" range (as currently implemented), so we also need to enumerate
each device with IVHD entries.

Importantly, the IVHD entries used here include a Device ID, which is
simply the PCI BDF (Bus/Device/Function).  The guest firmware is
responsible for programming bus numbers, so the final revision of this
table depends on the update mechanism (acpi_build_update) to be called
after guest PCI enumeration.

For an example guest configuration of:

-+-[:40]---00.0-[41]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
         +-01.0  Device 1234:
         +-02.0-[01]----00.0  Intel Corporation 82574L Gigabit Network Connection
         +-02.1-[02]----00.0  Red Hat, Inc. QEMU XHCI Host Controller
         +-02.2-[03]--
         +-02.3-[04]--
         +-02.4-[05]--
         +-02.5-[06-09]----00.0-[07-09]--+-00.0-[08]--
         |                               \-01.0-[09]----00.0  Intel Corporation 82574L Gigabit Network Connection
         +-02.6-[0a-0c]----00.0-[0b-0c]--+-01.0-[0c]--
         |                               \-03.0  Intel Corporation 82540EM Gigabit Ethernet Controller
         +-02.7-[0d]----0e.0  Intel Corporation 82540EM Gigabit Ethernet Controller
         +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
         +-04.0  Advanced Micro Devices, Inc. [AMD] Device 0020
         +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
         +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
         \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

Where we have:

00:02.7 PCI bridge: Intel Corporation 82801 PCI Bridge
 (dmi-to-pci-bridge)
00:03.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
 (pcie-expander-bus)
06:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Upstream)
 (pcie-switch-upstream-port)
07:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream)
 (pcie-switch-downstream-port)
07:01.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream)
 (pcie-switch-downstream-port)
0a:00.0 PCI bridge: Red Hat, Inc. Device 000e
 (pcie-to-pci-bridge)

The following IVRS table is produced:

AMD-Vi: Using IVHD type 0x10
AMD-Vi: device: 00:04.0 cap: 0040 seg: 0 flags: d1 info 
AMD-Vi:   mmio-addr: fed8
AMD-Vi:   DEV_SELECT devid: 40:00.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 41:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 41:1f.7
AMD-Vi:   DEV_SELECT devid: 00:00.0 flags: 00
AMD-Vi:   DEV_SELECT devid: 00:01.0 flags: 00
AMD-Vi:   DEV_SELECT devid: 00:02.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 01:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 01:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.1 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 02:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 02:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.2 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 03:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 03:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.3 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 04:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 04:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.4 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 05:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 05:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.5 flags: 00
AMD-Vi:   DEV_SELECT devid: 06:00.0 flags: 00
AMD-Vi:   DEV_SELECT devid: 07:00.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 08:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 08:1f.7
AMD-Vi:   DEV_SELECT devid: 07:01.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START devid: 09:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END  devid: 09:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.6 flags: 00
AMD-Vi:   DEV_SELECT devid: 0a:00.0 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE devid: 0b:00.0 flags: 00 devid_to: 0b:00.0
AMD-Vi:   DEV_RANGE_END  devid: 0c:1f.7
AMD-Vi:   DEV_SELECT devid: 00:02.7 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE devid: 0d:00.0 flags: 00 devid_to: 00:02.7
AMD-Vi:   DEV_RANGE_END  devid: 0d:1f.7
AMD-Vi:   DEV_SELECT devid: 00:03.0 flags: 00
AMD-Vi:   DEV_SELECT