Re: [Qemu-devel] [Question] Memory hotplug clarification for Qemu ARM/virt

2019-05-08 Thread Robin Murphy

Hi Shameer,

On 08/05/2019 11:15, Shameerali Kolothum Thodi wrote:

Hi,

This series[0] attempts to add support for PCDIMM in QEMU for the
ARM/virt platform and has stumbled upon an issue, as it is not clear (at least
from the QEMU/EDK2 point of view) how the kernel handles hotpluggable
memory in the physical world.

The proposed implementation in QEMU builds the SRAT and DSDT parts
and uses a GED device to trigger the hotplug. This works fine.

But when we added the DT node corresponding to the PCDIMM (cold-plug
scenario), we noticed that the guest kernel sees this memory during early boot
even if we are booting with ACPI. Because of this, hotpluggable memory
may end up in ZONE_NORMAL and become non-hot-unpluggable even if the guest
boots with ACPI.

Further discussions[1] revealed that EDK2 UEFI has no means to interpret the
ACPI content from QEMU (it is designed that way) and instead uses the DT info
to build GetMemoryMap(). To solve this, we introduced a "hotpluggable" property
on the DT memory node (patches #7 & #8 from [0]) so that UEFI can differentiate
the nodes and exclude the hotpluggable ones from GetMemoryMap().
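
(For illustration only -- a minimal sketch of how such a property can be set
from QEMU's FDT-building code; the node path and the fdt/addr/size variables
below are illustrative, not taken from the series itself:)

    /* Helpers from include/sysemu/device_tree.h; "fdt" is the machine's
     * FDT blob, "addr"/"size" are whatever the PCDIMM was mapped with.
     */
    char *nodename = g_strdup_printf("/memory@%" PRIx64, addr);

    qemu_fdt_add_subnode(fdt, nodename);
    qemu_fdt_setprop_string(fdt, nodename, "device_type", "memory");
    qemu_fdt_setprop_sized_cells(fdt, nodename, "reg", 2, addr, 2, size);
    /* Boolean property: its mere presence would tell UEFI to leave the
     * range out of GetMemoryMap().
     */
    qemu_fdt_setprop(fdt, nodename, "hotpluggable", NULL, 0);
    g_free(nodename);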

But then Laszlo rightly pointed out that in order to accommodate the changes
into UEFI we need to know how exactly Linux expects/handles all the
hotpluggable memory scenarios. Please find the discussion here[2].

For ease, I am just copying the relevant comment from Laszlo below,

/**
"Given patches #7 and #8, as I understand them, the firmware cannot distinguish
  hotpluggable & present, from hotpluggable & absent. The firmware can only
  skip both hotpluggable cases. That's fine in that the firmware will hog 
neither
  type -- but is that OK for the OS as well, for both ACPI boot and DT boot?

Consider in particular the "hotpluggable & present, ACPI boot" case. Assuming
we modify the firmware to skip "hotpluggable" altogether, the UEFI memmap
will not include the range despite it being present at boot. Presumably, ACPI
will refer to the range somehow, however. Will that not confuse the OS?

When Igor raised this earlier, I suggested that hotpluggable-and-present should
be added by the firmware, but also allocated immediately, as EfiBootServicesData
type memory. This will prevent other drivers in the firmware from allocating
AcpiNVS or Reserved chunks from the same memory range, the UEFI memmap will
contain the range as EfiBootServicesData, and then the OS can release that
allocation in one go early during boot.
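
(Purely for illustration -- a rough sketch of what that firmware-side allocation
could look like in EDK2 terms; HotplugRegionBase/HotplugRegionSize are
hypothetical names, and this is not code from any posted patch:)

    //
    // Pre-allocate a hotpluggable-and-present range as EfiBootServicesData so
    // no other firmware driver can carve AcpiNVS/Reserved chunks out of it,
    // while the range still appears in the UEFI memory map for the OS to
    // reclaim after ExitBootServices(). Uses gBS from
    // <Library/UefiBootServicesTableLib.h>.
    //
    EFI_PHYSICAL_ADDRESS  Base   = HotplugRegionBase;
    UINTN                 Pages  = EFI_SIZE_TO_PAGES ((UINTN)HotplugRegionSize);
    EFI_STATUS            Status;

    Status = gBS->AllocatePages (
                    AllocateAddress,        // claim exactly this range
                    EfiBootServicesData,
                    Pages,
                    &Base
                    );
    ASSERT_EFI_ERROR (Status);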

But this really has to be clarified from the Linux kernel's expectations. Please
formalize all of the following cases:

OS boot (DT/ACPI)   hotpluggable & ...   GetMemoryMap() should report as   DT/ACPI should report as
-----------------   ------------------   -------------------------------   ------------------------
DT                  present              ?                                 ?
DT                  absent               ?                                 ?
ACPI                present              ?                                 ?
ACPI                absent               ?                                 ?

Again, this table is dictated by Linux."

**/

Could you please take a look at this and let us know what is expected here from
a Linux kernel viewpoint?


For arm64, so far we've not even been considering DT-based hotplug - as 
far as I'm aware there would still be a big open question there around 
notification mechanisms and how to describe them. The DT stuff so far 
has come from the PowerPC folks, so it's probably worth seeing what 
their ideas are.


ACPI-wise I've always assumed/hoped that hotplug-related things should 
be sufficiently well-specified in UEFI that "do whatever x86/IA-64 do" 
would be enough for us.


Robin.


(Hi Laszlo/Igor/Eric, please feel free to add/change if I have missed any valid
points above).

Thanks,
Shameer
[0] https://patchwork.kernel.org/cover/10890919/
[1] https://patchwork.kernel.org/patch/10863299/
[2] https://patchwork.kernel.org/patch/10890937/






Re: [RFC] virtio-iommu: Take into account possible aliasing in virtio_iommu_mr()

2023-01-20 Thread Robin Murphy

On 2023-01-20 15:28, Jean-Philippe Brucker wrote:

For some reason this came through as a blank mail with a text attachment,
so apologies for the lack of quoting, but for reference this is a
long-standing known issue:


https://lore.kernel.org/linux-iommu/9625faf4-48ef-2dd3-d82f-931d9cf26...@huawei.com/

In summary, yes, hanging {of,acpi}_iommu_configure() off driver probe is 
massively wrong, and pulling that step into iommu_probe_device() in a 
sane and robust manner is one of the next things to start on after 
getting the bus ops out of the way.


Thanks,
Robin.



Re: [PATCH RFC] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs

2024-06-06 Thread Robin Murphy
[...] took a look.
of_dma_configure() is being passed force_dma = true.
https://elixir.bootlin.com/linux/v6.10-rc2/source/drivers/amba/bus.c#L361

There is a comment in of_dma_configure():
/*
 * For legacy reasons, we have to assume some devices need
 * DMA configuration regardless of whether "dma-ranges" is
 * correctly specified or not.
 */
So I think this is being triggered by a workaround for broken DTs.

This was introduced by Robin Murphy (+CC), though you may need to ask on the
kernel list because of ARM / QEMU fun.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=723288836628b

Relevant comment from that patch description:

"Certain bus types have a general expectation of
DMA capability and carry a well-established precedent that an absent
"dma-ranges" implies the same as the empty property, so we automatically
opt those in to DMA configuration regardless, to avoid regressing most
existing platforms."

The patch implies that AMBA is one of those.
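
(For reference, a from-memory paraphrase of the opt-in logic being discussed --
simplified, and the exact helpers differ between kernel versions:)

    #include <linux/of.h>
    #include <linux/of_device.h>

    /* Simplified sketch of of_dma_configure(), not the literal mainline code */
    int of_dma_configure(struct device *dev, struct device_node *np, bool force_dma)
    {
        u64 dma_addr = 0, paddr, size = 0;
        bool coherent;
        int ret;

        ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
        if (ret < 0) {
            /*
             * For legacy reasons, we have to assume some devices need
             * DMA configuration regardless of whether "dma-ranges" is
             * correctly specified or not.
             */
            if (!force_dma)
                return ret == -ENODEV ? 0 : ret;
            /* No usable "dma-ranges": assume a flat 32-bit window */
            size = 1ULL << 32;
        }

        coherent = of_dma_is_coherent(np);
        /* ...masks and arch_setup_dma_ops() follow; the warning in question
         * only fires for the !coherent case...
         */
        return 0;
    }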

So I'm not sure this is solvable without a hack such as eliding the warning
message if force_dma was set, as the situation probably isn't relevant then.


Except it absolutely is, because the whole reason for setting force_dma 
on those buses is that they *do* commonly have DMA-capable devices, and 
they are also commonly non-coherent such that this condition would be 
serious. Especially AMBA, given that the things old enough to still be 
using that abstraction rather than plain platform (PL080, PL111, 
PL330,...) all predate ACE-Lite so don't even have the *possibility* of 
being coherent without external trickery in the interconnect.


Thanks,
Robin.



Re: [PATCH v3] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs

2024-06-12 Thread Robin Murphy

On 2024-06-12 1:50 pm, Philippe Mathieu-Daudé wrote:

On 12/6/24 14:48, Peter Maydell wrote:
On Wed, 12 Jun 2024 at 13:33, Philippe Mathieu-Daudé 
 wrote:


Hi Zhenyu,

On 12/6/24 04:05, Zhenyu Zhang wrote:

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3c93c0c0a6..3cefac6d43 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -271,6 +271,17 @@ static void create_fdt(VirtMachineState *vms)
   qemu_fdt_setprop_cell(fdt, "/", "#size-cells", 0x2);
   qemu_fdt_setprop_string(fdt, "/", "model", "linux,dummy-virt");

+    /*
+     * For QEMU, all DMA is coherent. Advertising this in the root node
+     * has two benefits:
+     *
+     * - It avoids potential bugs where we forget to mark a DMA
+     *   capable device as being dma-coherent
+     * - It avoids spurious warnings from the Linux kernel about
+     *   devices which can't do DMA at all
+     */
+    qemu_fdt_setprop(fdt, "/", "dma-coherent", NULL, 0);


OK, but why restrict that to the AArch64 virt machine?
Shouldn't we advertise this generically in create_device_tree()?
Or otherwise at least in the other virt machines?


create_device_tree() creates an empty device tree, not one
with stuff in it. It seems reasonable to me for this property
on the root to be set in the same place we set other properties
of the root node.


OK. Still, the question about the other virt machines remains
unanswered :)


From the DT consumer point of view, the interpretation and assumptions 
around coherency *are* generally architecture- or platform-specific. For 
instance on RISC-V, many platforms want to assume coherency by default 
(and potentially use "dma-noncoherent" to mark individual devices that 
aren't), while others may still want to do the opposite and use 
"dma-coherent" in the same manner as Arm and AArch64. Neither property 
existed back in ePAPR, so typical PowerPC systems wouldn't even be 
looking and will just make their own assumptions by other means.
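
(To make the consumer side concrete -- a from-memory paraphrase of Linux's
of_dma_is_coherent(), showing how the per-architecture default and the two
properties interact; simplified, not the literal mainline code:)

    #include <linux/of.h>

    /* Walk up the DMA parent chain; the first "dma-coherent" or
     * "dma-noncoherent" found wins, otherwise fall back to the
     * architecture/platform default.
     */
    bool of_dma_is_coherent(struct device_node *np)
    {
        struct device_node *node = of_node_get(np);
        bool is_coherent = dma_default_coherent;  /* arch-specific default */

        while (node) {
            if (of_property_read_bool(node, "dma-coherent")) {
                is_coherent = true;
                break;
            }
            if (of_property_read_bool(node, "dma-noncoherent")) {
                is_coherent = false;
                break;
            }
            node = of_get_next_dma_parent(node);
        }
        of_node_put(node);
        return is_coherent;
    }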


Thanks,
Robin.



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-28 Thread Robin Murphy

On 28/03/2019 10:38, Auger Eric wrote:

Hi Alex,

[+ Robin]

On 3/27/19 5:37 PM, Alex Williamson wrote:

On Wed, 27 Mar 2019 14:25:00 +0800
Peter Xu  wrote:


On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:

Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
distinguish by devfn & bus between devices in a conventional PCI
topology and therefore we cannot assign them separate AddressSpaces.
By taking this requester ID aliasing into account, QEMU better matches
the bare metal behavior and restrictions, and enables shared
AddressSpace configurations that are otherwise not possible with
guest IOMMU support.

For the latter case, given any example where an IOMMU group on the
host includes multiple devices:

   $ ls /sys/kernel/iommu_groups/1/devices/
   0000:00:01.0  0000:01:00.0  0000:01:00.1


[1]



If we incorporate a vIOMMU into the VM configuration, we're restricted
that we can only assign one of the endpoints to the guest because a
second endpoint will attempt to use a different AddressSpace.  VFIO
only supports IOMMU group level granularity at the container level,
preventing this second endpoint from being assigned:

qemu-system-x86_64 -machine q35... \
   -device intel-iommu,intremap=on \
   -device pcie-root-port,addr=1e.0,id=pcie.1 \
   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
0000:01:00.1: group 1 used in multiple address spaces

However, when QEMU incorporates proper aliasing, we can make use of a
PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
provides the downstream devices with the same AddressSpace, ex:

qemu-system-x86_64 -machine q35... \
   -device intel-iommu,intremap=on \
   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

While the utility of this hack may be limited, this AddressSpace
aliasing is the correct behavior for QEMU to emulate bare metal.

Signed-off-by: Alex Williamson 


The patch looks sane to me even as a bug fix since otherwise the DMA
address spaces used under misc kinds of PCI bridges can be wrong, so:


I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
I'd be cautious about this if so.  Eric Auger noted that he's seen an
SMMU VM hit a guest kernel bug-on, which needs further
investigation.  It's not clear if it's just an untested or
unimplemented scenario for SMMU to see a conventional PCI bus or if
there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
only VT-d to a very limited degree, thus RFC.


So I have tracked this further and here is what I can see.

On guest side, the 2 assigned devices that I have put downstream to the
pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one
corresponding to the requester id of the very device and the second one
corresponding to the rid matching the same bus number and devfn=0

dev0 = 0000:02:01.0
       0000:02:00.0

dev1 = 0000:02:01.1
       0000:02:00.0

Then iommu_probe_device() is called for 0000:02:01.0 and 0000:02:01.1.
Each time it iterates over the associated ids and we call add_device
twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
recognizes a context is already alive for 0000:02:00.0 and triggers a
BUG_ON().


Hmm, aliasing bridges are supposed to be handled as of commit 
563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - 
what's changed since then?


Robin.


At the origin of the creation of the two ids for each device:
iort_iommu_configure() is called on each downstream device, which calls
pci_for_each_dma_alias(). We enter the
pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
iort_pci_iommu_init() is called with bus number 2 and devfn=0.
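
(For context, a hedged sketch of that mechanism: an illustrative
pci_for_each_dma_alias() callback collecting the alias RIDs, similar in shape
to iort_pci_iommu_init() but not the actual kernel function:)

    #include <linux/pci.h>

    struct alias_list {
        u32 ids[8];
        int num;
    };

    /* Called once for the device's own RID and once per aliasing bridge on
     * the path to the root. Behind a PCIe-to-PCI bridge the extra alias is
     * (secondary bus, devfn 0), which is how 0000:02:00.0 shows up as a
     * second ID for both functions.
     */
    static int collect_alias(struct pci_dev *pdev, u16 alias, void *data)
    {
        struct alias_list *list = data;

        if (list->num < ARRAY_SIZE(list->ids))
            list->ids[list->num++] = alias;   /* e.g. 0x0208 then 0x0200 */
        return 0;
    }

    /* usage: pci_for_each_dma_alias(to_pci_dev(dev), collect_alias, &list); */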

Thanks

Eric
  

Reviewed-by: Peter Xu 

Though I have a question that confused me even before: Alex, do you
know why all the context entry of the devices in the IOMMU root table
will be programmed even if the devices are under a pcie-to-pci bridge?
I'm giving an example with above [1] to be clear: in that case IIUC
we'll program context entries for all the three devices (00:01.0,
01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
bare metal, so why do we bother to program the context entry of
01:00.1?  It seems never used.

(It should be used for current QEMU to work with pcie-to-pci bridges
  if without this patch, but I feel like I don't know the real answer
  behind)


We actually have two different scenarios that could be represented by
[1]: the group can be formed by lack of isolation or by lack of
visibility.  In the group above, it's the former, isolation.  The PCIe
root port does not support ACS, so while the IOMMU has visibility of
the individual devices, peer-to-peer between

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-28 Thread Robin Murphy

On 28/03/2019 12:53, Auger Eric wrote:

Hi Robin,

On 3/28/19 11:56 AM, Robin Murphy wrote:

On 28/03/2019 10:38, Auger Eric wrote:

Hi Alex,

[+ Robin]

On 3/27/19 5:37 PM, Alex Williamson wrote:

On Wed, 27 Mar 2019 14:25:00 +0800
Peter Xu  wrote:


On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:

Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
distinguish by devfn & bus between devices in a conventional PCI
topology and therefore we cannot assign them separate AddressSpaces.
By taking this requester ID aliasing into account, QEMU better matches
the bare metal behavior and restrictions, and enables shared
AddressSpace configurations that are otherwise not possible with
guest IOMMU support.

For the latter case, given any example where an IOMMU group on the
host includes multiple devices:

    $ ls /sys/kernel/iommu_groups/1/devices/
    0000:00:01.0  0000:01:00.0  0000:01:00.1


[1]



If we incorporate a vIOMMU into the VM configuration, we're restricted
that we can only assign one of the endpoints to the guest because a
second endpoint will attempt to use a different AddressSpace.  VFIO
only supports IOMMU group level granularity at the container level,
preventing this second endpoint from being assigned:

qemu-system-x86_64 -machine q35... \
    -device intel-iommu,intremap=on \
    -device pcie-root-port,addr=1e.0,id=pcie.1 \
    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
0000:01:00.1: group 1 used in multiple address spaces

However, when QEMU incorporates proper aliasing, we can make use of a
PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
provides the downstream devices with the same AddressSpace, ex:

qemu-system-x86_64 -machine q35... \
    -device intel-iommu,intremap=on \
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

While the utility of this hack may be limited, this AddressSpace
aliasing is the correct behavior for QEMU to emulate bare metal.

Signed-off-by: Alex Williamson 


The patch looks sane to me even as a bug fix since otherwise the DMA
address spaces used under misc kinds of PCI bridges can be wrong, so:


I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
I'd be cautious about this if so.  Eric Auger noted that he's seen an
SMMU VM hit a guest kernel bug-on, which needs further
investigation.  It's not clear if it's just an untested or
unimplemented scenario for SMMU to see a conventional PCI bus or if
there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
only VT-d to a very limited degree, thus RFC.


So I have tracked this further and here is what I can see.

On guest side, the 2 assigned devices that I have put downstream to the
pcie-to-pci bridge get an iommu_fwspec handle with 2 ids, the first one
corresponding to the requester id of the very device and the second one
corresponding to the rid matching the same bus number and devfn=0

dev0 = 0000:02:01.0
       0000:02:00.0

dev1 = 0000:02:01.1
       0000:02:00.0

Then iommu_probe_device() is called for 0000:02:01.0 and 0000:02:01.1.
Each time it iterates over the associated ids and we call add_device
twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
recognizes a context is already alive for 0000:02:00.0 and triggers a
BUG_ON().


Hmm, aliasing bridges are supposed to be handled as of commit
563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") -
what's changed since then?


In our case the problem is not that we have the same ids[] for a given device,
but that we have 2 functions attached to the pcie-to-pci bridge and their
fwspecs have an id in common: the one with the subordinate bus and
devfn=0.


Oh, right, the recollection of having fixed something like this before 
distracted me from that important detail, sorry :)


Yeah, it's a shortcoming of the driver - in general we don't support 
arbitrary devices sharing Stream IDs (as arm_smmu_device_group() says), 
but since we hadn't yet seen (nor expected) a non-trivial legacy PCI 
topology with SMMUv3, that case has never really been tested properly 
either.
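
(For reference, roughly what 563b5cbe334e added, reconstructed from memory --
note it only skips duplicate IDs within a single device's fwspec, which is why
two *different* functions sharing an alias ID is another matter entirely:)

    /* Sketch of the per-device duplicate-ID skip in arm-smmu-v3 */
    static void arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec)
    {
        int i, j;

        for (i = 0; i < fwspec->num_ids; ++i) {
            u32 sid = fwspec->ids[i];

            /* Bridged PCI devices may end up with duplicated IDs */
            for (j = 0; j < i; j++)
                if (fwspec->ids[j] == sid)
                    break;
            if (j < i)
                continue;       /* this Stream ID is already handled */

            /* ... look up and write the Stream Table Entry for sid ... */
        }
    }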


Robin.


-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

[2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[2.512095] arm_smmu_install_ste_for_dev sid[0]=520
[2.513524] arm_smmu_write_strtab_ent sid=520 val=1
[2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
[2.516266] arm_smmu_install_ste_for_dev sid[1]=512
[2.517609] arm_smmu_write_strtab_ent sid=512 val=1
[2.518958] arm_smmu_wr

Re: [PATCH v2 9/9] hw/arm/smmuv3: Advertise SMMUv3.2 range invalidation

2020-07-06 Thread Robin Murphy

On 2020-07-02 16:26, Eric Auger wrote:

Expose the RIL bit so that the guest driver uses range
invalidation.


Hmm, this is a v3.2 feature... so strictly, in order to advertise it you 
would need to claim at least v3.1 in SMMU_AIDR and implement all the 
mandatory v3.1 behaviour ;)


Robin.
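
(Purely to illustrate that point -- a sketch of what claiming a later
architecture revision could look like in the QEMU model, assuming SMMU_AIDR
(offset 0x1C, ArchMinorRev in bits [3:0]) were modelled alongside the IDR
registers; not a tested patch:)

    /* hw/arm/smmuv3-internal.h (sketch) */
    REG32(AIDR, 0x1c)
        FIELD(AIDR, ARCH_MINOR_REV, 0, 4)
        FIELD(AIDR, ARCH_MAJOR_REV, 4, 4)

    /* hw/arm/smmuv3.c: smmuv3_init_regs() (sketch) -- claim SMMUv3.1 (and
     * implement its mandatory behaviour) before advertising RIL.
     */
    s->aidr   = FIELD_DP32(s->aidr, AIDR, ARCH_MINOR_REV, 1);
    s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);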


Signed-off-by: Eric Auger 
Reviewed-by: Peter Maydell 
---
  hw/arm/smmuv3-internal.h | 1 +
  hw/arm/smmuv3.c  | 2 ++
  2 files changed, 3 insertions(+)

diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index 5babf72f7d..4e7ec252ed 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -54,6 +54,7 @@ REG32(IDR1,0x4)
  
 REG32(IDR2,0x8)
 REG32(IDR3,0xc)
+FIELD(IDR3, RIL,  10, 1);
 REG32(IDR4,0x10)
 REG32(IDR5,0x14)
  FIELD(IDR5, OAS, 0, 3);
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 89ab11fc36..add4ba4543 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -254,6 +254,8 @@ static void smmuv3_init_regs(SMMUv3State *s)
  s->idr[1] = FIELD_DP32(s->idr[1], IDR1, EVENTQS, SMMU_EVENTQS);
  s->idr[1] = FIELD_DP32(s->idr[1], IDR1, CMDQS,   SMMU_CMDQS);
  
+    s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
+
     /* 4K and 64K granule support */
  s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
  s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);