[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior
On Sat, Aug 12, 2023 at 03:25:10PM +0900, Akihiko Odaki wrote: > > > By the way, crosvm's logic to detach endpoint on removal looks incorrect > > > for > > > me. A domain may have several endpoints attached, but the code looks like > > > it's always destroying a domain whether there are other endpoints attached > > > to the domain. I'm adding Zide Chen, who wrote the code according to git > > > blame, and crosvm-...@chromium.org to CC. > > > > Link to this thread for more context: > > https://lore.kernel.org/virtio-dev/20230803153238.541803-5-jean-phili...@linaro.org/ > > > > I thought crosvm rejected attaching multiple endpoints to one domain but I > > think I misread. Rejecting multiple attach would be a straightforward fix > > (it's allowed by the spec), though it would prevent assigning endpoints > > that cannot be isolated from each others by the hardware (the driver won't > > attach those to different domains, if it's made aware that they should be > > in the same IOMMU group, for example if they are on a conventional PCI > > bus). > > Now we figured out an endpoint should not be detached from a domain without > a request from the driver anyway so the code to detach an endpoint can be > simply removed. Yes, but I think the other detach path, when handling ATTACH or DETACH requests, doesn't support domains with multiple endpoints attached either: // Currently, we only support detaching an endpoint if it is the only endpoint attached // to its domain. But the ATTACH handler seems to accept attaching multiple endpoints to the same domain? Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior
On Fri, Aug 11, 2023 at 07:21:29AM +0900, Akihiko Odaki wrote: > On 2023/08/11 0:10, Jean-Philippe Brucker wrote: > > On Fri, Aug 04, 2023 at 03:19:27PM +0900, Akihiko Odaki wrote: > > > On 2023/08/04 0:32, Jean-Philippe Brucker wrote: > > > > Since it is not obvious what the virtio-iommu device or driver should do > > > > when an endpoint is removed [1], add some guidance in the specification. > > > > Follow the same principle as other (hardware) IOMMU devices: clearing > > > > endpoint ID state on endpoint removal is the responsibility of the > > > > driver, not the IOMMU device. > > > > > > > > On most hardware IOMMU devices, the endpoint ID state is kept in tables > > > > managed by the driver, so intuitively the driver should be updating the > > > > tables on hot-unplug, so that a new endpoint plugged later with the same > > > > ID does not inherit stale translations. Besides, hardware IOMMUs have no > > > > knowledge of endpoint removal. It is less obvious for virtio-iommu > > > > because endpoint states are managed with ATTACH/DETACH requests, and a > > > > virtual platform could in theory update the endpoint state when it is > > > > removed. Existing VMMs like QEMU don't do it, and rightly expect the > > > > driver to manage endpoint ID state like with other IOMMUs. It is not > > > > invalid for a VMM to clean up state on endpoint removal, but a driver > > > > shouldn't count on it. > > > > > > I think it's better to say it's invalid to detach on endpoint removal. > > > > It's certainly not a clear cut choice and I'm still hesitating whether we > > should allow, deny or let the VMM choose what to do. > > > > However, it looks like crosvm already detaches the domain when an endpoint > > is unplugged: > > > > https://github.com/google/crosvm/blob/e8b2cd080e8174948563567d395c2a42416f2807/devices/src/virtio/iommu/sys/unix.rs#L68 > > > > If I understood correctly, this function is called on PCIe hot-unplug, and > > endpoint_map represents the domain attached to an endpoint ID. So the > > domain is detached on endpoint removal. If a DETACH request is sent > > afterwards, it will still succeed (detach_endpoint() in > > devices/src/virtio/iommu.rs returns detached=true, which causes DETACH to > > succeed, even if the endpoint is not in endpoint_map). But I could be > > wrong as I'm not familiar with crosvm or rust. > > > > That does make the decision a little easier, because if we did > > retroactively specify that detaching on removal is invalid, this code > > would become a bug. But in my opinion crosvm isn't doing anything wrong > > here. It feels valid and even good practice to clean up all state > > associated to an endpoint when it is removed, to avoid leaks and stale > > objects. > > In this case I think it's creating use-after-free hazard. The guest believes > it is managing domains and assume a domain is available until it makes a > DETACH request for the last endpoint attached to the domain. > > However, if the host automatically detaches an endpoint from a domain and if > the domain has no other endpoints, the domain will be gone and the domain ID > may be reused for another domain. If the guest makes a request with the > stale domain ID before it becomes aware that the device is unplugged, the > request will be performed on some arbitrary domain and may break things. That's an excellent point, there is a race: VMMGUEST Remove ep 1 detach domain 1 destroy domain 1 ATTACH ep 2 to domain 1 Handle ATTACH create domain 1 Notify that ep 1 is gone Handle notification, DETACH ep 1 The guest tries to attach ep 2 to domain 1, which succeeds. So the guest expects that ep 2 is now able to access the mappings that were on domain 1. But instead it's a new empty domain. I think that's a good justification for adding a device normative statement, something like: The device SHOULD only detach endpoints from domains when requested by the driver; either when handling ATTACH and DETACH requests or when performing a device reset. > > By the way, crosvm's logic to detach endpoint on removal looks incorrect for > me. A domain may have several endpoints attached, but the code looks like > it's always destroying a domain whether there are other endpoints attached > to the domain. I'm adding Zide Chen, who wrote the code according to git > blame, and crosvm-...@chromium.org to CC. Link to this thread for more context: https://lore.kernel.org/vi
[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior
On Fri, Aug 04, 2023 at 03:19:27PM +0900, Akihiko Odaki wrote: > On 2023/08/04 0:32, Jean-Philippe Brucker wrote: > > Since it is not obvious what the virtio-iommu device or driver should do > > when an endpoint is removed [1], add some guidance in the specification. > > Follow the same principle as other (hardware) IOMMU devices: clearing > > endpoint ID state on endpoint removal is the responsibility of the > > driver, not the IOMMU device. > > > > On most hardware IOMMU devices, the endpoint ID state is kept in tables > > managed by the driver, so intuitively the driver should be updating the > > tables on hot-unplug, so that a new endpoint plugged later with the same > > ID does not inherit stale translations. Besides, hardware IOMMUs have no > > knowledge of endpoint removal. It is less obvious for virtio-iommu > > because endpoint states are managed with ATTACH/DETACH requests, and a > > virtual platform could in theory update the endpoint state when it is > > removed. Existing VMMs like QEMU don't do it, and rightly expect the > > driver to manage endpoint ID state like with other IOMMUs. It is not > > invalid for a VMM to clean up state on endpoint removal, but a driver > > shouldn't count on it. > > I think it's better to say it's invalid to detach on endpoint removal. It's certainly not a clear cut choice and I'm still hesitating whether we should allow, deny or let the VMM choose what to do. However, it looks like crosvm already detaches the domain when an endpoint is unplugged: https://github.com/google/crosvm/blob/e8b2cd080e8174948563567d395c2a42416f2807/devices/src/virtio/iommu/sys/unix.rs#L68 If I understood correctly, this function is called on PCIe hot-unplug, and endpoint_map represents the domain attached to an endpoint ID. So the domain is detached on endpoint removal. If a DETACH request is sent afterwards, it will still succeed (detach_endpoint() in devices/src/virtio/iommu.rs returns detached=true, which causes DETACH to succeed, even if the endpoint is not in endpoint_map). But I could be wrong as I'm not familiar with crosvm or rust. That does make the decision a little easier, because if we did retroactively specify that detaching on removal is invalid, this code would become a bug. But in my opinion crosvm isn't doing anything wrong here. It feels valid and even good practice to clean up all state associated to an endpoint when it is removed, to avoid leaks and stale objects. > Otherwise, the DETACH request made by the driver on endpoint removal will > result in an error, which may make the driver emit an spurious warning. In > fact, the Linux driver emits a warning if a DETACH request fails* > > * > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/iommu/virtio-iommu.c?id=809d0810e3520da669d231303608cdf5fe5c1a70 > I agree that this WARN_ON is problematic, we may have to demote it to dev_warn() or remove it entirely. PCIe supports async removal, where a device is physically removed without the user first pressing a button to notify the OS that the device is going away. And I'm currently playing with a PCIe thunderbolt drive where I just yank the device out like any USB device, without first warning the OS (apart from unmounting partitions). If a VMM wanted to emulate that, I guess it would remove the device, possibly detach its IOMMU domain, and notify the OS afterwards. So the driver needs to accept a failed DETACH request more quietly. Thanks, Jean > > > > [1] > > https://lore.kernel.org/all/15bf1b00-3aa0-973a-3a86-3fa5c4d41...@daynix.com/ > > > > Reported-by: Akihiko Odaki > > Signed-off-by: Jean-Philippe Brucker > > --- > > device-types/iommu/description.tex | 5 + > > 1 file changed, 5 insertions(+) > > > > diff --git a/device-types/iommu/description.tex > > b/device-types/iommu/description.tex > > index fb1b840..63ccd41 100644 > > --- a/device-types/iommu/description.tex > > +++ b/device-types/iommu/description.tex > > @@ -407,6 +407,11 @@ \subsubsection{DETACH request} > > After all endpoints have been successfully detached from a domain, it > > ceases to exist and its ID can be reused by the driver for another domain. > > +When an endpoint is removed from the platform (for example > > +unplugged, or disabled with PCIe SR-IOV), the device is not > > +required to detach the endpoint ID from its domain. It is the > > +driver's responsibility to detach the endpoint before removal. > > + > > \drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device > > / Device operations / DETACH request} > > The driver SHOULD set \field{reserved} to zero. - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior
Since it is not obvious what the virtio-iommu device or driver should do when an endpoint is removed [1], add some guidance in the specification. Follow the same principle as other (hardware) IOMMU devices: clearing endpoint ID state on endpoint removal is the responsibility of the driver, not the IOMMU device. On most hardware IOMMU devices, the endpoint ID state is kept in tables managed by the driver, so intuitively the driver should be updating the tables on hot-unplug, so that a new endpoint plugged later with the same ID does not inherit stale translations. Besides, hardware IOMMUs have no knowledge of endpoint removal. It is less obvious for virtio-iommu because endpoint states are managed with ATTACH/DETACH requests, and a virtual platform could in theory update the endpoint state when it is removed. Existing VMMs like QEMU don't do it, and rightly expect the driver to manage endpoint ID state like with other IOMMUs. It is not invalid for a VMM to clean up state on endpoint removal, but a driver shouldn't count on it. [1] https://lore.kernel.org/all/15bf1b00-3aa0-973a-3a86-3fa5c4d41...@daynix.com/ Reported-by: Akihiko Odaki Signed-off-by: Jean-Philippe Brucker --- device-types/iommu/description.tex | 5 + 1 file changed, 5 insertions(+) diff --git a/device-types/iommu/description.tex b/device-types/iommu/description.tex index fb1b840..63ccd41 100644 --- a/device-types/iommu/description.tex +++ b/device-types/iommu/description.tex @@ -407,6 +407,11 @@ \subsubsection{DETACH request} After all endpoints have been successfully detached from a domain, it ceases to exist and its ID can be reused by the driver for another domain. +When an endpoint is removed from the platform (for example +unplugged, or disabled with PCIe SR-IOV), the device is not +required to detach the endpoint ID from its domain. It is the +driver's responsibility to detach the endpoint before removal. + \drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / Device operations / DETACH request} The driver SHOULD set \field{reserved} to zero. -- 2.41.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 3/4] virtio-iommu: Fix contradictory and unnecessary statements
In the device-normative statement for the ATTACH request, we say: If another endpoint is already attached to the domain identified by \field{domain}, then the device MAY attach the endpoint identified by \field{endpoint} to the domain. and later: If properties of the endpoint (obtained with a PROBE request) are compatible with properties of other endpoints already attached to the requested domain, then the device SHOULD attach the endpoint. These statements contradict each others. Attaching multiple endpoints to one domain is a convenience that allows using fewer page table resources. The device is always allowed to reject a second attach request, in which case the driver simply creates a separate domain for the second endpoint. Remove the "SHOULD" statement and keep the "MAY" one. Also remove the following statement which doesn't add anything useful. Signed-off-by: Jean-Philippe Brucker --- device-types/iommu/description.tex | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/device-types/iommu/description.tex b/device-types/iommu/description.tex index c9b43db..fb1b840 100644 --- a/device-types/iommu/description.tex +++ b/device-types/iommu/description.tex @@ -383,12 +383,9 @@ \subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device op VIRTIO_IOMMU_S_UNSUPP. If properties of the endpoint (obtained with a PROBE request) are -compatible with properties of other endpoints already attached to the -requested domain, then the device SHOULD attach the endpoint. Otherwise -the device SHOULD reject the request and set \field{status} to -VIRTIO_IOMMU_S_UNSUPP. - -A device that does not reject the request MUST attach the endpoint. +not compatible with properties of other endpoints already +attached to the domain, the device SHOULD reject the request and +set \field{status} to VIRTIO_IOMMU_S_UNSUPP. \subsubsection{DETACH request} -- 2.41.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 1/4] virtio-iommu: Fix typo in label
Use the 'sec:' prefix like all other labels Signed-off-by: Jean-Philippe Brucker --- device-types/iommu/description.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/device-types/iommu/description.tex b/device-types/iommu/description.tex index f8cbe88..d13a880 100644 --- a/device-types/iommu/description.tex +++ b/device-types/iommu/description.tex @@ -796,7 +796,7 @@ \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions to affect any other component than the endpoint and the driver. -\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting} +\subsubsection{Fault reporting}\label{sec:Device Types / IOMMU Device / Device operations / Fault reporting} The device can report translation faults and other significant asynchronous events on the event virtqueue. The driver initially populates -- 2.41.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 0/4] virtio-iommu: Minor specification fixes
Patches 1 and 2 fix typos in the IOMMU device type. Patch 3 removes a contradictory statement, and a useless one. Patch 4 adds some information about removal of endpoints managed by the IOMMU, following a recent fix in the Linux driver. Jean-Philippe Brucker (4): virtio-iommu: Fix typo in label virtio-iommu: Fix RESV_MEM typo virtio-iommu: Fix contradictory and unnecessary statements virtio-iommu: Clarify hot-unplug behavior device-types/iommu/description.tex | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) -- 2.41.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 2/4] virtio-iommu: Fix RESV_MEM typo
Remove an extra word in the RESV_MEM description. Signed-off-by: Jean-Philippe Brucker --- device-types/iommu/description.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/device-types/iommu/description.tex b/device-types/iommu/description.tex index d13a880..c9b43db 100644 --- a/device-types/iommu/description.tex +++ b/device-types/iommu/description.tex @@ -758,8 +758,8 @@ \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device \begin{description} \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)] These virtual addresses cannot be used in a MAP requests. The region -is be reserved by the device, for example, if the platform needs to -setup DMA mappings of its own. +is reserved by the device, for example if the platform needs +to setup DMA mappings of its own. \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)] This region is a doorbell for Message Signaled Interrupts (MSIs). It -- 2.41.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: virtio-iommu hotplug issue
On Thu, Apr 13, 2023 at 08:01:54PM +0900, Akihiko Odaki wrote: > Yes, that's right. The guest can dynamically create and delete VFs. The > device is emulated by QEMU: igb, an Intel NIC recently added to QEMU and > projected to be released as part of QEMU 8.0. Ah great, that's really useful, I'll add it to my tests > > Yes, I think this is an issue in the virtio-iommu driver, which should be > > sending a DETACH request when the VF is disabled, likely from > > viommu_release_device(). I'll work on a fix unless you would like to do it > > It will be nice if you prepare a fix. I will test your patch with my > workload if you share it with me. I sent a fix: https://lore.kernel.org/linux-iommu/20230414150744.562456-1-jean-phili...@linaro.org/ Thank you for reporting this, it must have been annoying to debug Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: virtio-iommu hotplug issue
Hello, On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote: > Hi, > > Recently I encountered a problem with the combination of Linux's > virtio-iommu driver and QEMU when a SR-IOV virtual function gets disabled. > I'd like to ask you what kind of solution is appropriate here and implement > the solution if possible. > > A PCIe device implementing the SR-IOV specification exports a virtual > function, and the guest can enable or disable it at runtime by writing to a > configuration register. This effectively looks like a PCI device is > hotplugged for the guest. Just so I understand this better: the guest gets a whole PCIe device PF that implements SR-IOV, and so the guest can dynamically create VFs? Out of curiosity, is that a hardware device assigned to the guest with VFIO, or a device emulated by QEMU? > In such a case, the kernel assumes the endpoint is > detached from the virtio-iommu domain, but QEMU actually does not detach it. > > This inconsistent view of the removed device sometimes prevents the VM from > correctly performing the following procedure, for example: > 1. Enable a VF. > 2. Disable the VF. > 3. Open a vfio container. > 4. Open the group which the PF belongs to. > 5. Add the group to the vfio container. > 6. Map some memory region. > 7. Close the group. > 8. Close the vfio container. > 9. Repeat 3-8 > > When the VF gets disabled, the kernel assumes the endpoint is detached from > the IOMMU domain, but QEMU actually doesn't detach it. Later, the domain > will be reused in step 3-8. > > In step 7, the PF will be detached, and the kernel thinks there is no > endpoint attached and the mapping the domain holds is cleared, but the VF > endpoint is still attached and the mapping is kept intact. > > In step 9, the same domain will be reused again, and the kernel requests to > create a new mapping, but it will conflict with the existing mapping and > result in -EINVAL. > > This problem can be fixed by either of: > - requesting the detachment of the endpoint from the guest when the PCI > device is unplugged (the VF is disabled) Yes, I think this is an issue in the virtio-iommu driver, which should be sending a DETACH request when the VF is disabled, likely from viommu_release_device(). I'll work on a fix unless you would like to do it > - detecting that the PCI device is gone and automatically detach it on > QEMU-side. > > It is not completely clear for me which solution is more appropriate as the > virtio-iommu specification is written in a way independent of the endpoint > mechanism and does not say what should be done when a PCI device is > unplugged. Yes, I'm not sure it's in scope for the specification, it's more about software guidance Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [virtio-comment] Re: [virtio] [PATCH] virtio: clarify feature negotiation
On Mon, Jan 24, 2022 at 04:42:35PM -0500, Michael S. Tsirkin wrote: > So from that point of view, we are actually relaxing the requirements, > and yes we'll need to carefully go over each instance of > "offered". > I thought I did, but now I did a quick grep again and I found instances > in virtio-iommu.tex and virtio-gpio.tex > Both of these are new, so I think we can just fix them to "negotiated". > > Jean-Philippe, Viresh, what do you think? Would it be a problem to > replace offered with negotiated, and restrict iommu and gpio drivers to > access config space after FEATURES_OK? No problem for iommu. The config space fields domain_range and input_range are not useful to the driver before DRIVER_OK. Reading field bypass earlier could be useful for debugging, but in that case going a few more steps into device initialization wouldn't hurt. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability
On Wed, Jan 19, 2022 at 06:53:17PM -0500, Michael S. Tsirkin wrote: > > I think we should be explicit about this particular case because someone > > implementing this extension might get it wrong. MSIs should not have a > > PASID because IOMMUs don't support it: > > > > * VT-d treats Requests-with-PASID to the special range 0xfeex as > > normal DMA (3.14 Handling Requests to Interrupt Address Range) > > * AMD IOMMU reports an error for MSIs with PASID (2.2.7.7 PCIe® TLP PASID > > Prefix). > > Ouch. I didn't realize. Really weird, of the "what were they thinking" > variety. The natural thing would be to have PASID==VM and scope > both DMA and MSI within PASID. I guess that is precluded on these > platforms then. > I think the problem is mainly that MSIs appear as normal memory writes requests. If PCI had given them a special type, IOMMU/IRQ remapping hardware could route them more easily, but as is they only have the transaction's address to work with. So vendors made different choices for supporting MSIs with their IRQ remapping hardware. Intel and AMD define reserved address ranges 0xfeex, where any memory transaction is treated as MSI. This is fine when the kernel is in charge of allocating the I/O address space for DMA, it just needs to make sure IOVAs are not allocated within the address region reserved for MSIs. But it's more complicated with SVA. One of the use of PASID is assigning a partition of device (here a group of virtqueues) to a process. With this approach (SVA) the PASID accesses the process' page tables, so the virtual address space is shared between CPU and IOMMU, IOVA==VA. How would we deal with MSIs in this case, if they had a PASID? On x86, if the IOMMU performed IRQ remapping instead of address translation, on a memory write with PASID to address 0xfeex, the process couldn't use any address in that range for DMA. The process would need to filter addresses returned by malloc() and treat some of them as non-DMA'able, which defeats the purpose of SVA. To avoid having to do this Intel treats all DMA with PASID as normal memory access, and the process can't accidentally trigger MSIs by using the wrong address. I think AMD behaves the same way but am not entirely sure - the wording in one part of the spec (2.2.7.7) seems to imply that PASID access to the MSI range would trigger an IOMMU error report, in which case the range would need carving out anyway. > > * Arm requires creating mappings to the MSI controller in the SMMU. This > > has implications for SVA where the PASID accesses the whole process > > address space: if MSI transactions have a PASID prefix, that requires > > mapping the MSI controller into the process address space on Arm, which > > isn't convenient. > > I'd like to understand this part a bit better. Can you explain? > We are talking about the IOMMU address space right? > What is wrong with mapping the MSI controller there? > Isn't convenient in what sense? On Arm the IRQ remapping device is separate from the IOMMU, so there is no special address range as far as the IOMMU is concerned. An MSI goes through IOMMU address translation like any other memory write, the IOVA gets translated into a PA that corresponds to the doorbell (MMIO) address of the IRQ remapping device. The IOVA of the MSI is chosen by the OS and can have any value. Now if MSIs had a PASID, then the IOMMU would translate MSI addresses with the page tables corresponding to that PASID, which with SVA correspond to the process page tables. So the kernel would need to create a mapping inside the process' address space, to the PA of the IRQ remapping device. That's the part that is inconvenient (would not be a security issue because accesses to the doorbell from CPUs are ignored according to Arm's platform spec, but it would certainly be awkward to set up). Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability
On Mon, Jan 17, 2022 at 01:57:47PM +0800, Jason Wang wrote: > How about something like: > > " > Except for the MSI transactions, the device MUST use PASID TLP prefix > for the following memory transactions initiated by the virtqueue that > belong to a virtqueue group if a valid PASID is assigned and PASID is > enabled in the PASID extended capability: > > \begin{itemize} > \item Memory Requests (including AtomicOp Requests) with Untranslated > Addresses > \item Address Translation Requests > \item Page Request Messages > \end{itemize} > > The device MUST NOT use PASID TLP prefix for the MSI memory > transactions. > Sure, that looks good. Maybe splitting it would make it easier to explain. If we force the driver to enable the PASID cap before using PASID: "The driver MUST NOT assign a PASID to a virtqueue group if the PASID extended capability is not enabled." Then the device description is not concerned about the PASID cap: "If a valid PASID is assigned to a virtqueue group, the device MUST use a PASID TLP prefix for memory transactions of the following types initiated by any virtqueue in the group: ... The device MUST NOT use a PASID TLP prefix for MSI memory transactions. " Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion
On Fri, Jan 14, 2022 at 02:28:11PM +, Alex Bennée wrote: > > Stefan Hajnoczi writes: > > > [[PGP Signed Part:Undecided]] > > On Thu, Jan 06, 2022 at 05:03:38PM +, Alex Bennée wrote: > >> > >> Hi, > >> > >> To start the new year I thought would dump some of my thoughts on > >> zero-copy between VM domains. For project Stratos we've gamely avoided > >> thinking too hard about this while we've been concentrating on solving > >> more tractable problems. However we can't put it off forever so lets > >> work through the problem. > >> > >> Memory Sharing > > >> Buffer Allocation > > >> Transparent fallback and scaling > > >> > >> - what other constraints we need to take into account? > >> - can we leverage existing sub-systems to build this support? > >> > >> I look forward to your thoughts ;-) > > > > (Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/) > > is an interesting IOMMU feature. It would be neat to have a CPU > > equivalent where loads and stores from/to another address space could be > > done cheaply with CPU support. I don't think this is possible today and > > that's why software IOMMUs are slow for per-transaction page protection. > > In other words, a virtio-net TX VM would set up a page table allowing > > read access only to the TX buffers and vring and the virtual network > > switch VM would have the capability to access the vring and buffers > > through the TX VM's dedicated address space.) > > Does binding a device to an address space mean the driver allocations > will be automatically done from the address space or do the drivers need > modifying to take advantage of that? Jean-Phillipe? The drivers do need modifying and the APIs are very different: with SVA you're assigning partitions of devices (for example virtqueues) to different applications. So the kernel driver only does management and userspace takes care of the data path. A device partition accesses the whole address space of the process it is assigned to, so there is no explicit DMA allocation. Note that it also requires special features from the device (multiple address spaces with PCIe PASID, recoverable DMA page faults with PRI). And the concept doesn't necessarily fit nicely with all device classes - you probably don't want to load a full network stack in any application that needs to send a couple of packets. One that demonstrates the concept well in my opinion is the crypto/compression accelerators that use SVA in Linux [1]: any application that needs fast compression or encryption on some of its memory can open a queue in the accelerator, submit jobs with pointers to input and output buffers and wait for completion while doing something else. It is several orders of magnitude faster than letting the CPU do this work and only the core mm deals with memory management. Anyway this seems mostly off-topic. As Stefan said what would be useful for our problem is help from the CPUs to share bits of address space without disclosing the whole VM memory. At the moment any vIOMMU mapping needs to be shadowed by the hypervisor somehow. Unless we use static shared memory regions, frontend VM needs to tell the hypervisor which pages it chooses to share with backend VM, and the hypervisor to map then unmap that into backend VM's stage-2. I think that operation requires two context switches to the hypervisor any way we look at it. Thanks, Jean [1] https://lore.kernel.org/linux-iommu/1579097568-17542-1-git-send-email-zhangfei@linaro.org/ > > > Some storage and networking applications use buffered I/O where the > > guest kernel owns the DMA buffer while others use zero-copy I/O where > > guest userspace pages are pinned for DMA. I think both cases need to be > > considered. > > > > Are guest userspace-visible API changes allowed (e.g. making the > > userspace application aware at buffer allocation time)? > > I assume you mean enhanced rather than breaking APIs here? I don't see > why not. Certainly for the vhost-user backends we are writing we aren't > beholden to sticking to an old API. > > > Ideally the > > requirement would be that zero-copy must work for unmodified > > applications, but I'm not sure if that's realistic. > > > > By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are > > memory regions (e.g. PCI BAR ranges) that the guest driver can map. If > > the host pages must come from a special pool then Shared Memory Regions > > would be one way to configure the guest to use this special zero-copy > > area for data buffers and even vrings. New VIRTIO feature bits and > > transport-specific information may need to be added to do this. > > Are these fixed sizes or could be accommodate a growing/shrinking region? > > Thanks for the pointers. > > -- > Alex Bennée - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail:
Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability
On Fri, Jan 14, 2022 at 11:15:35AM +0800, Jason Wang wrote: > > 在 2022/1/13 下午11:17, Michael S. Tsirkin 写道: > > On Thu, Jan 13, 2022 at 02:53:36PM +, Stefan Hajnoczi wrote: > > > On Thu, Jan 13, 2022 at 05:45:11AM -0500, Michael S. Tsirkin wrote: > > > > On Thu, Jan 13, 2022 at 10:32:11AM +, Stefan Hajnoczi wrote: > > > > > > So my understanding is that instead of delaying the response to read > > > > > > it's better to simply present the previous PASID value if there're > > > > > > any > > > > > > pending transactions of previous PASID value. The driver can then > > > > > > choose to poll or using timers etc. > > > > Don't we limit these changes to before DRIVER_OK? > > > > If we do then no previous transactions can exist. > > > Not sure if that's possible for the vDPA virtqueue assignment use case > > > where one large VIRTIO device provides virtqueues for many smaller vDPA > > > devices? > > > That could be the case but usually a smaller vDPA device requires a bunch of > hardware resources more than just virtqueues. > > Another example is assign a queue directly to userspace with PASID for SVA. > > > > > > > > Stefan > > Then I guess queue must be stopped? > > > Yes, since we have virtqueue reset, I can limit the assigning before > DRIVER_OK or queue(s) are reset. Does this work? Yes, I was wondering how this would work for the SVA use-case, where processes can request queues from the virtio driver at runtime: * Process opens a fd to the queue * virtio driver selects a disabled queue, gets a PASID for that task and assigns it to the queue's group * process mmaps its queue and works with it. When it is done, it closes the fd. * virtio driver now makes sure that the virtqueue does not issue any more DMA with this PASID, it disables all the queues in the group by writing 1 to their queue_reset. After reset completes, it clears the group_pasid field. * The driver releases the PASID (iommu_sva_unbind_device()). Both PASID and queue can now safely be assigned to other processes. So making group_pasid field writeable only if all queues in the group are disabled should work Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability
Hi Jason, On Thu, Jan 13, 2022 at 09:24:16AM +0800, Jason Wang wrote: > The device MUST use PASID TLP prefix for all memory transactions > initiated by the virtqueue that belong to a virtqueue group ... > > > vring reads/writes and data buffer reads/writes come > > to mind. Virtqueue MSI-X messages are probably not included? Anything > > else? > > According to the PCIe spec and the semantic, the PASID TLP prefix > should be used for > > Memory Requests (including AtomicOp Requests) with Untranslated Addresses. > Address Translation Requests, ATS Invalidation Messages, Page Request > Messages, and PRG Response Messages > > So we probably don't need to differentiate MSI-X messages with DMA > here. That's why I think a general "memory transaction" should be > sufficient here. If you don't agree, I can clarify it as what PCIe > spec did. I think we should be explicit about this particular case because someone implementing this extension might get it wrong. MSIs should not have a PASID because IOMMUs don't support it: * VT-d treats Requests-with-PASID to the special range 0xfeex as normal DMA (3.14 Handling Requests to Interrupt Address Range) * AMD IOMMU reports an error for MSIs with PASID (2.2.7.7 PCIe® TLP PASID Prefix). * Arm requires creating mappings to the MSI controller in the SMMU. This has implications for SVA where the PASID accesses the whole process address space: if MSI transactions have a PASID prefix, that requires mapping the MSI controller into the process address space on Arm, which isn't convenient. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH v3] virtio-iommu: Rework the bypass feature
Hi Halil, On Thu, Nov 11, 2021 at 11:26:36AM +0100, Halil Pasic wrote: > On Fri, 22 Oct 2021 13:12:20 +0100 > Jean-Philippe Brucker wrote: > > > The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete. > > Although it is implemented by QEMU, it is not supported by any driver as > > far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG > > feature. > > Hi Jean-Philippe! > > Is somebody working on the implementation of this? I would like to be > Cc-ed with the implementation patches if possible. Yes I posted the Linux support: https://lore.kernel.org/linux-iommu/20211013121052.518113-1-jean-phili...@linaro.org/ And QEMU, which depends on the Linux header: https://lore.kernel.org/qemu-devel/20210930185050.262759-1-jean-phili...@linaro.org/ I plan to resend after the merge window and will Cc you. > The interaction of s390x and IOMMU is sometimes a little tricky. I would > like to make sure nothing breaks for us. :) Bypass should be constrained within the virtio-iommu device and driver, and so shouldn't break s390 as far as I know. I haven't thought much about supporting virtio-iommu for archs other than arm and x86 for the moment (in general it should be as simple as adding it to ACPI tables or DT). I believe power already has a PV IOMMU interface and doesn't need a dedicated vIOMMU device, but I'm not sure where s390 stands Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3] virtio-iommu: Rework the bypass feature
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete. Although it is implemented by QEMU, it is not supported by any driver as far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG feature. Two features are missing from virtio-iommu: * The ability for an hypervisor to start the device in bypass mode. The wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at the moment, because it only specifies the behavior after feature negotiation. * The ability for a guest to set individual endpoints in bypass mode when bypass is globally disabled. An OS should have the ability to allow only endpoints it trusts to bypass the IOMMU, while keeping DMA disabled for endpoints it isn't even aware of. At the moment this can only be emulated by creating identity mappings. The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field that allows to enable and disable bypass globally. It also adds a new flag for the ATTACH request. * The hypervisor can start the VM with bypass enabled or, if it knows that the software stack supports it, disabled. The 'bypass' config fields is initialized to 0 or 1. It is sticky and isn't affected by device reset. * Generally the firmware won't have an IOMMU driver and will need to be started in bypass mode, so the bootloader and kernel can be loaded from storage endpoint. For more security, the firmware could implement a minimal virtio-iommu driver that reuses existing virtio support and only touches the config space. It could enable PCI bus mastering in bridges only for the endpoints that need it, enable global IOMMU bypass by flipping a bit, then tear everything down before handing control over to the OS. This prevents vulnerability windows where a malicious endpoint reprograms the IOMMU while the OS is configuring it [1]. The isolation provided by vIOMMUs has mainly been used for securely assigning endpoints to untrusted applications so far, while kernel DMA bypasses the IOMMU. But we can expect boot security to become as important in virtualization as it presently is on bare-metal systems, where some devices are untrusted and must never be able to access memory that wasn't assigned to them. * The OS can enable and disable bypass globally. It can then enable bypass for individual endpoints by attaching them to bypass domains, using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass by attaching them to normal domains. [1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept Morgan, B., Alata, É., Nicomette, V. et al. https://link.springer.com/article/10.1186/s13173-017-0066-7 Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119 Signed-off-by: Jean-Philippe Brucker --- The virtio-iommu spec with colored diff is available at https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v3-diff.pdf v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07817.html v2: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07898.html v3: * Normative statement about device reset vs. system reset - the bypass bit is sticky across device reset to avoid the vulnerability described above, but restored on system reset (a term also used by the virtio memory device). * Explain that the field bypass is in effect as long as the new feature is offered, even when not accepted by the driver * Another clarification about the state of an endpoint after detach. --- conformance.tex | 1 - virtio-iommu.tex | 111 ++- 2 files changed, 90 insertions(+), 22 deletions(-) diff --git a/conformance.tex b/conformance.tex index c2d7959..5c10ab9 100644 --- a/conformance.tex +++ b/conformance.tex @@ -504,7 +504,6 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets} \begin{itemize} \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits} \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout} -\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request} diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..f8cbe88 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -59,14 +59,21 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} VIRTIO_IOMMU_F_MAP_UNMAP is supported.} \item[VIRTIO_IOMMU_F_BYPASS (3)] - When not attached to a domain, endpoints downstream of the IOMMU - can access the guest-physical address space. + Endpoints that are not attached to a domain are in bypass mode. \item[VIRTIO_IOMMU_F_PROBE (4)] The PROBE request is available. \item
Re: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature
On Thu, Oct 14, 2021 at 02:05:30AM +, Tian, Kevin wrote: > > > > For more security, the firmware could implement a minimal virtio- > > iommu > > > > driver that reuses existing virtio support and only touches the config > > > > space. It could enable PCI bus mastering in bridges only for the > > > > endpoints that need it, enable global IOMMU bypass by flipping a bit, > > > > > > then untrusted devices can do DMA after global bypass is enabled? > > > > Not necessarily, I think firmware could configure PCI switch ports > > upstream of untrusted devices, by keeping Bus Master Enable disabled in > > their config space which will block Memory and I/O Requests. > > guest firmware only manages virtual switch ports. Physically they are > supposed to be already enabled in the process of device passthrough. Right I was considering virtual devices and virtual switches. For physical devices, the switch emulation would need to disable the IOMMU mappings when bus mastering is disabled in the switch. In either case, though, it's probably more complexity than hypervisors are willing to put in. QEMU seems to ignore Bus Master Enable on root ports, for example. > Given a device is untrusted, it may ignore Bus Master Enable bit > inadvertently or deliberately to issue DMA. I feel concept-wise it's > not good for the firmware to open this window when boot-bypass is > disabled... > > > > > > suppose > > > here just needs attach storage device to a normal domain, leaving other > > > devices still blocked from doing DMA > > > > Yes that's the safest route, but also more complicated to implement. > > If firmware can setup the PCI topology as above, its virtio-iommu driver > > just needs to poke the config space for feature negotiation and enabling > > bypass, no need to setup virtqueues. > > I see the value of this simplified approach. But if it cannot isolate > untrusted devices then we may have to follow the complicated route Agreed Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature
On Wed, Oct 13, 2021 at 08:37:51AM +, Tian, Kevin wrote: > > * The ability for an hypervisor to start the device in bypass mode. The > > wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at > > the moment. > > unclear because it's worded around driver negotiation thus the behavior > is undefined before the driver is loaded? Yes exactly. The spec doesn't spell out what the behavior is before feature negotiation, even though the straightforward interpretation is that DMA doesn't bypass (which prevents an hypervisor from implementing boot-bypass at the moment). > > > > * The ability for a guest to set individual endpoints in bypass mode > > when bypass is globally disabled. An OS should have the ability to > > allow only endpoints it trusts to bypass the IOMMU, while keeping DMA > > disabled for endpoints it isn't even aware of. At the moment this can > > only be emulated by creating identity mappings. > > Is driver required to first disable global bypass before it can attach > endpoints to normal domains? No > > > > > The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field > > that allows to enable and disable bypass globally. It also adds a new > > flag for the ATTACH request. > > > > * The hypervisor can start the VM with bypass enabled or, if it knows > > that the software stack supports it, disabled. The 'bypass' config > > fields resets to 0 or 1. > > is sticky and unchanged by reset > > > > > * Generally the firmware won't have an IOMMU driver and will need to be > > started in bypass mode, so the bootloader and kernel can be loaded > > from storage endpoint. > > How does the firmware know whether storage load is allowed w/o looking > at the bypass info in virtio-iommu? Either the firmware can't drive virtio-iommu, and must be booted in bypass, or it can and then it sets/clears bypass. That's why we're going to default qemu to boot-bypass [1], because in general the firmware won't support anything else. An admin that has more knowledge in the software stack, and knows that the firmware is able to drive the IOMMU, can then disable boot-bypass for more security. [1] https://lore.kernel.org/qemu-devel/20210930185050.262759-1-jean-phili...@linaro.org/ > > > > > For more security, the firmware could implement a minimal virtio-iommu > > driver that reuses existing virtio support and only touches the config > > space. It could enable PCI bus mastering in bridges only for the > > endpoints that need it, enable global IOMMU bypass by flipping a bit, > > then untrusted devices can do DMA after global bypass is enabled? Not necessarily, I think firmware could configure PCI switch ports upstream of untrusted devices, by keeping Bus Master Enable disabled in their config space which will block Memory and I/O Requests. > suppose > here just needs attach storage device to a normal domain, leaving other > devices still blocked from doing DMA Yes that's the safest route, but also more complicated to implement. If firmware can setup the PCI topology as above, its virtio-iommu driver just needs to poke the config space for feature negotiation and enabling bypass, no need to setup virtqueues. > > > then tear everything down before handing control over to the OS. This > > prevents vulnerability windows where a malicious endpoint reprograms > > the IOMMU while the OS is configuring it [1]. [...] > > +An endpoint is in bypass mode if: > > +\begin{itemize} > > + \item the VIRTIO_IOMMU_F_BYPASS_CONFIG feature is offered and: > > +\begin{itemize} > > + \item config field \field{bypass} is 1 and the endpoint is not > > +attached to a domain, or > > + \item the endpoint is attached to a domain with > > +VIRTIO_IOMMU_ATTACH_F_BYPASS. > > should we mention that BYPASS_CONFIG effect doesn't depend on > negotiation? That's the distinction between "offered" vs "negotiated", but I can say it explicitly [...] > > Detach an endpoint from a domain. When this request completes, the > > -endpoint cannot access any mapping from that domain anymore. If feature > > -VIRTIO_IOMMU_F_BYPASS has been negotiated, then once this request > > -completes all accesses from the endpoint are allowed and translated by the > > -IOMMU using the identity function. > > +endpoint cannot access any mapping from that domain anymore. > > Related to my earlier question, what about global bypass is enabled? If > domain attach is allowed in that configuration, then above statement > is not accurate. The statement only refers to domain mappings, so when global bypass is enabled, after detach the endpoint can access everything. Eric also commented on this, I could add: However the endpoint may then be in bypass mode and access the guest-physical address space. So the reader can refer to the definition of bypass mode and see what happens to unattached endpoints. Thanks, Jean
[virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete. Although it is implemented by QEMU, it is not supported by any driver as far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG feature. Two features are missing from virtio-iommu: * The ability for an hypervisor to start the device in bypass mode. The wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at the moment. * The ability for a guest to set individual endpoints in bypass mode when bypass is globally disabled. An OS should have the ability to allow only endpoints it trusts to bypass the IOMMU, while keeping DMA disabled for endpoints it isn't even aware of. At the moment this can only be emulated by creating identity mappings. The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field that allows to enable and disable bypass globally. It also adds a new flag for the ATTACH request. * The hypervisor can start the VM with bypass enabled or, if it knows that the software stack supports it, disabled. The 'bypass' config fields resets to 0 or 1. * Generally the firmware won't have an IOMMU driver and will need to be started in bypass mode, so the bootloader and kernel can be loaded from storage endpoint. For more security, the firmware could implement a minimal virtio-iommu driver that reuses existing virtio support and only touches the config space. It could enable PCI bus mastering in bridges only for the endpoints that need it, enable global IOMMU bypass by flipping a bit, then tear everything down before handing control over to the OS. This prevents vulnerability windows where a malicious endpoint reprograms the IOMMU while the OS is configuring it [1]. The isolation provided by vIOMMUs has mainly been used for securely assigning endpoints to untrusted applications so far, while kernel DMA bypasses the IOMMU. But we can expect boot security to become as important in virtualization as it presently is on bare-metal systems, where some devices are untrusted and must never be able to access memory that wasn't assigned to them. * The OS can enable and disable bypass globally. It can then enable bypass for individual endpoints by attaching them to bypass domains, using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass by attaching them to normal domains. [1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept Morgan, B., Alata, É., Nicomette, V. et al. https://link.springer.com/article/10.1186/s13173-017-0066-7 Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119 Signed-off-by: Jean-Philippe Brucker --- The virtio-iommu spec with colored diff is available at https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v2-diff.pdf v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07817.html v2: * Keep description of VIRTIO_IOMMU_F_BYPASS, moved to Device operations and merged with VIRTIO_IOMMU_F_BYPASS_CONFIG. Added normative statements about which feature devices should offer. * Describe initial and reset bypass value with normative statements. * Addressed the other comments. --- conformance.tex | 1 - virtio-iommu.tex | 102 +-- 2 files changed, 82 insertions(+), 21 deletions(-) diff --git a/conformance.tex b/conformance.tex index c52f1a4..7f70fc1 100644 --- a/conformance.tex +++ b/conformance.tex @@ -503,7 +503,6 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets} \begin{itemize} \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits} \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout} -\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request} \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request} diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..fcda138 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -59,14 +59,21 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} VIRTIO_IOMMU_F_MAP_UNMAP is supported.} \item[VIRTIO_IOMMU_F_BYPASS (3)] - When not attached to a domain, endpoints downstream of the IOMMU - can access the guest-physical address space. + Endpoints that are not attached to a domain are in bypass mode. \item[VIRTIO_IOMMU_F_PROBE (4)] The PROBE request is available. \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)] + Field \field{bypass} of struct virtio_iommu_config determines + whether endpoints that are not attached to a domain are in + bypass mode. Flag VIRTIO_IOMMU_ATTACH_F_BYPASS determines + whether endpoints
Re: [virtio-dev] [PATCH] virtio-iommu: Rework the bypass feature
Hi Eric, On Wed, Oct 06, 2021 at 02:53:50PM +0200, Eric Auger wrote: > > * The hypervisor can start the VM with bypass enabled or, if it knows > > that the software stack supports it, disabled. The 'bypass' config > > fields resets to 0 or 1. > I don't see the reset value documented in the actual spec while it looks > quite important for our boot bypass functionality. Or did I miss it? No you're right, I'll add this. It's important to say that the device may initialize the field to 1, but also that a reset (writing a 0 to device status) doesn't re-initialize the field to its boot value (because that may introduce a vulnerability during the FW->OS transition) If the device offers the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, it MAY initialize the \field{bypass} field to 1. When the driver resets the device, the \field{bypass} field SHOULD NOT change. [...] > > diff --git a/virtio-iommu.tex b/virtio-iommu.tex > > index efa735b..a2c90ea 100644 > > --- a/virtio-iommu.tex > > +++ b/virtio-iommu.tex > > @@ -59,14 +59,19 @@ \subsection{Feature bits}\label{sec:Device Types / > > IOMMU Device / Feature bits} > >VIRTIO_IOMMU_F_MAP_UNMAP is supported.} > > > > \item[VIRTIO_IOMMU_F_BYPASS (3)] > > - When not attached to a domain, endpoints downstream of the IOMMU > > - can access the guest-physical address space. > > + This feature is deprecated. > Why are you obliged to deprecate the feature. We could leave the feature, but it adds complexity in the spec, and in drivers if they have to support both. Maybe we can leave the feature, merge its description with F_BYPASS_CONFIG and state something like: The VIRTIO_IOMMU_F_BYPASS_CONFIG feature supersedes VIRTIO_IOMMU_F_BYPASS. New devices SHOULD NOT offer VIRTIO_IOMMU_F_BYPASS. > Can't you leave it and > impose that either > > VIRTIO_IOMMU_F_BYPASS or VIRTIO_IOMMU_F_BYPASS_CONFIG gets set? Just to be clear, you mean devices must not implement both features at the same time, right? Not that devices must implement at least one of the feature? (I don't think we can do that) > > > > > \item[VIRTIO_IOMMU_F_PROBE (4)] > >The PROBE request is available. > > > > \item[VIRTIO_IOMMU_F_MMIO (5)] > >The VIRTIO_IOMMU_MAP_F_MMIO flag is available. > > + > > +\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)] > > + The global bypass state is described in \field{bypass} of > > + struct virtio_iommu_config. The domain bypass state is > > + described by VIRTIO_IOMMU_ATTACH_F_BYPASS. > I would re-emphasize the global bypass state relates to bypass state of > unattached devices whereas domain bypass state controls bypass for > devices attached to a given domain. Right, though I think I'll drop "bypass state" and describe everything using "endpoint in bypass mode", so we can keep a single definition in one place. [...] > > \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / > > Device initialization} > > > > When the device is reset, endpoints are not attached to any domain. > > > > -If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from > > -unattached endpoints are allowed and translated by the IOMMU using the > > -identity function. If the feature is not negotiated, any memory access > > -from an unattached endpoint fails. Upon attaching an endpoint in > > -bypass mode to a new domain, any memory access from the endpoint fails, > > -since the domain does not contain any mapping. > > +An endpoint is in bypass mode if the VIRTIO_IOMMU_F_BYPASS_CONFIG > > +feature is offered and: > > +\begin{itemize} > > + \item config field \field{bypass} is 1 and the endpoint is not > > +attached to a domain, or > > + \item the endpoint is attached to a domain with > > +VIRTIO_IOMMU_ATTACH_F_BYPASS. > > +\end{itemize} > > + > > +All accesses from an endpoint in bypass mode are allowed and > > +translated by the IOMMU using the identity function. I think I'll move this description to the Device operations section, not sure why it was here in the first place. > > > > Future devices might support more modes of operation besides MAP/UNMAP. > > Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail > > @@ -136,8 +156,8 @@ \subsection{Device initialization}\label{sec:Device > > Types / IOMMU Device / Devic > > > > \devicenormative{\subsubsection}{Device Initialization}{Device Types / > > IOMMU Device / Device Initialization} > > > > -If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the > > -device SHOULD NOT let endpoints access the guest-physical address space. > > +The device SHOULD NOT let unattached endpoints that are not in > > +bypass mode access the guest-physical address space. > SHOULD NOT or MUST NOT. Can you remind me of the difference? They are defined in rfc2119 https://datatracker.ietf.org/doc/html/rfc2119 SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior
[virtio-dev] [PATCH v2] content: Remove duplicate paragraph
It looks like commit 356aeeb40d7a ("content: add vendor specific cfg type") had a merge issue. It includes the device normative paragraph for Shared memory capability, which was already added right above it by commit 855ad7af2bd6 ("shared memory: Define PCI capability"). The two paragraphs differ, and the first paragraph is correct. It refers to struct virtio_pci_cap64 which embeds struct virtio_pci_cap: struct virtio_pci_cap64 { struct virtio_pci_cap { ... le32 offset; le32 length; } cap; u32 offset_hi; u32 length_hi; } Merge the two paragraph while keeping the best of both. Drop the spaces after \field to stay consistent with the rest of the document. Acked-by: Michael S. Tsirkin Signed-off-by: Jean-Philippe Brucker --- v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07815.html --- content.tex | 15 --- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/content.tex b/content.tex index 288e62f..39fe12b 100644 --- a/content.tex +++ b/content.tex @@ -1143,17 +1143,10 @@ \subsubsection{Shared memory capability}\label{sec:Virtio Transport Options / Vi \devicenormative{\paragraph}{Shared memory capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability} -The region defined by the combination of \field{cap.offset}, -\field{offset_hi}, and \field {cap.length}, \field{length_hi} -MUST be contained within the BAR specified by \field{cap.bar}. - -The \field{cap.id} MUST be unique for any one device instance. - -\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability} - -The region defined by the combination of the \field {cap.offset}, -\field {cap.offset_hi}, and \field {cap.length}, \field -{cap.length_hi} fields MUST be contained within the declared bar. +The region defined by the combination of the \field{cap.offset}, +\field{offset_hi}, and \field{cap.length}, \field{length_hi} +fields MUST be contained within the BAR specified by +\field{cap.bar}. The \field{cap.id} MUST be unique for any one device instance. -- 2.33.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH] content: Remove duplicate paragraph
On Mon, Oct 04, 2021 at 12:56:48PM +0200, Cornelia Huck wrote: > On Thu, Sep 30 2021, Jean-Philippe Brucker wrote: > > > It looks like commit 356aeeb40d7a ("content: add vendor specific cfg > > type") had a merge issue. It includes the device normative paragraph for > > Shared memory capability, which was already added right above it by > > commit 855ad7af2bd6 ("shared memory: Define PCI capability"). Remove the > > duplicate paragraph. > > > > Signed-off-by: Jean-Philippe Brucker > > --- > > This applies onto "Fix copy/paste bug in PCI transport paragraph", that > > fixes both paragraph titles. > > https://lists.oasis-open.org/archives/virtio-comment/202108/msg00097.html > > --- > > content.tex | 8 > > 1 file changed, 8 deletions(-) > > > > diff --git a/content.tex b/content.tex > > index 4418ac8..54f7441 100644 > > --- a/content.tex > > +++ b/content.tex > > @@ -1149,14 +1149,6 @@ \subsubsection{Shared memory > > capability}\label{sec:Virtio Transport Options / Vi > > > > The \field{cap.id} MUST be unique for any one device instance. > > > > -\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport > > Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory > > capability} > > - > > -The region defined by the combination of the \field {cap.offset}, > > -\field {cap.offset_hi}, and \field {cap.length}, \field > > -{cap.length_hi} fields MUST be contained within the declared bar. > > - > > -The \field{cap.id} MUST be unique for any one device instance. > > - > > \subsubsection{Vendor data capability}\label{sec:Virtio > > Transport Options / Virtio Over PCI Bus / PCI Device Layout / > > Vendor data capability} > > Hm. This statement is indeed included twice, but the two are not exactly > the same. For the first paragraph, do we want to use a combination of > the two? IOW, something like > > "The region defined by the combination of the \field {cap.offset}, > \field {cap.offset_hi}, and \field {cap.length}, \field > {cap.length_hi} fields MUST be contained within the BAR specified by > \field{cap.bar}." > Makes sense, although thinking more about this it was the first paragraph that was correct, because it refers to: struct virtio_pci_cap64 { struct virtio_pci_cap { ... le32 offset; le32 length; } cap; u32 offset_hi; u32 length_hi; } So it would be \field{cap.offset} and \field{offset_hi}. I'll send a v2 Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH] virtio-iommu: Rework the bypass feature
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete. Although it is implemented by QEMU, it is not supported by any driver as far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The old feature bit is deprecated and will be recycled once versions of QEMU that implement it are not distributed anymore. Two features are missing from virtio-iommu: * The ability for an hypervisor to start the device in bypass mode. The wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at the moment. * The ability for a guest to set individual endpoints in bypass mode when bypass is globally disabled. An OS should have the ability to allow only endpoints it trusts to bypass the IOMMU, while keeping DMA disabled for endpoints it isn't even aware of. At the moment this can only be emulated by creating identity mappings. The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field that allows to enable and disable bypass globally. It also adds a new flag for the ATTACH request. * The hypervisor can start the VM with bypass enabled or, if it knows that the software stack supports it, disabled. The 'bypass' config fields resets to 0 or 1. * Generally the firmware won't have an IOMMU driver and will need to be started in bypass mode, so the bootloader and kernel can be loaded from storage endpoint. For more security, the firmware could implement a minimal virtio-iommu driver that reuses existing virtio support and only touches the config space. It could enable PCI bus mastering in bridges only for the endpoints that need it, enable global IOMMU bypass by flipping a bit, then tear everything down before handing control over to the OS. This prevents vulnerability windows where a malicious endpoint reprograms the IOMMU while the OS is configuring it [1]. The isolation provided by vIOMMUs has mainly been used for securely assigning endpoints to untrusted applications so far, while kernel DMA bypasses the IOMMU. But we can expect boot security to become as important in virtualization as it presently is on bare-metal systems, where some devices are untrusted and must never be able to access memory that wasn't assigned to them. * The OS can enable and disable bypass globally. It can then enable bypass for individual endpoints by attaching them to bypass domains, using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass by attaching them to normal domains. [1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept Morgan, B., Alata, É., Nicomette, V. et al. https://link.springer.com/article/10.1186/s13173-017-0066-7 Signed-off-by: Jean-Philippe Brucker --- The virtio-iommu spec with colored diff is available at https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v1-diff.pdf Apologies for the poorly thought out VIRTIO_IOMMU_F_BYPASS. I didn't spend enough time on it and ignored valuable suggestions. --- virtio-iommu.tex | 69 +--- 1 file changed, 53 insertions(+), 16 deletions(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index efa735b..a2c90ea 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -59,14 +59,19 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} VIRTIO_IOMMU_F_MAP_UNMAP is supported.} \item[VIRTIO_IOMMU_F_BYPASS (3)] - When not attached to a domain, endpoints downstream of the IOMMU - can access the guest-physical address space. + This feature is deprecated. \item[VIRTIO_IOMMU_F_PROBE (4)] The PROBE request is available. \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)] + The global bypass state is described in \field{bypass} of + struct virtio_iommu_config. The domain bypass state is + described by VIRTIO_IOMMU_ATTACH_F_BYPASS. + \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -97,12 +102,19 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / le32 end; } domain_range; le32 probe_size; + u8 bypass; + u8 reserved[3]; }; \end{lstlisting} \drivernormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout} -The driver MUST NOT write to device configuration fields. +When the VIRTIO_IOMMU_F_BYPASS_CONFIG feature is negotiated, the +driver MAY write to \field{bypass}. The driver MUST NOT write to +any other device configuration field. + +If field \field{bypass} contains a value different than 0 or 1, +the driver SHOULD treat it as 0. \devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout} @@ -110,16 +122,24 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / the page granularity
[virtio-dev] [PATCH] content: Remove duplicate paragraph
It looks like commit 356aeeb40d7a ("content: add vendor specific cfg type") had a merge issue. It includes the device normative paragraph for Shared memory capability, which was already added right above it by commit 855ad7af2bd6 ("shared memory: Define PCI capability"). Remove the duplicate paragraph. Signed-off-by: Jean-Philippe Brucker --- This applies onto "Fix copy/paste bug in PCI transport paragraph", that fixes both paragraph titles. https://lists.oasis-open.org/archives/virtio-comment/202108/msg00097.html --- content.tex | 8 1 file changed, 8 deletions(-) diff --git a/content.tex b/content.tex index 4418ac8..54f7441 100644 --- a/content.tex +++ b/content.tex @@ -1149,14 +1149,6 @@ \subsubsection{Shared memory capability}\label{sec:Virtio Transport Options / Vi The \field{cap.id} MUST be unique for any one device instance. -\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability} - -The region defined by the combination of the \field {cap.offset}, -\field {cap.offset_hi}, and \field {cap.length}, \field -{cap.length_hi} fields MUST be contained within the declared bar. - -The \field{cap.id} MUST be unique for any one device instance. - \subsubsection{Vendor data capability}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Vendor data capability} -- 2.33.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH] virtio-iommu: Add VIRTIO_IOMMU_F_BOOT_BYPASS
On Thu, Feb 25, 2021 at 02:11:17PM -0500, Michael S. Tsirkin wrote: > On Thu, Feb 25, 2021 at 06:53:15PM +0100, Jean-Philippe Brucker wrote: > > Specify the behavior of the device before feature negotiation. > > Implementations that allow DMA to bypass the IOMMU during boot inform > > the driver by setting the VIRTIO_IOMMU_F_BOOT_BYPASS feature. > > Negotiating the feature doesn't have any effect. > > from spec text it kind of looks like it does, after > FEATURES_OK devices are disallowed access? Before FEATURES_OK the BOOT_BYPASS feature defines the policy chosen by the device implementation. After FEATURES_OK the driver overrides this policy using the BYPASS feature. Thinking more about this, we can't redefine F_BYPASS now (QEMU offers it), but I'm tempted to deprecate it and replace it with a new feature bit that indicates presence of a bypass field in config space. Device sets the byte to 0 or 1 to declare its default bypass policy, and driver can override this by writing 0 or 1 (currently done by accepting or refusing F_BYPASS). It would be a lot cleaner than this. Or just state that the boot-bypass behavior is up to the implementation and leave it at that. > > Clarify the description for VIRTIO_IOMMU_F_BYPASS while we're at it, > > because "downstream of the IOMMU" is confusing. > > > > Signed-off-by: Jean-Philippe Brucker > > are the two bypass features dependend on each other then? No the device can offer one without the other. Thanks, Jean > > > --- > > virtio-iommu.tex | 28 > > 1 file changed, 20 insertions(+), 8 deletions(-) > > > > diff --git a/virtio-iommu.tex b/virtio-iommu.tex > > index 08b358a..4f34a14 100644 > > --- a/virtio-iommu.tex > > +++ b/virtio-iommu.tex > > @@ -59,7 +59,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU > > Device / Feature bits} > >VIRTIO_IOMMU_F_MAP_UNMAP is supported.} > > > > \item[VIRTIO_IOMMU_F_BYPASS (3)] > > - When not attached to a domain, endpoints downstream of the IOMMU > > + When not attached to a domain, endpoints managed by the IOMMU > >can access the guest-physical address space. > > > > \item[VIRTIO_IOMMU_F_PROBE (4)] > > @@ -67,6 +67,10 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU > > Device / Feature bits} > > > > \item[VIRTIO_IOMMU_F_MMIO (5)] > >The VIRTIO_IOMMU_MAP_F_MMIO flag is available. > > + > > +\item[VIRTIO_IOMMU_F_BOOT_BYPASS (6)] > > + Before feature negotiation, endpoints managed by the IOMMU > > + can access the guest-physical address space. > > \end{description} > > > > \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device > > / Feature bits} > > @@ -114,12 +118,15 @@ \subsection{Device initialization}\label{sec:Device > > Types / IOMMU Device / Devic > > > > When the device is reset, endpoints are not attached to any domain. > > > > -If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from > > -unattached endpoints are allowed and translated by the IOMMU using the > > -identity function. If the feature is not negotiated, any memory access > > -from an unattached endpoint fails. Upon attaching an endpoint in > > -bypass mode to a new domain, any memory access from the endpoint fails, > > -since the domain does not contain any mapping. > > +Memory accesses from an endpoint bypass the IOMMU, that is all > > +accesses are allowed and translated using the identity function, > > +in the following cases: > > +\begin{itemize} > > +\item If the VIRTIO_IOMMU_F_BOOT_BYPASS feature is offered and > > + the FEATURES_OK status bit is not set. > > confused. so this feature *only* has effect before FEATURES_OK? > > > > +\item If the VIRTIO_IOMMU_F_BYPASS feature is negotiated and the > > + endpoint is not attached to a domain. > > +\end{itemize} > > > > Future devices might support more modes of operation besides MAP/UNMAP. > > Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail > > @@ -136,8 +143,13 @@ \subsection{Device initialization}\label{sec:Device > > Types / IOMMU Device / Devic > > > > \devicenormative{\subsubsection}{Device Initialization}{Device Types / > > IOMMU Device / Device Initialization} > > > > +If the device does not offer the VIRTIO_IOMMU_F_BOOT_BYPASS > > +feature, it SHOULD NOT let endpoints access the guest-physical > > +address space before feature negotiation is complete. > > + > > If the driver does not accept the VI
[virtio-dev] [PATCH] virtio-iommu: Add VIRTIO_IOMMU_F_BOOT_BYPASS
Specify the behavior of the device before feature negotiation. Implementations that allow DMA to bypass the IOMMU during boot inform the driver by setting the VIRTIO_IOMMU_F_BOOT_BYPASS feature. Negotiating the feature doesn't have any effect. Clarify the description for VIRTIO_IOMMU_F_BYPASS while we're at it, because "downstream of the IOMMU" is confusing. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 28 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..4f34a14 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -59,7 +59,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} VIRTIO_IOMMU_F_MAP_UNMAP is supported.} \item[VIRTIO_IOMMU_F_BYPASS (3)] - When not attached to a domain, endpoints downstream of the IOMMU + When not attached to a domain, endpoints managed by the IOMMU can access the guest-physical address space. \item[VIRTIO_IOMMU_F_PROBE (4)] @@ -67,6 +67,10 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_BOOT_BYPASS (6)] + Before feature negotiation, endpoints managed by the IOMMU + can access the guest-physical address space. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -114,12 +118,15 @@ \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Devic When the device is reset, endpoints are not attached to any domain. -If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from -unattached endpoints are allowed and translated by the IOMMU using the -identity function. If the feature is not negotiated, any memory access -from an unattached endpoint fails. Upon attaching an endpoint in -bypass mode to a new domain, any memory access from the endpoint fails, -since the domain does not contain any mapping. +Memory accesses from an endpoint bypass the IOMMU, that is all +accesses are allowed and translated using the identity function, +in the following cases: +\begin{itemize} +\item If the VIRTIO_IOMMU_F_BOOT_BYPASS feature is offered and + the FEATURES_OK status bit is not set. +\item If the VIRTIO_IOMMU_F_BYPASS feature is negotiated and the + endpoint is not attached to a domain. +\end{itemize} Future devices might support more modes of operation besides MAP/UNMAP. Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail @@ -136,8 +143,13 @@ \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Devic \devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU Device / Device Initialization} +If the device does not offer the VIRTIO_IOMMU_F_BOOT_BYPASS +feature, it SHOULD NOT let endpoints access the guest-physical +address space before feature negotiation is complete. + If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the -device SHOULD NOT let endpoints access the guest-physical address space. +device SHOULD NOT let endpoints access the guest-physical address space +after feature negotiation is complete. \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations} -- 2.30.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH 4/4] virtio-iommu: Add Arm 64-bit page tables
Allow a driver to detect support for Arm page tables and attach a set of tables. When the host IOMMU is an Arm SMMUv2, a single address space per endpoint is supported. The driver detects the Arm LPAE page table format in a PROBE request and sends an ATTACH_TABLE request with a pointer to the page directory. When the host IOMMU is an Arm SMMUv3, an intermediate "Context Descriptor" (CD) table has to be attached. That table is indexed by PASID and contains multiple pointers to page directories: CD table Page directory +---+ ,--->+---+ : : |: : +---+ |+---+ PASID->| CD |---'VA->| PTE |--> GPA +---++---+ : :: : +---++---+ The driver detects support for both table formats in a PROBE request and sends an ATTACH_TABLE request with a pointer to the context descriptor table. These features are voluntarily omitted: * Stall, which cannot be supported without recoverable page faults. I do plan to implement this once we get to the I/O Page fault extensions. * Big endian page tables. * AArch32 LPAE format and per-armv8 short descriptor format. * Page-based Hardware Attributes (PBHA), Auxiliary Memory Attributes (AMAIR) * Memory Partioning And Monitoring (MPAM) Support for any of these could easily be added later, by adding flags to the PROBE property and fields in the ATTACH_TABLE config. However, most of these extensions might never be needed. Signed-off-by: Jean-Philippe Brucker --- introduction.tex | 6 ++ virtio-iommu.tex | 180 +++ 2 files changed, 186 insertions(+) diff --git a/introduction.tex b/introduction.tex index bc0217f..1603726 100644 --- a/introduction.tex +++ b/introduction.tex @@ -76,6 +76,12 @@ \section{Normative References}\label{sec:Normative References} \phantomsection\label{intro:I2C}\textbf{[I2C]} & I2C-bus specification and user manual, \newline\url{https://www.nxp.com/docs/en/user-guide/UM10204.pdf}\\ + \phantomsection\label{intro:SMMUv3}\textbf{[SMMUv3]} & + Arm System Memory Management Unit version 3 + \newline\url{https://developer.arm.com/documentation/ihi0070/latest} \\ + \phantomsection\label{intro:Armv8-A}\textbf{[ARMv8-A]} & + Armv8-A Architecture Reference Manual + \newline\url{https://developer.arm.com/documentation/ddi0487/latest} \\ \end{longtable} diff --git a/virtio-iommu.tex b/virtio-iommu.tex index e14bd67..bc18342 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -1176,3 +1176,183 @@ \subsubsection{Fault reporting}\label{sec:Device Types / IOMMU Device / Device o \footnotetext{This would happen for example if the device implements a more recent version of this specification, whose fault report contains additional fields.} + +\subsection{Table formats}\label{sec:Device Types / IOMMU Device / Table Formats} + +Supported table formats in PROBE properties and ATTACH_TABLE +requests are: +\begin{description} + \item[VIRTIO_IOMMU_PST_ARM_SMMU3 (1)] Arm SMMUv3 Context +Descriptor Tables. + \item[VIRTIO_IOMMU_PGT_ARM64 (2)] Arm VMSAv8-64 page tables. +\end{description} + +\subsubsection{Arm SMMUv3 Context Descriptor table}\label{sec:Device Types / IOMMU Device / Table Formats / Arm SMMUv3 Context Descriptor} + +Attach Context Descriptor tables (PASID tables) in the format +described in the \hyperref[intro:SMMUv3]{Arm System Memory +Management Unit v3 specification}. The table contains context +descriptors indexed by PASID, each pointing to a page directory. + +\paragraph{PROBE properties for Arm SMMUv3 Context Descriptor tables}\label{sec:Device Types / IOMMU Device / Table Formats / Arm SMMUv3 Context Descriptor / PROBE} + +\begin{lstlisting} +struct virtio_iommu_probe_pst_arm_smmu3 { + struct virtio_iommu_probe_property head; + le16 format; + u8reserved[2]; + le64 flags; +}; + +#define VIRTIO_IOMMU_PST_ARM_SMMU3_F_BTM(1ULL << 0) +\end{lstlisting} + +Supported flags are: +\begin{description} + \item[VIRTIO_IOMMU_PST_ARM_SMMU3_F_BTM] Broadcast TLB +maintenance is supported. INVALIDATE requests for +\field{caches} VIRTIO_IOMMU_INVAL_C_TLB can be replaced by +broadcast TLBI instructions. INVALIDATE requests for +\field{caches} VIRTIO_IOMMU_INVAL_C_EP_TLB, if the endpoint +has PCIe ATS enabled, are still necessary. +\end{description} + +\paragraph{ATTACH_TABLE request for Arm SMMUv3 Context Descriptor tables}\label{sec:Device Types / IOMMU Device / Table Formats / Arm SMMUv3 Context Descriptor / ATTACH_TABLE} + +\begin{lstlisting} +struct virtio_iommu_req_attach_pst_arm_smmu3 { + struct virtio_iommu_req_head head; + le32 domain; + le32 endpoint; + le16 format; + u8s1fmt; + u8s1dss; + le6
[virtio-dev] [PATCH 3/4] virtio-iommu: Add PROBE properties for tables
Add several PROBE properties required for attaching tables to a domain. The MAP_UNMAP interface is simple enough that the same global properties can be used for all endpoints managed by the device. This won't be the case for the table interface. Hardware and emulated endpoints have difference properties, which have to be described in PROBE requests. * PAGE_SIZE_MASK: supported mapping granularity * INPUT_RANGE: input address size (IOVA) * OUTPUT_SIZE: output address size (GPA) * PASID_SIZE: Process Address Space ID size. To know the maximum size of a PASID table. * TABLE_FORMAT: PASID or page table format. These will be described in subsequent patches. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 159 ++- 1 file changed, 158 insertions(+), 1 deletion(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 1ca2638..e14bd67 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -871,7 +871,12 @@ \subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device ope \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties} \begin{lstlisting} -#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK 2 +#define VIRTIO_IOMMU_PROBE_T_INPUT_RANGE3 +#define VIRTIO_IOMMU_PROBE_T_OUTPUT_SIZE4 +#define VIRTIO_IOMMU_PROBE_T_PASID_SIZE 5 +#define VIRTIO_IOMMU_PROBE_T_TABLE_FORMAT 6 \end{lstlisting} \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM} @@ -934,6 +939,158 @@ \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions to affect any other component than the endpoint and the driver. + +\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / PAGE_SIZE_MASK} + +\begin{lstlisting} +struct virtio_iommu_probe_page_size_mask { + struct virtio_iommu_probe_property head; + u8reserved[4]; + le64 page_size_mask; +}; +\end{lstlisting} + +The PAGE_SIZE_MASK property overrides the global +\field{page_size_mask} configuration for an endpoint. + +The \field{page_size_mask} field behaves in the same way as the +global \field{page_size_mask} field, described in \ref{sec:Device +Types / IOMMU Device / Device operations}. The least significant +bit describes the mapping granularity, while additional bits are +hints. + +\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGE_SIZE_MASK} + +The driver MUST ignore \field{reserved}. + +\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGE_SIZE_MASK} + +The device SHOULD set \field{reserved} to zero. + +The device MAY present multiple PAGE_SIZE_MASK property per +endpoint. + + +\paragraph{Property INPUT_RANGE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / INPUT_RANGE} + +\begin{lstlisting} +struct virtio_iommu_probe_input_range { + struct virtio_iommu_probe_property head; + u8reserved[4]; + le64 start; + le64 end; +}; +\end{lstlisting} + +The INPUT_RANGE property overrides the global \field{input_range} +configuration for an endpoint. + +Fields \field{start} and \field{end} behave in the same way as +the global \field{input_range.start} and \field{input_range.end} +fields, described in \ref{sec:Device Types / IOMMU Device / +Device operations}. + +\drivernormative{\subparagraph}{Property INPUT_RANGE}{Device Types / IOMMU Device / Device operations / PROBE properties / INPUT_RANGE} + +The driver MUST ignore \field{reserved}. + +\devicenormative{\subparagraph}{Property INPUT_RANGE}{Device Types / IOMMU Device / Device operations / PROBE properties / INPUT_RANGE} + +The device SHOULD set \field{reserved} to zero. + +The device SHOULD NOT present more than one INPUT_RANGE property +per endpoint. + +The device MAY present an INPUT_RANGE property even if it does +not offer the VIRTIO_IOMMU_F_INPUT_RANGE feature. + + +\paragraph{Property OUTPUT_SIZE}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / OUTPUT_SIZE} + +\begin{lstlisting} +struct virtio_iommu_probe_output_size { + struct virtio_iommu_probe_property head; + u8 bits_count; + u8 reserved[3]; +}; +\end{lstlisting} + +The OUTPUT_SIZE property describes the maximum guest-physical +address that the device supports for this endpoint. +\field{bits_count} is the number of bits that an address can use. + +\drivernormative{\subparagraph}{Property OUTPUT_SIZE}{Device Types / IOMMU Device / Device operations / PROBE properties / OUTPUT_SIZE} + +The driver MUST ignore \field{reserved}. + +The driver SHOULD NOT use guest-physical addresses larger
[virtio-dev] [PATCH 2/4] virtio-iommu: Add INVALIDATE request
Add a request that invalidates mappings of a domain attached with table. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 151 +++ 1 file changed, 151 insertions(+) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 840f897..1ca2638 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -169,6 +169,7 @@ \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device op #define VIRTIO_IOMMU_T_UNMAP4 #define VIRTIO_IOMMU_T_PROBE5 #define VIRTIO_IOMMU_T_ATTACH_TABLE 6 +#define VIRTIO_IOMMU_T_INVALIDATE 7 \end{lstlisting} A few general-purpose status codes are defined here. @@ -612,6 +613,156 @@ \subsubsection{UNMAP request}\label{sec:Device Types / IOMMU Device / Device ope mapping, then the device SHOULD remove all mappings affected by the range and set the request \field{status} to VIRTIO_IOMMU_S_OK. +\subsubsection{INVALIDATE request}\label{sec:Device Types / IOMMU Device / Device operations / INVALIDATE request} + +\begin{lstlisting} +struct virtio_iommu_req_invalidate { + struct virtio_iommu_req_head head; + u8scope; + u8caches; + le16 flags; + le32 domain; + le32 pasid; + le64 id; + le64 virt_start; + le64 nr_pages; + u8page_size; + u8reserved[19]; + struct virtio_iommu_req_tail tail; +}; + +#define VIRTIO_IOMMU_INVAL_S_DOMAIN 1 +#define VIRTIO_IOMMU_INVAL_S_PASID2 +#define VIRTIO_IOMMU_INVAL_S_ADDRESS 3 + +#define VIRTIO_IOMMU_INVAL_C_PASID(1 << 0) +#define VIRTIO_IOMMU_INVAL_C_TLB (1 << 1) +#define VIRTIO_IOMMU_INVAL_C_EP_TLB (1 << 2) + +#define VIRTIO_IOMMU_INVAL_F_LEAF (1 << 0) +#define VIRTIO_IOMMU_INVAL_F_PASID(1 << 1) +#define VIRTIO_IOMMU_INVAL_F_ID (1 << 2) + +\end{lstlisting} + +Invalidate a mapping or configuration. When using PASID or page +tables, the driver sends an INVALIDATE request to signal changes +to table elements that could have been cached by the device, such +as a mapping removal. + +Field \field{scope} specifies which entries to invalidate: +\begin{description} + \item[VIRTIO_IOMMU_INVAL_S_DOMAIN] invalidates all cached +entries for the given \field{domain}. + \item[VIRTIO_IOMMU_INVAL_S_PASID] invalidates all cached +entries for the given \field{domain} and \field{pasid}. + \item[VIRTIO_IOMMU_INVAL_S_ADDRESS] invalidates all cached +entries for the given \field{domain}, \field{pasid} and +virtual address range. +\end{description} + +Field \field{caches} specifies which caches to invalidate: +\begin{description} + \item[VIRTIO_IOMMU_INVAL_C_PASID] Invalidate entries cached +from the PASID table. + \item[VIRTIO_IOMMU_INVAL_C_TLB] Invalidate entries from the +Translation Lookaside Buffer (TLB). + \item[VIRTIO_IOMMU_INVAL_C_EP_TLB] Invalidate entries from the +TLB of endpoints attached to the domain. For example when a +PCIe endpoint has an Address Translation Cache (ATC), +discovered and enabled through the PCIe ATS capability. +\end{description} + +For example, when the driver removes a mapping from a table +attached to the \field{domain}, it sends an INVALIDATE request +with \field{scope} VIRTIO_IOMMU_INVAL_S_ADDRESS, \field{caches} +VIRTIO_IOMMU_INVAL_C_TLB and, if necessary, +VIRTIO_IOMMU_INVAL_C_EP_TLB. To invalidate the whole address +space the driver sets the \field{scope} to +VIRTIO_IOMMU_INVAL_S_DOMAIN, or to VIRTIO_IOMMU_INVAL_S_PASID +when using a PASID table. When removing a page table directory +from a PASID table entry, the driver sends an INVALIDATE request +with \field{scope} VIRTIO_IOMMU_INVAL_S_PASID, and all +\field{caches} flags set. + +Field \field{flags} specifies additional information: +\begin{description} + \item[VIRTIO_IOMMU_INVAL_F_LEAF] Only invalidate TLB entries +cached from leaf table entries. + \item[VIRTIO_IOMMU_INVAL_F_PASID] The \field{pasid} field is +valid. + \item[VIRTIO_IOMMU_INVAL_F_ID] The \field{id} field is valid. +\end{description} + +Use of field \field{id} is specific to the table format. Some +formats use only the PASID to identify address spaces within a +domain, others use a separate ID for TLB entries. + +For a \field{scope} of VIRTIO_IOMMU_INVAL_S_ADDRESS, the +invalidation affects the range of virtual addresses of size +$\field{nr\_pages} \times 2^\field{page\_size}$, starting at +\field{virt_start}. Specifying the range size this way allows the +device to efficiently remove TLB entries. For example a page +table format could allow a 2MiB range to be mapped with either +512 4KiB pages, or a single 2MiB block. In the latter case a +single TLB entry is used, so the driver specifies a +\field{page_size} of 21 and the device does not need to iterate +over all 4KiB multiples to invalidate the range. + +The following table describes the restricted set of valid +\field{caches}, \field{flags} and fields for each \field{scope}: + +\begin{tabular}{|l|c|c|c|} +\hline + \fiel
[virtio-dev] [PATCH 1/4] virtio-iommu: Add ATTACH_TABLE request
Add a way to attach page tables to a domain. Only introduces the feature bit and the ATTACH_TABLE request. We'll also need new PROBE properties and INVALIDATE requests. I chose ATTACH_TABLE request rather than ATTACH followed by SET_TABLE because it leads to fewer domain states. With ATTACH_TABLE a domain can be free, attached-with-map or attached-with-pgtable. With ATTACH+SET_TABLE, a domain can be free, attached, attached-with-map or attached-with-pgtable. I believe managing state transitions for ATTACH_TABLE is simpler. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 74 ++-- 1 file changed, 65 insertions(+), 9 deletions(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index efa735b..840f897 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -54,9 +54,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} The number of domains supported is described in \field{domain_range}. \item[VIRTIO_IOMMU_F_MAP_UNMAP (2)] - Map and unmap requests are available.\footnote{Future extensions may add - different modes of operations. At the moment, only - VIRTIO_IOMMU_F_MAP_UNMAP is supported.} + Map and unmap requests are available. \item[VIRTIO_IOMMU_F_BYPASS (3)] When not attached to a domain, endpoints downstream of the IOMMU @@ -67,6 +65,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_ATTACH_TABLE (6)] + The ATTACH_TABLE and INVALIDATE requests are available. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -77,7 +78,8 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} -The device SHOULD offer feature bit VIRTIO_IOMMU_F_MAP_UNMAP. +The device MUST offer at least one of the +VIRTIO_IOMMU_F_MAP_UNMAP or VIRTIO_IOMMU_F_ATTACH_TABLE features. \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / Device configuration layout} @@ -161,11 +163,12 @@ \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device op Type may be one of: \begin{lstlisting} -#define VIRTIO_IOMMU_T_ATTACH 1 -#define VIRTIO_IOMMU_T_DETACH 2 -#define VIRTIO_IOMMU_T_MAP3 -#define VIRTIO_IOMMU_T_UNMAP 4 -#define VIRTIO_IOMMU_T_PROBE 5 +#define VIRTIO_IOMMU_T_ATTACH 1 +#define VIRTIO_IOMMU_T_DETACH 2 +#define VIRTIO_IOMMU_T_MAP 3 +#define VIRTIO_IOMMU_T_UNMAP4 +#define VIRTIO_IOMMU_T_PROBE5 +#define VIRTIO_IOMMU_T_ATTACH_TABLE 6 \end{lstlisting} A few general-purpose status codes are defined here. @@ -301,6 +304,10 @@ \subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device op The driver SHOULD ensure that endpoints that cannot be isolated from each other are attached to the same domain. +The driver SHOULD NOT create the \field{domain} with an ATTACH +request if the VIRTIO_IOMMU_F_MAP_UNMAP feature was not +negotiated. In this case the ATTACH_TABLE request is used. + \devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / Device operations / ATTACH request} If the \field{reserved} field of an ATTACH request is not zero, the device @@ -330,6 +337,49 @@ \subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device op A device that does not reject the request MUST attach the endpoint. +\subsubsection{ATTACH_TABLE request}\label{ref:Device Types / IOMMU Device / Device operations / ATTACH_TABLE request} + +\begin{lstlisting} +struct virtio_iommu_req_attach_table { + struct virtio_iommu_req_head head; + le32 domain; + le32 endpoint; + le16 format; + u8 descriptor[62]; + struct virtio_iommu_req_tail tail; +}; +\end{lstlisting} + +Attach an endpoint to a domain, in the same way as an ATTACH +request. In addition, provide a pointer to a table describing the +mappings. Instead of using MAP and UNMAP requests, the driver +creates and removes mappings by writing into the tables. When a +mapping that may have been cached in a TLB is removed, the driver +sends an INVALIDATE request. + +The driver discovers with a PROBE request the table formats +supported by the device. The device can support multiple page +tables and PASID table formats. A PASID table contains pointers +to one or more page tables. The driver chooses a format it +recognises, and sends the ATTACH_TABLE request. + +Apart from \field{domain}, \field{endpoint} and \field{format}, +each table format uses different parameters in field +\field{descriptor}. Table formats and their associated parameters +are described in section \ref{sec:Device Types / IOMMU Device / +Table Formats}. + +\drivernormative{\paragraph}{ATTACH_TABLE
[virtio-dev] [PATCH 0/4] virtio-iommu: Support page tables
Add page table support to the virtio-iommu specification, to enable acceleration and later Shared Virtual Memory. The interface currently offered by virtio-iommu is simple to implement and sufficiently efficient for solutions based on static mappings, such as DPDK running in guest userspace. While that's an essential use-case for virtio-iommu, it is not the only DMA remapping usage, and the MAP/UNMAP interface performs rather poorly for dynamic mappings. Unsurprisingly a given vIOMMU model cannot be simple, efficient and secure for all possible use-cases. It's always a compromise. This extension adds: * New PROBE properties that describe table formats available for each endpoints. * An ATTACH_TABLE request to be used in place of an ATTACH, that provides a guest-physical address to a table. * An INVALIDATE request to signal destructive changes. Those tables may be: (1) Page tables that are read directly by the hardware IOMMU, for hardware endpoints. An example for Arm SMMU is given in patch 4. The driver writes mappings into the table, and sends invalidations when removing mappings that could have been cached in an IOTLB. This divides the number of context switches to the host by at least two (much more with lazy invalidation), and with a vhost IOMMU the invalidation itself should be relatively cheap compared to an emulated viommu. I don't expect a large proliferation of table formats because they are relatively stable. A few extensible formats is more likely: two tables for Arm until they run out of PTE bits, one or two page tables for x86, and a couple for other architectures. (2) Tables that are read by a software component performing DMA, for emulated and paravirtual endpoints. These could be your garden-variety page tables from (1) but being paravirtual we have the opportunity to design more efficient solutions. For example a virtio/vhost endpoint accesses rings often and leaf buffers once. If the virtio-iommu device avoids caching leaf buffers, the driver does not need to send invalidations and we get almost no context switches for vIOMMU operations. A purposely designed page table can provide a handshake between device and driver to avoid sending invalidations for uncached mappings. (3) Tables that complement the MAP_UNMAP interface to help reduce context switches. The coIOMMU model [1] provides a DMA tracking table for co-operative pinning. An example driver implementation was submitted by Vivek Gautam [2], based on an older version of this extension (0.10-dev03). Since then I've reworked some of the structures and improved the wording. To ease review I uploaded a PDF with these changes, plus one with diff colors, here: https://jpbrucker.net/virtio-iommu/spec/virtio-iommu-v0.13.pdf https://jpbrucker.net/virtio-iommu/spec/virtio-iommu-diff-v0.12-v0.13.pdf [1] https://www.usenix.org/system/files/atc20-tian.pdf [2] https://lore.kernel.org/linux-iommu/20210115121342.15093-1-vivek.gau...@arm.com/ Jean-Philippe Brucker (4): virtio-iommu: Add ATTACH_TABLE request virtio-iommu: Add INVALIDATE request virtio-iommu: Add PROBE properties for tables virtio-iommu: Add Arm 64-bit page tables introduction.tex | 6 + virtio-iommu.tex | 564 ++- 2 files changed, 560 insertions(+), 10 deletions(-) -- 2.30.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
Hi Al, On Tue, Nov 03, 2020 at 01:09:04PM -0700, Al Stone wrote: > So, there are some questions about the VIOT definition and I just > don't know enough to be able to answer them. One of the ASWG members > is trying to understand the semantics behind the subtables. Thanks for the update. We dropped subtables a few versions ago, though, do you have the latest v8? https://jpbrucker.net/virtio-iommu/viot/viot-v8.pdf > Is there a particular set of people, or mailing lists, that I can > point to to get the questions answered? Ideally it would be one > of the public lists where it has already been discussed, but an > individual would be fine, too. No changes have been proposed, just > some questions asked. For a public list, I suggest io...@lists.linux-foundation.org if we should pick only one (otherwise add virtualizat...@lists.linux-foundation.org and virtio-dev@lists.oasis-open.org). I'm happy to answer any question, and the folks on here are a good set to Cc: eric.au...@redhat.com jean-phili...@linaro.org j...@8bytes.org kevin.t...@intel.com lorenzo.pieral...@arm.com m...@redhat.com sebastien.bo...@intel.com Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
On Fri, Sep 25, 2020 at 06:22:57AM -0400, Michael S. Tsirkin wrote: > On Fri, Sep 25, 2020 at 10:48:06AM +0200, Jean-Philippe Brucker wrote: > > On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote: > > > Add a topology description to the virtio-iommu driver and enable x86 > > > platforms. > > > > > > Since [v2] we have made some progress on adding ACPI support for > > > virtio-iommu, which is the preferred boot method on x86. It will be a > > > new vendor-agnostic table describing para-virtual topologies in a > > > minimal format. However some platforms don't use either ACPI or DT for > > > booting (for example microvm), and will need the alternative topology > > > description method proposed here. In addition, since the process to get > > > a new ACPI table will take a long time, this provides a boot method even > > > to ACPI-based platforms, if only temporarily for testing and > > > development. > > > > > > v3: > > > * Add patch 1 that moves virtio-iommu to a subfolder. > > > * Split the rest: > > > * Patch 2 adds topology-helper.c, which will be shared with the ACPI > > > support. > > > * Patch 4 adds definitions. > > > * Patch 5 adds parser in topology.c. > > > * Address other comments. > > > > > > Linux and QEMU patches available at: > > > https://jpbrucker.net/git/linux virtio-iommu/devel > > > https://jpbrucker.net/git/qemu virtio-iommu/devel > > > > I'm parking this work again, until we make progress on the ACPI table, or > > until a platform without ACPI and DT needs it. Until then, I've pushed v4 > > to my virtio-iommu/topo branch and will keep it rebased on master. > > > > Thanks, > > Jean > > I think you guys need to work on virtio spec too, not too much left to > do there ... I know it's ready and I'd really like to move on with this, but I'd rather not commit it to the spec until we know it's going to be used at all. As Gerd pointed out the one example we had, microvm, now supports ACPI. Since we've kicked off the ACPI work anyway it isn't clear that the built-in topology will be useful. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote: > Add a topology description to the virtio-iommu driver and enable x86 > platforms. > > Since [v2] we have made some progress on adding ACPI support for > virtio-iommu, which is the preferred boot method on x86. It will be a > new vendor-agnostic table describing para-virtual topologies in a > minimal format. However some platforms don't use either ACPI or DT for > booting (for example microvm), and will need the alternative topology > description method proposed here. In addition, since the process to get > a new ACPI table will take a long time, this provides a boot method even > to ACPI-based platforms, if only temporarily for testing and > development. > > v3: > * Add patch 1 that moves virtio-iommu to a subfolder. > * Split the rest: > * Patch 2 adds topology-helper.c, which will be shared with the ACPI > support. > * Patch 4 adds definitions. > * Patch 5 adds parser in topology.c. > * Address other comments. > > Linux and QEMU patches available at: > https://jpbrucker.net/git/linux virtio-iommu/devel > https://jpbrucker.net/git/qemu virtio-iommu/devel I'm parking this work again, until we make progress on the ACPI table, or until a platform without ACPI and DT needs it. Until then, I've pushed v4 to my virtio-iommu/topo branch and will keep it rebased on master. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 5/6] iommu/virtio: Support topology description in config space
On Thu, Sep 24, 2020 at 10:22:03AM -0500, Bjorn Helgaas wrote: > On Fri, Aug 21, 2020 at 03:15:39PM +0200, Jean-Philippe Brucker wrote: > > Platforms without device-tree nor ACPI can provide a topology > > description embedded into the virtio config space. Parse it. > > > > Use PCI FIXUP to probe the config space early, because we need to > > discover the topology before any DMA configuration takes place, and the > > virtio driver may be loaded much later. Since we discover the topology > > description when probing the PCI hierarchy, the virtual IOMMU cannot > > manage other platform devices discovered earlier. > > > +struct viommu_cap_config { > > + u8 bar; > > + u32 length; /* structure size */ > > + u32 offset; /* structure offset within the bar */ > > s/the bar/the BAR/ (to match comment below). > > > +static void viommu_pci_parse_topology(struct pci_dev *dev) > > +{ > > + int ret; > > + u32 features; > > + void __iomem *regs, *common_regs; > > + struct viommu_cap_config cap = {0}; > > + struct virtio_pci_common_cfg __iomem *common_cfg; > > + > > + /* > > +* The virtio infrastructure might not be loaded at this point. We need > > +* to access the BARs ourselves. > > +*/ > > + ret = viommu_pci_find_capability(dev, VIRTIO_PCI_CAP_COMMON_CFG, ); > > + if (!ret) { > > + pci_warn(dev, "common capability not found\n"); > > Is the lack of this capability really an error, i.e., is this > pci_warn() or pci_info()? The "device doesn't have topology > description" below is only pci_dbg(), which suggests that we can live > without this. At this point we know that this is a (modern) virtio-pci device which, according to the virtio 1.0 specification, must have this capability. So this is definitely an error, but the topology description is an optional feature. > > Maybe a hint about what "common capability" means? Yes, "virtio-pci common configuration capability" would be more appropriate > > > + return; > > + } > > + > > + if (pci_enable_device_mem(dev)) > > + return; > > + > > + common_regs = pci_iomap(dev, cap.bar, 0); > > + if (!common_regs) > > + return; > > + > > + common_cfg = common_regs + cap.offset; > > + > > + /* Perform the init sequence before we can read the config */ > > + ret = viommu_pci_reset(common_cfg); > > I guess this is some special device-specific reset, not any kind of > standard PCI reset? Yes it's the virtio reset - writing 0 to the status register in the BAR. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
On Thu, Sep 24, 2020 at 12:02:55PM +0200, Joerg Roedel wrote: > On Thu, Sep 24, 2020 at 05:38:13AM -0400, Michael S. Tsirkin wrote: > > On Thu, Sep 24, 2020 at 11:21:29AM +0200, Joerg Roedel wrote: > > > On Thu, Sep 24, 2020 at 05:00:35AM -0400, Michael S. Tsirkin wrote: > > > > OK so this looks good. Can you pls repost with the minor tweak > > > > suggested and all acks included, and I will queue this? > > > > > > My NACK still stands, as long as a few questions are open: > > > > > > 1) The format used here will be the same as in the ACPI table? I > > > think the answer to this questions must be Yes, so this leads > > > to the real question: > > > > I am not sure it's a must. > > It is, having only one parser for the ACPI and MMIO descriptions was one > of the selling points for MMIO in past discussions and I think it makes > sense to keep them in sync. It's not possible to use exactly the same code for parsing. The access methods are different (need to deal with port-IO for built-in description on PCI, for example) and more importantly, the structure is different as well. The ACPI table needs nodes for virtio-iommu while the built-in description is contained in the virtio-iommu itself. So the endpoint nodes point to virtio-iommu node on ACPI, while they don't need a pointer on the built-in desc. I kept as much as possible common in structures and implementation, but in the end we still need about 200 unique lines on each side. Thanks, Jean > > > We can always tweak the parser if there are slight differences > > between ACPI and virtio formats. > > There is no guarantee that there only need to be "tweaks" until the > ACPI table format is stablized. > > Regards, > > Joerg > - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 5/6] iommu/virtio: Support topology description in config space
On Fri, Sep 04, 2020 at 06:05:35PM +0200, Auger Eric wrote: > > +static inline int viommu_pci_find_capability(struct pci_dev *dev, u8 > > cfg_type, > > +struct viommu_cap_config *cap) > not sure the inline is useful here No, I'll drop it Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 2/6] iommu/virtio: Add topology helpers
On Fri, Sep 04, 2020 at 06:22:12PM +0200, Auger Eric wrote: > > +/** > > + * virt_dma_configure - Configure DMA of virtualized devices > > + * @dev: the endpoint > > + * > > + * Setup the DMA and IOMMU ops of a virtual device, for platforms without > > DT or > > + * ACPI. > > + * > > + * Return: -EPROBE_DEFER if the device is managed by an IOMMU that hasn't > > been > > + * probed yet, 0 otherwise > > + */ > > +int virt_dma_configure(struct device *dev) > > +{ > > + const struct iommu_ops *iommu_ops; > > + > > + iommu_ops = virt_iommu_setup(dev); > > + if (IS_ERR_OR_NULL(iommu_ops)) { > > + int ret = PTR_ERR(iommu_ops); > > + > > + if (ret == -EPROBE_DEFER || ret == 0) > > + return ret; > > + dev_err(dev, "error %d while setting up virt IOMMU\n", ret); > > + return 0; > why do we return 0 here? So we can fall back to another method (ACPI or DT) if available > Besides > > Reviewed-by: Eric Auger Thanks! Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology
On Wed, Aug 26, 2020 at 09:26:02AM -0400, Michael S. Tsirkin wrote: > On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote: > > Add a topology description to the virtio-iommu driver and enable x86 > > platforms. > > > > Since [v2] we have made some progress on adding ACPI support for > > virtio-iommu, which is the preferred boot method on x86. It will be a > > new vendor-agnostic table describing para-virtual topologies in a > > minimal format. However some platforms don't use either ACPI or DT for > > booting (for example microvm), and will need the alternative topology > > description method proposed here. In addition, since the process to get > > a new ACPI table will take a long time, this provides a boot method even > > to ACPI-based platforms, if only temporarily for testing and > > development. > > OK should I park this in next now? Seems appropriate ... Yes that sounds like a good idea. It could uncover new bugs since there is more automated testing happening for x86. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3 4/6] iommu/virtio: Add topology definitions
Add struct definitions for describing endpoints managed by the virtio-iommu. When VIRTIO_IOMMU_F_TOPOLOGY is offered, an array of virtio_iommu_topo_* structures in config space describes the endpoints, identified either by their PCI BDF or their physical MMIO address. Signed-off-by: Jean-Philippe Brucker --- include/uapi/linux/virtio_iommu.h | 44 +++ 1 file changed, 44 insertions(+) diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h index 237e36a280cb..70cba30644d5 100644 --- a/include/uapi/linux/virtio_iommu.h +++ b/include/uapi/linux/virtio_iommu.h @@ -16,6 +16,7 @@ #define VIRTIO_IOMMU_F_BYPASS 3 #define VIRTIO_IOMMU_F_PROBE 4 #define VIRTIO_IOMMU_F_MMIO5 +#define VIRTIO_IOMMU_F_TOPOLOGY6 struct virtio_iommu_range_64 { __le64 start; @@ -27,6 +28,17 @@ struct virtio_iommu_range_32 { __le32 end; }; +struct virtio_iommu_topo_config { + /* Number of topology description structures */ + __le16 count; + /* +* Offset to the first topology description structure +* (virtio_iommu_topo_*) from the start of the virtio_iommu config +* space. Aligned on 8 bytes. +*/ + __le16 offset; +}; + struct virtio_iommu_config { /* Supported page sizes */ __le64 page_size_mask; @@ -36,6 +48,38 @@ struct virtio_iommu_config { struct virtio_iommu_range_32domain_range; /* Probe buffer size */ __le32 probe_size; + struct virtio_iommu_topo_config topo_config; +}; + +#define VIRTIO_IOMMU_TOPO_PCI_RANGE0x1 +#define VIRTIO_IOMMU_TOPO_MMIO 0x2 + +struct virtio_iommu_topo_pci_range { + /* VIRTIO_IOMMU_TOPO_PCI_RANGE */ + __u8type; + __u8reserved; + /* Length of this structure */ + __le16 length; + /* First endpoint ID in the range */ + __le32 endpoint_start; + /* PCI domain number */ + __le16 segment; + /* PCI Bus:Device.Function range */ + __le16 bdf_start; + __le16 bdf_end; + __le16 padding; +}; + +struct virtio_iommu_topo_mmio { + /* VIRTIO_IOMMU_TOPO_MMIO */ + __u8type; + __u8reserved; + /* Length of this structure */ + __le16 length; + /* Endpoint ID */ + __le32 endpoint; + /* Address of the first MMIO region */ + __le64 address; }; /* Request types */ -- 2.28.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3 1/6] iommu/virtio: Move to drivers/iommu/virtio/
Before adding new files to the virtio-iommu driver, move it to its own subfolder, similarly to other IOMMU drivers. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Makefile| 3 +-- drivers/iommu/virtio/Makefile | 2 ++ drivers/iommu/{ => virtio}/virtio-iommu.c | 0 MAINTAINERS | 2 +- 4 files changed, 4 insertions(+), 3 deletions(-) create mode 100644 drivers/iommu/virtio/Makefile rename drivers/iommu/{ => virtio}/virtio-iommu.c (100%) diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 11f1771104f3..fc7523042512 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 -obj-y += amd/ intel/ arm/ +obj-y += amd/ intel/ arm/ virtio/ obj-$(CONFIG_IOMMU_API) += iommu.o obj-$(CONFIG_IOMMU_API) += iommu-traces.o obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o @@ -26,4 +26,3 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o -obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile new file mode 100644 index ..279368fcc074 --- /dev/null +++ b/drivers/iommu/virtio/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio/virtio-iommu.c similarity index 100% rename from drivers/iommu/virtio-iommu.c rename to drivers/iommu/virtio/virtio-iommu.c diff --git a/MAINTAINERS b/MAINTAINERS index deaafb617361..3602b223c9b2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18451,7 +18451,7 @@ VIRTIO IOMMU DRIVER M: Jean-Philippe Brucker L: virtualizat...@lists.linux-foundation.org S: Maintained -F: drivers/iommu/virtio-iommu.c +F: drivers/iommu/virtio/ F: include/uapi/linux/virtio_iommu.h VIRTIO MEM DRIVER -- 2.28.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3 5/6] iommu/virtio: Support topology description in config space
Platforms without device-tree nor ACPI can provide a topology description embedded into the virtio config space. Parse it. Use PCI FIXUP to probe the config space early, because we need to discover the topology before any DMA configuration takes place, and the virtio driver may be loaded much later. Since we discover the topology description when probing the PCI hierarchy, the virtual IOMMU cannot manage other platform devices discovered earlier. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 12 ++ drivers/iommu/virtio/Makefile | 1 + drivers/iommu/virtio/topology.c | 259 3 files changed, 272 insertions(+) create mode 100644 drivers/iommu/virtio/topology.c diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index e29ae50f7100..98d28fdbc19a 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -394,4 +394,16 @@ config VIRTIO_IOMMU config VIRTIO_IOMMU_TOPOLOGY_HELPERS bool +config VIRTIO_IOMMU_TOPOLOGY + bool "Handle topology properties from the virtio-iommu" + depends on VIRTIO_IOMMU + depends on PCI + default y + select VIRTIO_IOMMU_TOPOLOGY_HELPERS + help + Enable early probing of virtio-iommu devices to detect the built-in + topology description. + + Say Y here if you intend to run this kernel as a guest. + endif # IOMMU_SUPPORT diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile index b42ad47eac7e..1eda8ca1cbbf 100644 --- a/drivers/iommu/virtio/Makefile +++ b/drivers/iommu/virtio/Makefile @@ -1,3 +1,4 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += topology.o obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o diff --git a/drivers/iommu/virtio/topology.c b/drivers/iommu/virtio/topology.c new file mode 100644 index ..4923eec618b9 --- /dev/null +++ b/drivers/iommu/virtio/topology.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "topology-helpers.h" + +struct viommu_cap_config { + u8 bar; + u32 length; /* structure size */ + u32 offset; /* structure offset within the bar */ +}; + +struct viommu_topo_header { + u8 type; + u8 reserved; + u16 length; +}; + +static struct virt_topo_endpoint * +viommu_parse_node(void __iomem *buf, size_t len) +{ + int ret = -EINVAL; + union { + struct viommu_topo_header hdr; + struct virtio_iommu_topo_pci_range pci; + struct virtio_iommu_topo_mmio mmio; + } __iomem *cfg = buf; + struct virt_topo_endpoint *spec; + + spec = kzalloc(sizeof(*spec), GFP_KERNEL); + if (!spec) + return ERR_PTR(-ENOMEM); + + switch (ioread8(>hdr.type)) { + case VIRTIO_IOMMU_TOPO_PCI_RANGE: + if (len < sizeof(cfg->pci)) + goto err_free; + + spec->dev_id.type = VIRT_TOPO_DEV_TYPE_PCI; + spec->dev_id.segment = ioread16(>pci.segment); + spec->dev_id.bdf_start = ioread16(>pci.bdf_start); + spec->dev_id.bdf_end = ioread16(>pci.bdf_end); + spec->endpoint_id = ioread32(>pci.endpoint_start); + break; + case VIRTIO_IOMMU_TOPO_MMIO: + if (len < sizeof(cfg->mmio)) + goto err_free; + + spec->dev_id.type = VIRT_TOPO_DEV_TYPE_MMIO; + spec->dev_id.base = ioread64(>mmio.address); + spec->endpoint_id = ioread32(>mmio.endpoint); + break; + default: + pr_warn("unhandled format 0x%x\n", ioread8(>hdr.type)); + ret = 0; + goto err_free; + } + return spec; + +err_free: + kfree(spec); + return ERR_PTR(ret); +} + +static int viommu_parse_topology(struct device *dev, +struct virtio_iommu_config __iomem *cfg, +size_t max_len) +{ + int ret; + u16 len; + size_t i; + LIST_HEAD(endpoints); + size_t offset, count; + struct virt_topo_iommu *viommu; + struct virt_topo_endpoint *ep, *next; + struct viommu_topo_header __iomem *cur; + + offset = ioread16(>topo_config.offset); + count = ioread16(>topo_config.count); + if (!offset || !count) + return 0; + + viommu = kzalloc(sizeof(*viommu), GFP_KERNEL); + if (!viommu) + return -ENOMEM; + + viommu->dev = dev; + + for (i = 0; i < count; i++, offset += len) { + if (offset + sizeof(*cur) > max_len) { + ret
[virtio-dev] [PATCH v3 0/6] Add virtio-iommu built-in topology
Add a topology description to the virtio-iommu driver and enable x86 platforms. Since [v2] we have made some progress on adding ACPI support for virtio-iommu, which is the preferred boot method on x86. It will be a new vendor-agnostic table describing para-virtual topologies in a minimal format. However some platforms don't use either ACPI or DT for booting (for example microvm), and will need the alternative topology description method proposed here. In addition, since the process to get a new ACPI table will take a long time, this provides a boot method even to ACPI-based platforms, if only temporarily for testing and development. v3: * Add patch 1 that moves virtio-iommu to a subfolder. * Split the rest: * Patch 2 adds topology-helper.c, which will be shared with the ACPI support. * Patch 4 adds definitions. * Patch 5 adds parser in topology.c. * Address other comments. Linux and QEMU patches available at: https://jpbrucker.net/git/linux virtio-iommu/devel https://jpbrucker.net/git/qemu virtio-iommu/devel [spec] https://lists.oasis-open.org/archives/virtio-dev/202008/msg00067.html [v2] https://lore.kernel.org/linux-iommu/20200228172537.377327-1-jean-phili...@linaro.org/ [v1] https://lore.kernel.org/linux-iommu/20200214160413.1475396-1-jean-phili...@linaro.org/ [rfc] https://lore.kernel.org/linux-iommu/20191122105000.800410-1-jean-phili...@linaro.org/ Jean-Philippe Brucker (6): iommu/virtio: Move to drivers/iommu/virtio/ iommu/virtio: Add topology helpers PCI: Add DMA configuration for virtual platforms iommu/virtio: Add topology definitions iommu/virtio: Support topology description in config space iommu/virtio: Enable x86 support drivers/iommu/Kconfig | 18 +- drivers/iommu/Makefile| 3 +- drivers/iommu/virtio/Makefile | 4 + drivers/iommu/virtio/topology-helpers.h | 50 + include/linux/virt_iommu.h| 15 ++ include/uapi/linux/virtio_iommu.h | 44 drivers/iommu/virtio/topology-helpers.c | 196 drivers/iommu/virtio/topology.c | 259 ++ drivers/iommu/{ => virtio}/virtio-iommu.c | 4 + drivers/pci/pci-driver.c | 5 + MAINTAINERS | 3 +- 11 files changed, 597 insertions(+), 4 deletions(-) create mode 100644 drivers/iommu/virtio/Makefile create mode 100644 drivers/iommu/virtio/topology-helpers.h create mode 100644 include/linux/virt_iommu.h create mode 100644 drivers/iommu/virtio/topology-helpers.c create mode 100644 drivers/iommu/virtio/topology.c rename drivers/iommu/{ => virtio}/virtio-iommu.c (99%) -- 2.28.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3 3/6] PCI: Add DMA configuration for virtual platforms
Hardware platforms usually describe the IOMMU topology using either device-tree pointers or vendor-specific ACPI tables. For virtual platforms that don't provide a device-tree, the virtio-iommu device contains a description of the endpoints it manages. That information allows us to probe endpoints after the IOMMU is probed (possibly as late as userspace modprobe), provided it is discovered early enough. Add a hook to pci_dma_configure(), which returns -EPROBE_DEFER if the endpoint is managed by a vIOMMU that will be loaded later, or 0 in any other case to avoid disturbing the normal DMA configuration methods. When CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS isn't selected, the call to virt_dma_configure() is compiled out. As long as the information is consistent, platforms can provide both a device-tree and a built-in topology, and the IOMMU infrastructure is able to deal with multiple DMA configuration methods. Acked-by: Bjorn Helgaas Signed-off-by: Jean-Philippe Brucker --- drivers/pci/pci-driver.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 449466f71040..dbe9d33606b0 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "pci.h" #include "pcie/portdrv.h" @@ -1605,6 +1606,10 @@ static int pci_dma_configure(struct device *dev) struct device *bridge; int ret = 0; + ret = virt_dma_configure(dev); + if (ret) + return ret; + bridge = pci_get_host_bridge_device(to_pci_dev(dev)); if (IS_ENABLED(CONFIG_OF) && bridge->parent && -- 2.28.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v3 2/6] iommu/virtio: Add topology helpers
To support topology description from ACPI and from the builtin description, add helpers to keep track of I/O topology descriptors. To ease re-use of the helpers by other drivers and future ACPI extensions, use the "virt_" prefix rather than "virtio_" when naming structs and functions. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 + drivers/iommu/virtio/Makefile | 1 + drivers/iommu/virtio/topology-helpers.h | 50 ++ include/linux/virt_iommu.h | 15 ++ drivers/iommu/virtio/topology-helpers.c | 196 drivers/iommu/virtio/virtio-iommu.c | 4 + MAINTAINERS | 1 + 7 files changed, 270 insertions(+) create mode 100644 drivers/iommu/virtio/topology-helpers.h create mode 100644 include/linux/virt_iommu.h create mode 100644 drivers/iommu/virtio/topology-helpers.c diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index bef5d75e306b..e29ae50f7100 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -391,4 +391,7 @@ config VIRTIO_IOMMU Say Y here if you intend to run this kernel as a guest. +config VIRTIO_IOMMU_TOPOLOGY_HELPERS + bool + endif # IOMMU_SUPPORT diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile index 279368fcc074..b42ad47eac7e 100644 --- a/drivers/iommu/virtio/Makefile +++ b/drivers/iommu/virtio/Makefile @@ -1,2 +1,3 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o diff --git a/drivers/iommu/virtio/topology-helpers.h b/drivers/iommu/virtio/topology-helpers.h new file mode 100644 index ..436ca6a900c5 --- /dev/null +++ b/drivers/iommu/virtio/topology-helpers.h @@ -0,0 +1,50 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef TOPOLOGY_HELPERS_H_ +#define TOPOLOGY_HELPERS_H_ + +#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS + +/* Identify a device node in the topology */ +struct virt_topo_dev_id { + unsigned inttype; +#define VIRT_TOPO_DEV_TYPE_PCI 1 +#define VIRT_TOPO_DEV_TYPE_MMIO2 + union { + /* PCI endpoint or range */ + struct { + u16 segment; + u16 bdf_start; + u16 bdf_end; + }; + /* MMIO region */ + u64 base; + }; +}; + +/* Specification of an IOMMU */ +struct virt_topo_iommu { + struct virt_topo_dev_id dev_id; + struct device *dev; /* transport device */ + struct fwnode_handle*fwnode; + struct iommu_ops*ops; + struct list_headlist; +}; + +/* Specification of an endpoint */ +struct virt_topo_endpoint { + struct virt_topo_dev_id dev_id; + u32 endpoint_id; + struct virt_topo_iommu *viommu; + struct list_headlist; +}; + +void virt_topo_add_endpoint(struct virt_topo_endpoint *ep); +void virt_topo_add_iommu(struct virt_topo_iommu *viommu); + +void virt_topo_set_iommu_ops(struct device *dev, struct iommu_ops *ops); + +#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */ +static inline void virt_topo_set_iommu_ops(struct device *dev, struct iommu_ops *ops) +{ } +#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */ +#endif /* TOPOLOGY_HELPERS_H_ */ diff --git a/include/linux/virt_iommu.h b/include/linux/virt_iommu.h new file mode 100644 index ..17d2bd4732e0 --- /dev/null +++ b/include/linux/virt_iommu.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef VIRT_IOMMU_H_ +#define VIRT_IOMMU_H_ + +#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS +int virt_dma_configure(struct device *dev); + +#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */ +static inline int virt_dma_configure(struct device *dev) +{ + /* Don't disturb the normal DMA configuration methods */ + return 0; +} +#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */ +#endif /* VIRT_IOMMU_H_ */ diff --git a/drivers/iommu/virtio/topology-helpers.c b/drivers/iommu/virtio/topology-helpers.c new file mode 100644 index ..8815e3a5d431 --- /dev/null +++ b/drivers/iommu/virtio/topology-helpers.c @@ -0,0 +1,196 @@ +// SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +#include "topology-helpers.h" + +static LIST_HEAD(viommus); +static LIST_HEAD(pci_endpoints); +static LIST_HEAD(mmio_endpoints); +static DEFINE_MUTEX(viommus_lock); + +static bool virt_topo_device_match(struct device *dev, + struct virt_topo_dev_id *id) +{ + if (id->type == VIRT_TOPO_DEV_TYPE_PCI && dev_is_pci(dev)) { +
[virtio-dev] [PATCH v3 6/6] iommu/virtio: Enable x86 support
With the built-in topology description in place, x86 platforms can now use the virtio-iommu. Architectures that use the generic iommu_dma_ops should normally select CONFIG_IOMMU_DMA themselves (from arch/*/Kconfig). Since not all x86 drivers have been converted yet, it's currently up to the IOMMU Kconfig to select it. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 98d28fdbc19a..d7cf158745eb 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -383,8 +383,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU tristate "Virtio IOMMU driver" depends on VIRTIO - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA if X86 select INTERVAL_TREE help Para-virtualised IOMMU driver with virtio. -- 2.28.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH virtio v2] virtio-iommu: Add built-in topology description
Add a simple method to describe the IOMMU topology in the config space, guarded by a new feature bit. A list of capabilities in the config space describes the devices managed by the IOMMU and their endpoint IDs. As outlined in paragraph #1 below, the method used to describe I/O topology depends on the platform. Device-Tree based platforms use the "iommus" and "iommu-map" DT properties, while ACPI based platforms will use the VIOT table (whose format is compatible with this built-in description, but is under discussion). This adds a method for lightweight platforms that don't use either DT or ACPI. Signed-off-by: Jean-Philippe Brucker --- v2: * Precise what \field{length} should contain (Eric) * Rename num_items -> count v1: https://lists.oasis-open.org/archives/virtio-dev/202007/msg00023.html --- virtio-iommu.tex | 102 +++ 1 file changed, 102 insertions(+) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..82254b1 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_TOPOLOGY (6)] + Topology description is available at \field{topo_config}. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -97,6 +100,10 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / le32 end; } domain_range; le32 probe_size; + struct virtio_iommu_topo_config { +le16 count; +le16 offset; + } topo_config; }; \end{lstlisting} @@ -139,6 +146,101 @@ \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Devic If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the device SHOULD NOT let endpoints access the guest-physical address space. +\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU Device / Device initialization / topology} + +The device manages memory accesses from endpoints, identified by endpoint +IDs. The driver can discover which endpoint ID corresponds to an endpoint +using a method specific to the platform. Platforms described with device +tree use the \texttt{iommus} and \texttt{iommu-map} properties embedded +into device nodes for this purpose. Platforms described with ACPI use a +table such as the Virtual I/O Translation Table (VIOT). Platforms that do +not support either device tree or ACPI may embed a minimalistic +description in the device configuration space, declared by the +VIRTIO_IOMMU_F_TOPOLOGY feature. + +A disadvantage of describing the topology from within the device is the +lack of initialization ordering information. Out-of-band descriptions such +as device tree and ACPI let the operating system know about device +dependencies so that it can initialize supplier devices (IOMMUs) before +their consumers (endpoints). Platforms using the VIRTIO_IOMMU_F_TOPOLOGY +feature have to communicate the device dependency in another way. + +If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config} +describes the topology description array. \field{topo_config.offset} is +the offset between the beginning of the device-specific configuration +space (struct virtio_iommu_config) and the array. The array is composed of +\field{topo_config.count} topology structures. A topology structure +defines the endpoint ID of one or more endpoints managed by the +virtio-iommu device. Each structure has a \field{length} field, which +defines the offset from the beginning of the structure to the beginning of +the next one. Each structure starts with a \field{type} byte: + +\begin{description} + \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (1)] struct virtio_iommu_topo_pci_range + \item[VIRTIO_IOMMU_TOPO_MMIO (2)] struct virtio_iommu_topo_mmio +\end{description} + +To stay binary compatible with ACPI VIOT, types 3 and 4 are currently +reserved. + +\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device initialization / topology / PCI range} + +\begin{lstlisting} +struct virtio_iommu_topo_pci_range { + u8 type; + u8 reserved; + le16 length; + le32 endpoint_start; + le16 segment; + le16 bdf_start; + le16 bdf_end; + le16 padding; +}; +\end{lstlisting} + +The PCI range structure describes the endpoint IDs of a series of PCI +devices identified by their BDF (Bus-Device-Function) number. + +\begin{description} + \item[\field{type}] VIRTIO_IOMMU_TOPO_PCI_RANGE. + \item[\field{length}] Length of the structure in bytes. + \item[\field{endpoint_start}] First endpoint ID. + \item[\field{segment}] Identifier of the PCI hierarchy. Sometimes called +domain number. + \item[\field{bdf_start}] First BDF in the range. + \item[\field{bdf_end}] Last BDF in the range. +\end{description} + +The correspondence between a PCI BDF in the range +[ bdf_s
[virtio-dev] Re: [PATCH virtio] virtio-iommu: Add built-in topology description
On Wed, Jul 15, 2020 at 10:52:30AM +0200, Auger Eric wrote: > > +If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config} > > +describes the topology description array. \field{topo_config.offset} is > > +the offset between the beginning of the device-specific configuration > > +space (struct virtio_iommu_config) and the array. The array is composed of > > +\field{topo_config.num_items} topology structures. A topology structure > > +defines the endpoint ID of one or more endpoints managed by the > > +virtio-iommu device. Each structure has a \field{length} field, which > > +defines the offset to the next structure. > > If this is an offset, then you may precise what the offset is relative > to (I guess the beginning of the current topology structure). Ok, changing it to "which defines the offset from the beginning of the structure to the beginning of the next one". > Otherwise > the structure only has fixed size elements so is its length expected to > vary? Length is only for backward compatibility: if a future version of the spec appends new elements to the structure, then the current driver can ignore those elements and jump to the next structure (like argsz in VFIO, and the PROBE property length). Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH virtio] virtio-iommu: Add built-in topology description
Add a simple method to describe the IOMMU topology in the config space, guarded by a new feature bit. A list of capabilities in the config space describes the devices managed by the IOMMU and their endpoint IDs. As outlined in paragraph #1 below, the method used to describe I/O topology depends on the platform. Device-Tree based platforms use the "iommus" and "iommu-map" DT properties, while ACPI based platforms will use the VIOT table (whose format is compatible with this built-in description, but is under discussion). This adds a method for lightweight platforms that don't use either DT or ACPI. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 102 +++ 1 file changed, 102 insertions(+) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..a1de2a8 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_TOPOLOGY (6)] + Topology description is available at \field{topo_config}. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -97,6 +100,10 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / le32 end; } domain_range; le32 probe_size; + struct virtio_iommu_topo_config { +le16 num_items; +le16 offset; + } topo_config; }; \end{lstlisting} @@ -139,6 +146,101 @@ \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Devic If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the device SHOULD NOT let endpoints access the guest-physical address space. +\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU Device / Device initialization / topology} + +The device manages memory accesses from endpoints, identified by endpoint +IDs. The driver can discover which endpoint ID corresponds to an endpoint +using a method specific to the platform. Platforms described with device +tree use the \texttt{iommus} and \texttt{iommu-map} properties embedded +into device nodes for this purpose. Platforms described with ACPI use a +table such as the Virtual I/O Translation Table (VIOT). Platforms that do +not support either device tree or ACPI may embed a minimalistic +description in the device configuration space, declared by the +VIRTIO_IOMMU_F_TOPOLOGY feature. + +A disadvantage of describing the topology from within the device is the +lack of initialization ordering information. Out-of-band descriptions such +as device tree and ACPI let the operating system know about device +dependencies so that it can initialize supplier devices (IOMMUs) before +their consumers (endpoints). Platforms using the VIRTIO_IOMMU_F_TOPOLOGY +feature have to communicate the device dependency in another way. + +If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config} +describes the topology description array. \field{topo_config.offset} is +the offset between the beginning of the device-specific configuration +space (struct virtio_iommu_config) and the array. The array is composed of +\field{topo_config.num_items} topology structures. A topology structure +defines the endpoint ID of one or more endpoints managed by the +virtio-iommu device. Each structure has a \field{length} field, which +defines the offset to the next structure. Each structure starts with a +\field{type} byte: + +\begin{description} + \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (1)] struct virtio_iommu_topo_pci_range + \item[VIRTIO_IOMMU_TOPO_MMIO (2)] struct virtio_iommu_topo_mmio +\end{description} + +To stay binary compatible with ACPI VIOT, types 3 and 4 are currently +reserved. + +\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device initialization / topology / PCI range} + +\begin{lstlisting} +struct virtio_iommu_topo_pci_range { + u8 type; + u8 reserved; + le16 length; + le32 endpoint_start; + le16 segment; + le16 bdf_start; + le16 bdf_end; + le16 padding; +}; +\end{lstlisting} + +The PCI range structure describes the endpoint IDs of a series of PCI +devices identified by their BDF (Bus-Device-Function) number. + +\begin{description} + \item[\field{type}] VIRTIO_IOMMU_TOPO_PCI_RANGE. + \item[\field{length}] Length of the structure in bytes. + \item[\field{endpoint_start}] First endpoint ID. + \item[\field{segment}] Identifier of the PCI hierarchy. Sometimes called +domain number. + \item[\field{bdf_start}] First BDF in the range. + \item[\field{bdf_end}] Last BDF in the range. +\end{description} + +The correspondence between a PCI BDF in the range +[ bdf_start; bdf_end ] and its endpoint IDs is a linear transformation: +endpoint_id = bdf - bdf_start + endpoint_start. + +\paragraph{Single endpoint}\label{sec:Device Types / IOMMU Device / Device initialization
[virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote: [...] > Option 2 - Additional Platform Data > === > > This would be extending using something like device tree or ACPI tables > which could define regions of memory that would inform the low level > memory allocation routines where they could allocate from. There is > already of the concept of "dma-ranges" in device tree which can be a > per-device property which defines the region of space that is DMA > coherent for a device. They are regions that are accessible to a device for DMA, coherency is described through other methods. Thinking more about this, dma-ranges (and ACPI _DMA) don't exactly describe what you need. They describe addressing limitation from a bridge's perspective, for example from the PCI root complex. So there are at least two issues: 1. They apply to the whole downstream bus, so you can't define per-device DMA windows. Although with PCIe I suppose you could put one on each downstream port. 2. More importantly, they only describe addressing limitations locally. When the device directly accesses memory, it emits guest-physical addresses (GPA) so you can use DMA ranges to describe which memory it can access. However, if there is an IOMMU in between, the device emits I/O virtual addresses (IOVA), which are translated by the IOMMU into GPA. In this case the DMA ranges apply to the IOVA, and there doesn't exist a way to describe limitations on the GPA. There are other mechanisms describing addressing limitations such as Intel's RMRR, but those also apply to IOVAs as far as I know. Thanks, Jean > > There is the question of how you tie regions declared here with the > eventual instantiating of the VirtIO devices? > > For a fully distributed set of backends (one backend per device per > worker VM) you would need several different regions. Would each region > be tied to each device or just a set of areas the guest would allocate > from in sequence? - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [PATCH v6] iommu/virtio: Use page size bitmap supported by endpoint
On Thu, May 14, 2020 at 05:31:00AM -0400, Michael S. Tsirkin wrote: > On Thu, May 14, 2020 at 01:22:37PM +0530, Bharat Bhushan wrote: > > Different endpoint can support different page size, probe > > endpoint if it supports specific page size otherwise use > > global page sizes. > > > > Device attached to domain should support a minimum of > > domain supported page sizes. If device supports more > > than domain supported page sizes then device is limited > > to use domain supported page sizes only. > > OK so I am just trying to figure it out. > Before the patch, we always use the domain supported page sizes > right? > > With the patch, we still do, but we also probe and > validate that device supports all domain page sizes, > if it does not then we fail to attach the device. Generally there is one endpoint per domain. Linux creates the domains and decides which endpoint goes in which domain. It puts multiple endpoints in a domain in two cases: * If endpoints cannot be isolated from each others by the IOMMU, for example if ACS isolation isn't enabled in PCIe. In that case endpoints are in the same IOMMU group, and therefore contained in the same domain. This is more of a quirk for broken hardware, and isn't much of a concern for virtualization because it's easy for the hypervisor to present endpoints isolated from each others. * If userspace wants to put endpoints in the same VFIO container, then VFIO first attempts to put them in the same IOMMU domain, and falls back to multiple domains if that fails. That case is just a luxury and we shouldn't over-complicate the driver to cater for this. So the attach checks don't need to be that complicated. Checking that the page masks are exactly the same should be enough. > This seems like a lot of effort for little benefit, can't > hypervisor simply make sure endpoints support the > iommu page sizes for us? I tend to agree, it's not very likely that we'll have a configuration with different page sizes between physical and virtual endpoints. If there is a way for QEMU to simply reject VFIO devices that don't use the same page mask as what's configured globally, let's do that instead of introducing the page_size_mask property. > > @@ -615,7 +636,7 @@ static int viommu_domain_finalise(struct > > viommu_endpoint *vdev, > > struct viommu_dev *viommu = vdev->viommu; > > struct viommu_domain *vdomain = to_viommu_domain(domain); > > > > - viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap); > > + viommu_page_size = 1UL << __ffs(vdev->pgsize_bitmap); > > if (viommu_page_size > PAGE_SIZE) { > > dev_err(vdev->dev, > > "granule 0x%lx larger than system page size 0x%lx\n", > > > Looks like this is messed up on 32 bit: e.g. 0x1 will try to do > 1UL << -1, which is undefined behaviour. Which is btw already messed up > wrt viommu->pgsize_bitmap, but that's not a reason to propagate > the error. Realistically we're not going to have a page granule larger than 4G, it's going to be 4k or 64k. But we can add a check that truncates the page_size_mask to 32-bit and makes sure that it's non-null. > > +struct virtio_iommu_probe_pgsize_mask { > > + struct virtio_iommu_probe_property head; > > + __u8reserved[4]; > > + /* Same format as virtio_iommu_config::page_size_mask */ > > It's actually slightly different in that > this must be a superset of domain page size mask, right? No it overrides the global mask > > + __le64 pgsize_bitmap; Bharat, please rename this to page_size_mask for consistency Thanks, Jean > > +}; > > + > > #define VIRTIO_IOMMU_RESV_MEM_T_RESERVED 0 > > #define VIRTIO_IOMMU_RESV_MEM_T_MSI1 > > > > -- > > 2.17.1 > - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v2] virtio-iommu: Add PAGE_SIZE_MASK property
Add a PROBE property to declare the mapping granularity per endpoint. The virtio-iommu device already has to declare a granule in its config space, but when endpoints are behind different physical IOMMUs (for example some pass-through and some PV endpoints), they may have different mapping granules. This new property allows to override the global page_size_mask for each endpoint. When the property isn't reported for an endpoint, the global page size mask applies. The device checks page alignment during MAP requests. Signed-off-by: Jean-Philippe Brucker --- v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg06035.html v2 clarifies how overriding the global property works. --- virtio-iommu.tex | 38 +- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..05f4fd6 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -216,6 +216,9 @@ \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device op Creating mappings aligned on large page sizes can improve performance since they require fewer page table and TLB entries. + The mask can be overriden for an endpoint by the PAGE_SIZE_MASK PROBE + property. + \item If the VIRTIO_IOMMU_F_DOMAIN_RANGE feature is offered, \field{domain_range} describes the values supported in a \field{domain} field. If the feature is not offered, any \field{domain} value is valid. @@ -664,7 +667,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device ope \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties} \begin{lstlisting} -#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK 2 \end{lstlisting} \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM} @@ -727,6 +731,38 @@ \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions to affect any other component than the endpoint and the driver. +\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The PAGE_SIZE_MASK property overrides the global \field{page_size_mask} +configuration for an endpoint. + +\begin{lstlisting} +struct virtio_iommu_probe_page_size_mask { + struct virtio_iommu_probe_property head; + u8reserved[4]; + le64 page_size_mask; +}; +\end{lstlisting} + +The \field{page_size_mask} field behaves in the same way as the global +\field{page_size_mask} field. The least significant bit describes the +mapping granularity, and additional bits are hints (see \ref{sec:Device +Types / IOMMU Device / Device operations}). When no PAGE_SIZE_MASK +property is available for an endpoint, the global \field{page_size_mask} +applies. + +\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The driver MUST ignore \field{reserved}. + +\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The device SHOULD set \field{reserved} to zero. + +The device MUST set at least one bit in \field{page_size_mask}, describing +the page granularity. The device MAY set more than one bit in +\field{page_size_mask}. + \subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting} The device can report translation faults and other significant -- 2.26.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property
On Fri, Mar 27, 2020 at 12:05:42PM +0100, Auger Eric wrote: > Hi Jean, Bharat, > > On 3/27/20 10:35 AM, Jean-Philippe Brucker wrote: > > On Fri, Mar 27, 2020 at 09:17:43AM +, Bharat Bhushan wrote: > >> > >> Sent again, somehow the email-address got corrupted. > >> > >>> -Original Message- > >>> From: Bharat Bhushan > >>> Sent: Friday, March 27, 2020 2:46 PM > >>> To: @mx0a-0016f401.pphosted.com > >>> Cc: Auger Eric ; virtio-dev@lists.oasis-open.org > >>> Subject: RE: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > >>> PAGE_SIZE_MASK property > >>> > >>> Hi Jean, > >>> > >>>> -Original Message- > >>>> From: Jean-Philippe Brucker > >>>> Sent: Friday, March 27, 2020 2:31 PM > >>>> To: Bharat Bhushan > >>>> Cc: Auger Eric ; > >>>> virtio-dev@lists.oasis-open.org > >>>> Subject: Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > >>>> PAGE_SIZE_MASK property > >>>> > >>>> On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote: > >>>>> Hi Jean, > >>>>> > >>>>>> -Original Message- > >>>>>> From: Auger Eric > >>>>>> Sent: Thursday, March 26, 2020 4:50 PM > >>>>>> To: Jean-Philippe Brucker > >>>>>> Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan > >>>>>> > >>>>>> Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > >>>>>> PAGE_SIZE_MASK property > >>>>>> > >>>>>> External Email > >>>>>> > >>>>>> -- > >>>>>> -- > >>>>>> -- > >>>>>> Hi Jean, > >>>>>> > >>>>>> On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote: > >>>>>>> On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote: > >>>>>>>> Hi Jean, > >>>>>>>> > >>>>>>>> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote: > >>>>>>>>> Add a PROBE property to declare the mapping granularity per > >>>>>>>>> endpoint. > >>>>>>>>> The virtio-iommu device already declares a granule in its > >>>>>>>>> config space, but when endpoints are behind different physical > >>>>>>>>> IOMMUs, they may have different mapping granules. This new > >>>>>>>>> property allows to override the global page_size_mask for each > >>>>>>>>> endpoint. > >>>>>>>>> > >>>>>>>>> In the future it may be useful to describe more than one > >>>>>>>>> page_size_mask for each endpoint, and allow them to negotiate > >>>>>>>>> it during ATTACH. For example two masks could allow the driver > >>>>>>>>> to choose between 4k and 64k granule, along with their > >>>>>>>>> respective block mapping sizes. This could be added by > >>>>>>>>> replacing \field{reserved} with an array > >>>>>> length, for example. > >>>>>>>> Sorry I don't get the use case where several page size bitmaps > >>>>>>>> should be exposed. > >>>>>>> > >>>>>>> For a 4k granule you get block mappings of 2M and 1G. For a 64k > >>>>>>> granule you get 512M and 4T block mappings. If you want to > >>>>>>> communicate both options to the guest, you need two separate > >>>>>>> masks, 0x40201000 and 0x4002001. Then the guest could choose > >>>>>>> one of the granules during attach, if we add a flag to the > >>>>>>> attach request. I'm not suggesting we do that now, just trying > >>>>>>> to make sure it can be extended if anyone actually wants it. > >>>>>>> Personally I don't think it's worth adding, especially given the > >>>>>>> additional work required in > >>>> the host. > >>>>>> OK I get it now. > >>>>> > >>>
[virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property
On Fri, Mar 27, 2020 at 09:17:43AM +, Bharat Bhushan wrote: > > Sent again, somehow the email-address got corrupted. > > > -Original Message- > > From: Bharat Bhushan > > Sent: Friday, March 27, 2020 2:46 PM > > To: @mx0a-0016f401.pphosted.com > > Cc: Auger Eric ; virtio-dev@lists.oasis-open.org > > Subject: RE: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > > PAGE_SIZE_MASK property > > > > Hi Jean, > > > > > -Original Message- > > > From: Jean-Philippe Brucker > > > Sent: Friday, March 27, 2020 2:31 PM > > > To: Bharat Bhushan > > > Cc: Auger Eric ; > > > virtio-dev@lists.oasis-open.org > > > Subject: Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > > > PAGE_SIZE_MASK property > > > > > > On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote: > > > > Hi Jean, > > > > > > > > > -Original Message- > > > > > From: Auger Eric > > > > > Sent: Thursday, March 26, 2020 4:50 PM > > > > > To: Jean-Philippe Brucker > > > > > Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan > > > > > > > > > > Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > > > > > PAGE_SIZE_MASK property > > > > > > > > > > External Email > > > > > > > > > > -- > > > > > -- > > > > > -- > > > > > Hi Jean, > > > > > > > > > > On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote: > > > > > > On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote: > > > > > >> Hi Jean, > > > > > >> > > > > > >> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote: > > > > > >>> Add a PROBE property to declare the mapping granularity per > > > > > >>> endpoint. > > > > > >>> The virtio-iommu device already declares a granule in its > > > > > >>> config space, but when endpoints are behind different physical > > > > > >>> IOMMUs, they may have different mapping granules. This new > > > > > >>> property allows to override the global page_size_mask for each > > > > > >>> endpoint. > > > > > >>> > > > > > >>> In the future it may be useful to describe more than one > > > > > >>> page_size_mask for each endpoint, and allow them to negotiate > > > > > >>> it during ATTACH. For example two masks could allow the driver > > > > > >>> to choose between 4k and 64k granule, along with their > > > > > >>> respective block mapping sizes. This could be added by > > > > > >>> replacing \field{reserved} with an array > > > > > length, for example. > > > > > >> Sorry I don't get the use case where several page size bitmaps > > > > > >> should be exposed. > > > > > > > > > > > > For a 4k granule you get block mappings of 2M and 1G. For a 64k > > > > > > granule you get 512M and 4T block mappings. If you want to > > > > > > communicate both options to the guest, you need two separate > > > > > > masks, 0x40201000 and 0x4002001. Then the guest could choose > > > > > > one of the granules during attach, if we add a flag to the > > > > > > attach request. I'm not suggesting we do that now, just trying > > > > > > to make sure it can be extended if anyone actually wants it. > > > > > > Personally I don't think it's worth adding, especially given the > > > > > > additional work required in > > > the host. > > > > > OK I get it now. > > > > > > > > What some clarification about two page-size-mask configurations > > > > available. > > > > - Global configuration for page-size-mask > > > > - per endpoint page-size-mask configuration > > > > > > > > PAGE_SIZE_MASK probe for and endpoint can return zero or non-zero value. > > > > If it returns non-zero value than it will override the global > > > > configuration. > > > > If PAGE_SIZE_MASK probe for and endpoint return zero value than > > > > global pa
[virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property
On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote: > Hi Jean, > > > -Original Message- > > From: Auger Eric > > Sent: Thursday, March 26, 2020 4:50 PM > > To: Jean-Philippe Brucker > > Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan > > Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add > > PAGE_SIZE_MASK > > property > > > > External Email > > > > ------ > > Hi Jean, > > > > On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote: > > > On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote: > > >> Hi Jean, > > >> > > >> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote: > > >>> Add a PROBE property to declare the mapping granularity per endpoint. > > >>> The virtio-iommu device already declares a granule in its config > > >>> space, but when endpoints are behind different physical IOMMUs, they > > >>> may have different mapping granules. This new property allows to > > >>> override the global page_size_mask for each endpoint. > > >>> > > >>> In the future it may be useful to describe more than one > > >>> page_size_mask for each endpoint, and allow them to negotiate it > > >>> during ATTACH. For example two masks could allow the driver to > > >>> choose between 4k and 64k granule, along with their respective block > > >>> mapping sizes. This could be added by replacing \field{reserved} with > > >>> an array > > length, for example. > > >> Sorry I don't get the use case where several page size bitmaps should > > >> be exposed. > > > > > > For a 4k granule you get block mappings of 2M and 1G. For a 64k > > > granule you get 512M and 4T block mappings. If you want to communicate > > > both options to the guest, you need two separate masks, 0x40201000 and > > > 0x4002001. Then the guest could choose one of the granules during > > > attach, if we add a flag to the attach request. I'm not suggesting we > > > do that now, just trying to make sure it can be extended if anyone > > > actually wants it. Personally I don't think it's worth adding, > > > especially given the additional work required in the host. > > OK I get it now. > > What some clarification about two page-size-mask configurations available. > - Global configuration for page-size-mask > - per endpoint page-size-mask configuration > > PAGE_SIZE_MASK probe for and endpoint can return zero or non-zero value. > If it returns non-zero value than it will override the global configuration. > If PAGE_SIZE_MASK probe for and endpoint return zero value than global > page-size-mask configuration will be used. > > Is that correct? Yes. If a PAGE_SIZE_MASK property is available for an endpoint, the driver should use that mask. Otherwise it should use the global mask, which is always provided. I wonder, should we introduce some form of negotiation now? If the driver doesn't know about the new probe property, it will use the global mask. At some point it will send a MAP request not aligned on the page granule, and the device will abort the request. If instead we add a flag and page mask field to the attach request, the device would know that the driver didn't understand the per-endpoint page mask and abort the attach. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property
On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote: > Hi Jean, > > On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote: > > Add a PROBE property to declare the mapping granularity per endpoint. > > The virtio-iommu device already declares a granule in its config space, > > but when endpoints are behind different physical IOMMUs, they may have > > different mapping granules. This new property allows to override the > > global page_size_mask for each endpoint. > > > > In the future it may be useful to describe more than one page_size_mask > > for each endpoint, and allow them to negotiate it during ATTACH. For > > example two masks could allow the driver to choose between 4k and 64k > > granule, along with their respective block mapping sizes. This could be > > added by replacing \field{reserved} with an array length, for example. > Sorry I don't get the use case where several page size bitmaps should be > exposed. For a 4k granule you get block mappings of 2M and 1G. For a 64k granule you get 512M and 4T block mappings. If you want to communicate both options to the guest, you need two separate masks, 0x40201000 and 0x4002001. Then the guest could choose one of the granules during attach, if we add a flag to the attach request. I'm not suggesting we do that now, just trying to make sure it can be extended if anyone actually wants it. Personally I don't think it's worth adding, especially given the additional work required in the host. > > > > Signed-off-by: Jean-Philippe Brucker > > --- > > I had this change planned for a page-table sharing extension, but it > > seems to be useful for the VFIO support in QEMU as well [1]. > > > > [1] > > https://lore.kernel.org/qemu-devel/3da60c1b-6897-7ab1-3a67-bec44fa00...@redhat.com/ > > --- > > --- > > virtio-iommu.tex | 33 - > > 1 file changed, 32 insertions(+), 1 deletion(-) > > > > diff --git a/virtio-iommu.tex b/virtio-iommu.tex > > index 08b358a..d7be13d 100644 > > --- a/virtio-iommu.tex > > +++ b/virtio-iommu.tex > > @@ -664,7 +664,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / > > IOMMU Device / Device ope > > \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / > > Device operations / PROBE properties} > > > > \begin{lstlisting} > > -#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 > > +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 > > +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK 2 > > \end{lstlisting} > > > > \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / > > Device operations / PROBE properties / RESVMEM} > > @@ -727,6 +728,36 @@ \subsubsection{PROBE properties}\label{sec:Device > > Types / IOMMU Device / Device > > The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions > > to affect any other component than the endpoint and the driver. > > > > +\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device > > / Device operations / PROBE properties / PAGESIZEMASK} > "Property PAGESIZEMASK" to be homegeneous with "Property RESV_MEM"? I don't understand, isn't the underscore version more homogeneous? I only used PAGESIZEMASK and RESVMEM in the label to avoid troubles with latex, which complains about underscores in labels (maybe because of the underscore package). But labels are only used for resolving references and don't appear in the final text. > > + > > +The PAGE_SIZE_MASK property overrides the global \field{page_size_mask} > > +configuration for an endpoint. > > + > > +\begin{lstlisting} > > +struct virtio_iommu_probe_page_size_mask { > > + struct virtio_iommu_probe_property head; > > + u8reserved[4]; > > + le64 page_size_mask; > > +}; > > +\end{lstlisting} > > Maybe it would be useful to put in the spec the context of use of the > new property (basically what you wrote in the commit msg). Also > re-inforce that if no ep page size mask is returned the default one > applies. Chapters describing page_size_mask global field may be sligtly > augmented to mention the new prop... I'll add that if no page size mask property is present, the global page size mask applies. And in Device operations I'll add that these limits are described in the device configuration as well as probe properties. Thanks, Jean > > + > > +The \field{page_size_mask} field behaves in the same way as the global > > +\field{page_size_mask} field. The least significant bit describes the > > +mapping granularity, and additional bits are hints (see \ref{sec:Device > > +Types / IOMMU Device / D
[virtio-dev] [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property
Add a PROBE property to declare the mapping granularity per endpoint. The virtio-iommu device already declares a granule in its config space, but when endpoints are behind different physical IOMMUs, they may have different mapping granules. This new property allows to override the global page_size_mask for each endpoint. In the future it may be useful to describe more than one page_size_mask for each endpoint, and allow them to negotiate it during ATTACH. For example two masks could allow the driver to choose between 4k and 64k granule, along with their respective block mapping sizes. This could be added by replacing \field{reserved} with an array length, for example. Signed-off-by: Jean-Philippe Brucker --- I had this change planned for a page-table sharing extension, but it seems to be useful for the VFIO support in QEMU as well [1]. [1] https://lore.kernel.org/qemu-devel/3da60c1b-6897-7ab1-3a67-bec44fa00...@redhat.com/ --- --- virtio-iommu.tex | 33 - 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 08b358a..d7be13d 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -664,7 +664,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / IOMMU Device / Device ope \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties} \begin{lstlisting} -#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1 +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK 2 \end{lstlisting} \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / RESVMEM} @@ -727,6 +728,36 @@ \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / Device The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions to affect any other component than the endpoint and the driver. +\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The PAGE_SIZE_MASK property overrides the global \field{page_size_mask} +configuration for an endpoint. + +\begin{lstlisting} +struct virtio_iommu_probe_page_size_mask { + struct virtio_iommu_probe_property head; + u8reserved[4]; + le64 page_size_mask; +}; +\end{lstlisting} + +The \field{page_size_mask} field behaves in the same way as the global +\field{page_size_mask} field. The least significant bit describes the +mapping granularity, and additional bits are hints (see \ref{sec:Device +Types / IOMMU Device / Device operations}) + +\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The driver MUST ignore \field{reserved}. + +\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU Device / Device operations / PROBE properties / PAGESIZEMASK} + +The device SHOULD set \field{reserved} to zero. + +The device MUST set at least one bit in \field{page_size_mask}, describing +the page granularity. The device MAY set more than one bit in +\field{page_size_mask}. + \subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device operations / Fault reporting} The device can report translation faults and other significant -- 2.25.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH] virtio-iommu: Remove invalid requirement about padding
This reference to 'padding' is a leftover from a previous draft of the virtio-iommu device. The field doesn't exist anymore, remove the requirement. Signed-off-by: Jean-Philippe Brucker --- I just noticed this was still here, sorry for the delay. Previous discussion: https://lists.oasis-open.org/archives/virtio-dev/201911/msg00084.html --- virtio-iommu.tex | 2 -- 1 file changed, 2 deletions(-) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 28c562b..08b358a 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -106,8 +106,6 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / \devicenormative{\subsubsection}{Device configuration layout}{Device Types / IOMMU Device / Device configuration layout} -The device SHOULD set \field{padding} to zero. - The device MUST set at least one bit in \field{page_size_mask}, describing the page granularity. The device MAY set more than one bit in \field{page_size_mask}. -- 2.25.1 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms
On Tue, Dec 03, 2019 at 07:01:36PM -0800, Jacob Pan (Jun) wrote: > Hi Jean, > > Sorry for the delay, I was out last week. Comments inline below. > > On Mon, 25 Nov 2019 19:02:47 +0100 > Jean-Philippe Brucker wrote: > > > On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote: > > > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD > > > > and IORT for Arm). From my point of view IORT is easier to > > > > extend, since we just need to introduce a new node type. There > > > > are no dependencies to Arm in the Linux IORT driver, so it works > > > > well with CONFIG_X86. > > > From my limited understanding, IORT and VIOT is to solve device > > > topology enumeration only? I am not sure how it can be expanded to > > > cover information beyond device topology. e.g. DMAR has NUMA > > > information and root port ATS, I guess they are not used today in > > > the guest but might be additions in the future. > > > > The PCI root-complex node of IORT has an ATS attribute, which we can > > already use. However its scope is the root complex, not individual > > root ports like with DMAR. > > > > I'm not very familiar with NUMA, but it looks like we just need to > > specify a proximity domain in relation to the SRAT table, for each > > viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain" > > field for this. We can add the same to the VIOT virtio-iommu nodes > > later, since the structures are extensible. > > > I think there the proximity domain is more for each assigned device > than vIOMMU. vIOMMU in the guest can have assigned devices belong to > different pIOMMU and proximity domains. If the guest owns the first > level page tables (gIOVA or SVA), we want to make sure page tables are > allocated from the close proximity domain. > > My understanding is virtio IOMMU supports both virtio devices and > assigned devices. we could care less about the former in terms of NUMA. > > In ACPI, we have _PXM method to retrieve device proximity domain. I > don't know if there is something equivalent or a generic way to get > _PXM information. I think VMM also need to make sure when an assigned > device is used with vIOMMU, there are some memory is allocated from the > device's proximity domain. > > > But it might be better to keep the bare minimum information in the FW > > descriptor, and put the rest in the virtio-iommu. So yes topology > > enumeration is something the device cannot do itself (not fully that > > is, see (2)) but for the rest, virtio-iommu's PROBE request can > > provide details about each endpoint in relation to their physical > > IOMMU. > > > > We could for example add a bit in a PROBE property saying that the > > whole path between the IOMMU and the endpoint supports ATS. For NUMA > > it might also be more interesting to have a finer granularity, since > > one viommu could be managing endpoints that are behind different > > physical IOMMUs. If in the future we want to allocate page tables > > close to the physical IOMMU for example, we might need to describe > > multiple NUMA nodes per viommu, using the PROBE request. > > > Should we reinvent something for NUMA or use ACPI's SRAT, _PXM? Regardless whether we put it in the VIOT or in the virtio-iommu PROBE request, we necessarily need to reuse the node IDs defined by SRAT (or numa-node-id on devicetree, also a 32-bit value). A virtio-pci based virtio-iommu already has the _PXM of its closest bridge and wouldn't need anything more in the VIOT, while a virtio-mmio based virtio-iommu would need a proximity domain field in the VIOT. That could be added later since the table is extensible, but as you pointed out, that information might not be very useful. > I am not sure how it is handled today in QEMU in terms of guest-host > NUMA proximity domain mapping. It looks like the user can specify this guest-host mapping on the command-line: -object memory-backend-ram,id=mem0,size=4G,host-nodes=3,policy=bind -object memory-backend-ram,id=mem1,size=4G,host-nodes=4,policy=bind -numa node,memdev=mem0,nodeid=numa0 -numa node,memdev=mem1,nodeid=numa1 -numa cpu,node-id=numa0,socket-id=0 -numa cpu,node-id=numa1,socket-id=1 numa0 and numa1 would get proximity domains 0 and 1, corresponding to host domains 3 and 4. It is also possible to specify the NUMA node of a PCI bus (via the PCI expander bridge), and therefore to assign a VFIO PCI device in the same proximity domain as its physical location. -device pxb,id=bridge1,bus=pci.0,numa_node=1 (simplified) -device vfio-pci,host=03:01.0,bus=bridge1 Linux can use this information to allocate DMA close to the endp
[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms
On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote: > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD and > > IORT for Arm). From my point of view IORT is easier to extend, since > > we just need to introduce a new node type. There are no dependencies > > to Arm in the Linux IORT driver, so it works well with CONFIG_X86. > > > From my limited understanding, IORT and VIOT is to solve device topology > enumeration only? I am not sure how it can be expanded to cover > information beyond device topology. e.g. DMAR has NUMA information and > root port ATS, I guess they are not used today in the guest but might > be additions in the future. The PCI root-complex node of IORT has an ATS attribute, which we can already use. However its scope is the root complex, not individual root ports like with DMAR. I'm not very familiar with NUMA, but it looks like we just need to specify a proximity domain in relation to the SRAT table, for each viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain" field for this. We can add the same to the VIOT virtio-iommu nodes later, since the structures are extensible. But it might be better to keep the bare minimum information in the FW descriptor, and put the rest in the virtio-iommu. So yes topology enumeration is something the device cannot do itself (not fully that is, see (2)) but for the rest, virtio-iommu's PROBE request can provide details about each endpoint in relation to their physical IOMMU. We could for example add a bit in a PROBE property saying that the whole path between the IOMMU and the endpoint supports ATS. For NUMA it might also be more interesting to have a finer granularity, since one viommu could be managing endpoints that are behind different physical IOMMUs. If in the future we want to allocate page tables close to the physical IOMMU for example, we might need to describe multiple NUMA nodes per viommu, using the PROBE request. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms
On Fri, Nov 22, 2019 at 08:00:46AM -0500, Michael S. Tsirkin wrote: > > (2) In addition, there are some concerns about having virtio depend on > > ACPI or DT. Some hypervisors (Firecracker, QEMU microvm, kvmtool x86 > > [1]) > > power? In kvmtool it boot with device tree. It also doesn't need virtio-iommu I think, since it has its own paravirtualized interface. > > don't currently implement those methods. > > > > It was suggested to embed the topology description into the device. > > It can work, as demonstrated at the end of this RFC, with the > > following limitations: > > > > - The topology description must be read before any endpoint managed > > by the IOMMU is probed, and even before the virtio module is > > loaded. This RFC uses a PCI quirk to manually parse the virtio > > configuration. It assumes that all endpoints managed by the IOMMU > > are under this same PCI host. > > > > - I don't have a solution for the virtio-mmio transport at the > > moment, because I haven't had time to modify a host to test it. I > > think it could either use a notifier on the platform bus, or > > better, a new 'iommu' command-line argument to the virtio-mmio > > driver. > > A notifier seems easier for users. What are the disadvantages of > that? For each device we have to check if it's virtio-mmio, then map the MMIO resource and check the device type. Having a dedicated command-line argument would be more efficient. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification
Hi Jan, On Mon, Nov 25, 2019 at 08:30:29AM +0100, Jan Kiszka wrote: > What's the impact of a fault on the device(s) under the IOMMU regime? Can > they recover? Are you asking about what happens to the endpoints when the virtio-iommu encounters an internal error? Or what happens to the endpoints if their DMA transactions fails translation? I think they are both equivalent to "what happens when the endpoint's memory transaction aborts?". The answer to that depends on the bus and endpoint, and is out of scope. The virtio-iommu spec could state that in those cases, we abort the memory transaction, but it's too vague since we don't know the specifics of the bus, and it isn't necessarily true (see VT-d and SMMU below). > Or will they get DEVICE_NEEDS_RESET as well? If the endpoint is virtio, then the behavior upon DMA fault should be specified by the virtio transport, because it could happen without an IOMMU (e.g. trying to access a physical address that isn't mapped to RAM or MMIO), or with a VT-d emulation for example. But it's not necessarily virtio. It can be a hardware passed-through endpoint, in which case the abort behavior depends on the physical IOMMU, which virtio-iommu doesn't know anything about, in addition to the physical bus and endpoint. I also wouldn't state that the whole device (or function, though we're not necessarily PCI) needs reset. It might be possible for some devices to only stop the faulting queue and leave the others running, to avoid disturbing the rest of the system. > With PCI device > behind real IOMMUs, it's normal that they need a reset after having caused a > fault. I'm not sure if this is described in the related specs for them, but > it should be clarify for the virtual IOMMU. But this can be done on top, > IMHO. The device behaviour is generally not specified. However their spec can say something about the bus: * For Intel VT-d see 7.2 and 7.2.1 (Non-Recoverable Address Translation Faults), where the spec provides various implementation examples. "Requests that encounter non-recoverable address translation faults are aborted by the remapping hardware, and typically require a reset of the device (such as through a function-level-reset) to recover and re-initialize the device to put it back into service." So could be aborted, but as stated later in 7.2.1, can also be redirected to a catch-all memory location. * For Arm SMMU, the host driver can specify for each context whether the SMMU should return an abort (Slave Error on the AMBA bus) or not (read-zero, write-ignore). The spec also says "The behavior of the client device after termination is specific to the device." (3.12.1 Terminate model) * For AMD IOMMU, "when the IOMMU detects an I/O page fault, it target aborts the faulting request." and "the IOMMU sets the legacy PCI Signaled Target Abort bit, if applicable" (2.1.3.2 I/O Page Faults). I believe the equivalent for the PCIe bus is a Completer Abort response. They can specify the behaviour with some precision, because they also specify how the IOMMU is integrated with the system. We don't have this luxury, because if the virtio-iommu is just a proxy for a physical IOMMU, we don't know how aborts are configured, and the bus may be a variant of PCI, AMBA or something else. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 07/13] ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode
When the IOMMU is PCI-based, IORT doesn't know the fwnode until the driver has had a chance to register it. In addition to deferring the probe until the IOMMU ops are set, also defer the probe until the fwspec is available. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 54 ++--- 1 file changed, 31 insertions(+), 23 deletions(-) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index b517aa4e83ba..f08f72d8af78 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -61,6 +61,22 @@ static bool iort_type_matches(u8 type, enum iort_node_category category) } } +static inline bool iort_iommu_driver_enabled(u8 type) +{ + switch (type) { + case ACPI_IORT_NODE_SMMU_V3: + return IS_BUILTIN(CONFIG_ARM_SMMU_V3); + case ACPI_IORT_NODE_SMMU: + return IS_BUILTIN(CONFIG_ARM_SMMU); + case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU: + case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU: + return IS_ENABLED(CONFIG_VIRTIO_IOMMU); + default: + pr_warn("IORT node type %u does not describe an IOMMU\n", type); + return false; + } +} + /** * iort_set_fwnode() - Create iort_fwnode and use it to register *iommu data in the iort_fwnode_list @@ -102,9 +118,9 @@ static inline int iort_set_fwnode(struct acpi_iort_node *iort_node, * * Returns: fwnode_handle pointer on success, NULL on failure */ -static inline struct fwnode_handle *iort_get_fwnode( - struct acpi_iort_node *node) +static inline struct fwnode_handle *iort_get_fwnode(struct acpi_iort_node *node) { + int err = -ENODEV; struct iort_fwnode *curr; struct fwnode_handle *fwnode = NULL; @@ -112,12 +128,20 @@ static inline struct fwnode_handle *iort_get_fwnode( list_for_each_entry(curr, _fwnode_list, list) { if (curr->iort_node == node) { fwnode = curr->fwnode; + if (!fwnode && curr->pci_devid) { + /* +* Postpone probe until virtio-iommu has +* registered its fwnode. +*/ + err = iort_iommu_driver_enabled(node->type) ? + -EPROBE_DEFER : -ENODEV; + } break; } } spin_unlock(_fwnode_lock); - return fwnode; + return fwnode ?: ERR_PTR(err); } /** @@ -874,22 +898,6 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) return (resv == its->its_count) ? resv : -ENODEV; } -static inline bool iort_iommu_driver_enabled(u8 type) -{ - switch (type) { - case ACPI_IORT_NODE_SMMU_V3: - return IS_BUILTIN(CONFIG_ARM_SMMU_V3); - case ACPI_IORT_NODE_SMMU: - return IS_BUILTIN(CONFIG_ARM_SMMU); - case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU: - case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU: - return IS_ENABLED(CONFIG_VIRTIO_IOMMU); - default: - pr_warn("IORT node type %u does not describe an IOMMU\n", type); - return false; - } -} - static int arm_smmu_iort_xlate(struct device *dev, u32 streamid, struct fwnode_handle *fwnode, const struct iommu_ops *ops) @@ -920,8 +928,8 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node, return -ENODEV; iort_fwnode = iort_get_fwnode(node); - if (!iort_fwnode) - return -ENODEV; + if (IS_ERR(iort_fwnode)) + return PTR_ERR(iort_fwnode); /* * If the ops look-up fails, this means that either @@ -1618,8 +1626,8 @@ static int __init iort_add_platform_device(struct acpi_iort_node *node, fwnode = iort_get_fwnode(node); - if (!fwnode) { - ret = -ENODEV; + if (IS_ERR(fwnode)) { + ret = PTR_ERR(fwnode); goto dev_put; } -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 06/13] ACPI/IORT: Support VIOT virtio-pci node
When virtio-iommu uses the PCI transport, IORT doesn't instantiate the device and doesn't create a fwnode. They will be created later by the PCI subsystem. Store the information needed to identify the IOMMU in iort_fwnode_list. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 117 +++- 1 file changed, 93 insertions(+), 24 deletions(-) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index adc5953fffa5..b517aa4e83ba 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -30,10 +30,17 @@ struct iort_its_msi_chip { u32 translation_id; }; +struct iort_pci_devid { + u16 segment; + u8 bus; + u8 devfn; +}; + struct iort_fwnode { struct list_head list; struct acpi_iort_node *iort_node; struct fwnode_handle *fwnode; + struct iort_pci_devid *pci_devid; }; static LIST_HEAD(iort_fwnode_list); static DEFINE_SPINLOCK(iort_fwnode_lock); @@ -44,7 +51,8 @@ static bool iort_type_matches(u8 type, enum iort_node_category category) case IORT_IOMMU_TYPE: return type == ACPI_IORT_NODE_SMMU || type == ACPI_IORT_NODE_SMMU_V3 || - type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU; + type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU || + type == ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU; case IORT_MSI_TYPE: return type == ACPI_IORT_NODE_ITS_GROUP; default: @@ -59,12 +67,14 @@ static bool iort_type_matches(u8 type, enum iort_node_category category) * * @node: IORT table node associated with the IOMMU * @fwnode: fwnode associated with the IORT node + * @pci_devid: pci device ID associated with the IORT node, may be NULL * * Returns: 0 on success * <0 on failure */ static inline int iort_set_fwnode(struct acpi_iort_node *iort_node, - struct fwnode_handle *fwnode) + struct fwnode_handle *fwnode, + struct iort_pci_devid *pci_devid) { struct iort_fwnode *np; @@ -76,6 +86,7 @@ static inline int iort_set_fwnode(struct acpi_iort_node *iort_node, INIT_LIST_HEAD(>list); np->iort_node = iort_node; np->fwnode = fwnode; + np->pci_devid = pci_devid; spin_lock(_fwnode_lock); list_add_tail(>list, _fwnode_list); @@ -121,6 +132,7 @@ static inline void iort_delete_fwnode(struct acpi_iort_node *node) spin_lock(_fwnode_lock); list_for_each_entry_safe(curr, tmp, _fwnode_list, list) { if (curr->iort_node == node) { + kfree(curr->pci_devid); list_del(>list); kfree(curr); break; @@ -870,6 +882,7 @@ static inline bool iort_iommu_driver_enabled(u8 type) case ACPI_IORT_NODE_SMMU: return IS_BUILTIN(CONFIG_ARM_SMMU); case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU: + case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU: return IS_ENABLED(CONFIG_VIRTIO_IOMMU); default: pr_warn("IORT node type %u does not describe an IOMMU\n", type); @@ -1451,6 +1464,28 @@ static void __init viommu_mmio_dma_configure(struct device *dev, acpi_dma_configure(dev, attr); } +static __init struct iort_pci_devid * +viommu_pci_get_devid(struct acpi_iort_node *node) +{ + unsigned int val; + struct iort_pci_devid *devid; + struct acpi_viot_iort_virtio_pci_iommu *viommu; + + viommu = (struct acpi_viot_iort_virtio_pci_iommu *)node->node_data; + + val = le32_to_cpu(viommu->devid); + + devid = kzalloc(sizeof(*devid), GFP_KERNEL); + if (!devid) + return ERR_PTR(-ENOMEM); + + devid->segment = val >> 16; + devid->bus = PCI_BUS_NUM(val); + devid->devfn = val & 0xff; + + return devid; +} + struct iort_dev_config { const char *name; int (*dev_init)(struct acpi_iort_node *node); @@ -1462,6 +1497,7 @@ struct iort_dev_config { int (*dev_set_proximity)(struct device *dev, struct acpi_iort_node *node); int (*dev_add_platdata)(struct platform_device *pdev); + struct iort_pci_devid *(*dev_get_pci_devid)(struct acpi_iort_node *node); }; static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = { @@ -1494,6 +1530,10 @@ static const struct iort_dev_config iort_viommu_mmio_cfg __initconst = { .dev_init_resources = viommu_mmio_init_resources, }; +static const struct iort_dev_config iort_viommu_pci_cfg __initconst = { + .dev_get_pci_devid = viommu_pci_get_devid, +}; + static __init const struct iort_dev_config *iort_get_dev_cfg( struct acpi_iort_node *node) { @@ -1510,6 +1550,8 @@
[virtio-dev] [RFC 13/13] iommu/virtio: Add topology description to
Some hypervisors don't implement either device-tree or ACPI, but still need a method to describe the IOMMU topology. Read the virtio-iommu config early and parse the topology description. Hook into the dma_setup() callbacks to initialize the IOMMU before probing endpoints. If the virtio-iommu uses the virtio-pci transport, this will only work if the PCI root complex is the first device probed. We don't currently support virtio-mmio. Initially I tried to generate a fake IORT table and feed it to the IORT driver, in order to avoid rewriting the whole DMA code, but it wouldn't work with platform endpoints, which are references to items in the ACPI table on IORT. Signed-off-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- Note that we only call virt_dma_configure() if the host didn't provide either DT or ACPI method. If you want to test this with QEMU, you'll need to manually disable the acpi_dma_configure() part in pci-driver.c --- drivers/base/platform.c | 3 + drivers/iommu/Kconfig | 9 + drivers/iommu/Makefile| 1 + drivers/iommu/virtio-iommu-topology.c | 410 ++ drivers/iommu/virtio-iommu.c | 3 + drivers/pci/pci-driver.c | 3 + include/linux/virtio_iommu.h | 18 ++ include/uapi/linux/virtio_iommu.h | 26 ++ 8 files changed, 473 insertions(+) create mode 100644 drivers/iommu/virtio-iommu-topology.c create mode 100644 include/linux/virtio_iommu.h diff --git a/drivers/base/platform.c b/drivers/base/platform.c index b230beb6ccb4..70b12c8ef2fb 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "base.h" #include "power/power.h" @@ -1257,6 +1258,8 @@ int platform_dma_configure(struct device *dev) } else if (has_acpi_companion(dev)) { attr = acpi_get_dma_attr(to_acpi_device_node(dev->fwnode)); ret = acpi_dma_configure(dev, attr); + } else if (IS_ENABLED(CONFIG_VIRTIO_IOMMU_TOPOLOGY)) { + ret = virt_dma_configure(dev); } return ret; diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index e6eb4f238d1a..d02c0d36019d 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -486,4 +486,13 @@ config VIRTIO_IOMMU Say Y here if you intend to run this kernel as a guest. +config VIRTIO_IOMMU_TOPOLOGY + bool "Topology properties for the virtio-iommu" + depends on VIRTIO_IOMMU + help + Enable early probing of the virtio-iommu device, to detect the + topology description. + + Say Y here if you intend to run this kernel as a guest. + endif # IOMMU_SUPPORT diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 4f405f926e73..6b51c4186ebc 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -35,3 +35,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o +obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += virtio-iommu-topology.o diff --git a/drivers/iommu/virtio-iommu-topology.c b/drivers/iommu/virtio-iommu-topology.c new file mode 100644 index ..ec22510ace3d --- /dev/null +++ b/drivers/iommu/virtio-iommu-topology.c @@ -0,0 +1,410 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct viommu_cap_config { + u8 pos; /* PCI capability position */ + u8 bar; + u32 length; /* structure size */ + u32 offset; /* structure offset within the bar */ +}; + +struct viommu_spec { + struct device *dev; /* transport device */ + struct fwnode_handle*fwnode; + struct iommu_ops*ops; + struct list_headtopology; + struct list_headlist; +}; + +struct viommu_topology { + union { + struct virtio_iommu_topo_head head; + struct virtio_iommu_topo_pci_range pci; + struct virtio_iommu_topo_endpoint ep; + }; + /* Index into viommu_spec->topology */ + struct list_head list; +}; + +static LIST_HEAD(viommus); +static DEFINE_MUTEX(viommus_lock); + +#define VPCI_FIELD(field) offsetof(struct virtio_pci_cap, field) + +static inline int viommu_find_capability(struct pci_dev *dev, u8 cfg_type, +struct viommu_cap_config *cap) +{ + int pos; + u8 bar; + + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); +pos > 0; +pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) { + u8 type; + + pci_read_config_byte(dev, pos + VPCI_FIELD(cfg_type), ); + if (type != cfg_type) + continue; + +
[virtio-dev] [RFC 08/13] ACPI/IORT: Add callback to update a device's fwnode
For a PCI-based IOMMU, IORT isn't in charge of allocating a fwnode. Let the IOMMU driver update the fwnode associated to an IORT node when available. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 38 ++ include/linux/acpi_iort.h | 4 2 files changed, 42 insertions(+) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index f08f72d8af78..8263ab275b2b 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -1038,11 +1038,49 @@ const struct iommu_ops *iort_iommu_configure(struct device *dev) return ops; } + +/** + * iort_iommu_update_fwnode - update fwnode of a PCI IOMMU + * @dev: the IOMMU device + * @fwnode: the fwnode, or NULL to remove an existing fwnode + * + * A PCI device isn't instantiated by the IORT driver. The IOMMU driver sets or + * removes its fwnode using this function. + */ +void iort_iommu_update_fwnode(struct device *dev, struct fwnode_handle *fwnode) +{ + struct pci_dev *pdev; + struct iort_fwnode *curr; + struct iort_pci_devid *devid; + + if (!dev_is_pci(dev)) + return; + + pdev = to_pci_dev(dev); + + spin_lock(_fwnode_lock); + list_for_each_entry(curr, _fwnode_list, list) { + devid = curr->pci_devid; + if (devid && + pci_domain_nr(pdev->bus) == devid->segment && + pdev->bus->number == devid->bus && + pdev->devfn == devid->devfn) { + WARN_ON(fwnode && curr->fwnode); + curr->fwnode = fwnode; + break; + } + } + spin_unlock(_fwnode_lock); +} +EXPORT_SYMBOL_GPL(iort_iommu_update_fwnode); #else int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) { return 0; } const struct iommu_ops *iort_iommu_configure(struct device *dev) { return NULL; } +static void iort_iommu_update_fwnode(struct device *dev, +struct fwnode_handle *fwnode) +{ } #endif static int nc_dma_get_range(struct device *dev, u64 *size) diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index f4db5fff07cf..840635e40d9d 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -43,6 +43,7 @@ int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id); void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size); const struct iommu_ops *iort_iommu_configure(struct device *dev); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); +void iort_iommu_update_fwnode(struct device *dev, struct fwnode_handle *fwnode); #else static void acpi_iort_register_table(struct acpi_table_header *table, enum iort_table_source source) @@ -63,6 +64,9 @@ static inline const struct iommu_ops *iort_iommu_configure( static inline int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head) { return 0; } +static void iort_iommu_update_fwnode(struct device *dev, +struct fwnode_handle *fwnode) +{ } #endif #endif /* __ACPI_IORT_H__ */ -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC virtio 12/13] virtio-iommu: Add built-in topology description
Add a lightweight method to describe the IOMMU topology in the config space, guarded by a new feature bit. A list of capabilities in the config space describes the devices managed by the IOMMU and their endpoint IDs. Signed-off-by: Jean-Philippe Brucker --- virtio-iommu.tex | 88 1 file changed, 88 insertions(+) diff --git a/virtio-iommu.tex b/virtio-iommu.tex index 28c562b..2b29873 100644 --- a/virtio-iommu.tex +++ b/virtio-iommu.tex @@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits} \item[VIRTIO_IOMMU_F_MMIO (5)] The VIRTIO_IOMMU_MAP_F_MMIO flag is available. + +\item[VIRTIO_IOMMU_F_TOPOLOGY (6)] + Topology description is available at \field{topo_offset}. \end{description} \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / Feature bits} @@ -97,6 +100,7 @@ \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device / le32 end; } domain_range; le32 probe_size; + le16 topo_offset; }; \end{lstlisting} @@ -141,6 +145,90 @@ \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / Devic If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the device SHOULD NOT let endpoints access the guest-physical address space. +\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU Device / Device initialization / topology} + +The device manages memory accesses from endpoints, identified by endpoint +IDs. The driver can discover which endpoint ID corresponds to an endpoint +using several methods, depending on the platform. Platforms described +with device tree use the \texttt{iommus} and \texttt{iommu-map} properties +embedded into device nodes for this purpose. Platforms described with +ACPI use a table such as the Virtual I/O Table. Platforms that do not +support either device tree or ACPI may embed a minimalistic description +in the device configuration space. + +An important disadvantage of describing the topology from within the +device is the lack of initialization ordering information. Out-of-band +descriptions such as device tree and ACPI let the operating system know +about device dependencies so that it can initialize supplier devices +(IOMMUs) before their consumers (endpoints). Platforms using the +VIRTIO_IOMMU_F_TOPOLOGY feature have to communicate the device dependency +in another way. + +If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_offset} +is the offset between the beginning of the device-specific configuration +space (virtio_iommu_config) and the first topology structure header. A +topology structures defines the endpoint ID of one or more endpoints +managed by the virtio-iommu device. + +\begin{lstlisting} +struct virtio_iommu_topo_head { + le16 type; + le16 next; +}; +\end{lstlisting} + +\field{next} is the offset between the beginning of the device-specific +configuration space and the next topology structure header. When +\field{next} is zero, this is the last structure. + +\field{type} describes the type of structure: +\begin{description} + \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (0)] struct virtio_iommu_topo_pci_range + \item[VIRTIO_IOMMU_TOPO_ENDPOINT (1)] struct virtio_iommu_topo_endpoint +\end{description} + +\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device initialization / topology / PCI range} + +\begin{lstlisting} +struct virtio_iommu_topo_pci_range { + struct virtio_iommu_topo_head head; + le32 endpoint_start; + le16 hierarchy; + le16 requester_start; + le16 requester_end; + le16 reserved; +}; +\end{lstlisting} + +The PCI range structure describes the endpoint IDs of a series of PCI +devices. + +\begin{description} + \item[\field{hierarchy}] Identifier of the PCI hierarchy. Sometimes +called PCI segment or domain number. + \item[\field{requester_start}] First requester ID in the range. + \item[\field{requester_end}] Last requester ID in the range. + \item[\field{endpoint_start}] First endpoint ID. +\end{description} + +The correspondence between a PCI requester ID in the range +[ requester_start; requester_end ] and its endpoint IDs is a linear +transformation: endpoint_id = requester_id - requester_start + +endpoint_start. + +\paragraph{Single endpoint}\label{sec:Device Types / IOMMU Device / Device initialization / topology / Single endpoint} + +\begin{lstlisting} +struct virtio_iommu_topo_endpoint { + struct virtio_iommu_topo_head head; + le32 endpoint; + le64 address; +}; +\end{lstlisting} + +\field{endpoint} is the ID of a single endpoint, identified by its first +MMIO address in the physical address space. + \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device operations} Driver send requests on the request virtqueue, notifies the device and -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr
[virtio-dev] [RFC 09/13] iommu/virtio: Create fwnode if necessary
The presence of a fwnode on a PCI device depends on the platform. QEMU q35, for example, creates an ACPI description for each PCI slot, but QEMU virt (aarch64) doesn't. Since the IOMMU subsystem relies heavily on fwnode to discover the DMA topology, create a fwnode for the virtio-iommu if necessary, using the software_node framework. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 56 1 file changed, 51 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 3ea9d7682999..8efa368134c0 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -966,6 +966,48 @@ static struct iommu_ops viommu_ops = { .of_xlate = viommu_of_xlate, }; +static int viommu_set_fwnode(struct viommu_dev *viommu) +{ + /* +* viommu->dev is the virtio device, its parent is the associated +* transport device. +*/ + struct device *dev = viommu->dev->parent; + + /* +* With device tree a fwnode is always present. With ACPI, on some +* platforms a PCI device has a DSDT node describing the slot. On other +* platforms, no fwnode is created and we have to do it ourselves. +*/ + if (!dev->fwnode) { + struct fwnode_handle *fwnode; + + fwnode = fwnode_create_software_node(NULL, NULL); + if (IS_ERR(fwnode)) + return PTR_ERR(fwnode); + + set_primary_fwnode(dev, fwnode); + } + + iommu_device_set_fwnode(>iommu, dev->fwnode); + return 0; +} + +static void viommu_clear_fwnode(struct viommu_dev *viommu) +{ + struct device *dev = viommu->dev->parent; + + if (!dev->fwnode) + return; + + if (is_software_node(dev->fwnode)) { + struct fwnode_handle *fwnode = dev->fwnode; + + set_primary_fwnode(dev, NULL); + fwnode_remove_software_node(fwnode); + } +} + static int viommu_init_vqs(struct viommu_dev *viommu) { struct virtio_device *vdev = dev_to_virtio(viommu->dev); @@ -1004,7 +1046,6 @@ static int viommu_fill_evtq(struct viommu_dev *viommu) static int viommu_probe(struct virtio_device *vdev) { - struct device *parent_dev = vdev->dev.parent; struct viommu_dev *viommu = NULL; struct device *dev = >dev; u64 input_start = 0; @@ -1084,9 +1125,11 @@ static int viommu_probe(struct virtio_device *vdev) if (ret) goto err_free_vqs; - iommu_device_set_ops(>iommu, _ops); - iommu_device_set_fwnode(>iommu, parent_dev->fwnode); + ret = viommu_set_fwnode(viommu); + if (ret) + goto err_sysfs_remove; + iommu_device_set_ops(>iommu, _ops); iommu_device_register(>iommu); #ifdef CONFIG_PCI @@ -1119,8 +1162,10 @@ static int viommu_probe(struct virtio_device *vdev) return 0; err_unregister: - iommu_device_sysfs_remove(>iommu); iommu_device_unregister(>iommu); + viommu_clear_fwnode(viommu); +err_sysfs_remove: + iommu_device_sysfs_remove(>iommu); err_free_vqs: vdev->config->del_vqs(vdev); @@ -1131,8 +1176,9 @@ static void viommu_remove(struct virtio_device *vdev) { struct viommu_dev *viommu = vdev->priv; - iommu_device_sysfs_remove(>iommu); iommu_device_unregister(>iommu); + viommu_clear_fwnode(viommu); + iommu_device_sysfs_remove(>iommu); /* Stop all virtqueues */ vdev->config->reset(vdev); -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 11/13] ACPI: Add VIOT table
Add support for a new ACPI table that embeds other tables describing a platform's IOMMU topology. Currently the only supported base table is IORT. The VIOT contains an IORT with additional node types, that describe a virtio-iommu. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/Kconfig | 4 drivers/acpi/Makefile | 1 + drivers/acpi/bus.c| 2 ++ drivers/acpi/tables.c | 2 +- drivers/acpi/viot.c | 44 +++ drivers/iommu/Kconfig | 1 + include/linux/acpi_viot.h | 20 ++ 7 files changed, 73 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/viot.c create mode 100644 include/linux/acpi_viot.h diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 548976c8b2b0..513a5e4d3526 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -579,6 +579,10 @@ config TPS68470_PMIC_OPREGION config ACPI_IORT bool +config ACPI_VIOT + bool + select ACPI_IORT + endif # ACPI config X86_PM_TIMER diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 9d1792165713..6abdc6cc32c7 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -124,3 +124,4 @@ obj-y += dptf/ obj-$(CONFIG_ARM64)+= arm64/ obj-$(CONFIG_ACPI_IORT)+= iort.o +obj-$(CONFIG_ACPI_VIOT)+= viot.o diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index 48bc96d45bab..6f364e0c9240 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -25,6 +25,7 @@ #include #endif #include +#include #include #include #include @@ -1246,6 +1247,7 @@ static int __init acpi_init(void) pci_mmcfg_late_init(); acpi_iort_init(); + acpi_viot_init(); acpi_scan_init(); acpi_ec_init(); acpi_debugfs_init(); diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index 180ac4329763..9662ea5e1064 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -501,7 +501,7 @@ static const char * const table_sigs[] = { ACPI_SIG_WDDT, ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT, ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, ACPI_SIG_IORT, ACPI_SIG_NFIT, ACPI_SIG_HMAT, ACPI_SIG_PPTT, - NULL }; + ACPI_SIG_VIOT, NULL }; #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header) diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c new file mode 100644 index ..ab9a6e43ad9b --- /dev/null +++ b/drivers/acpi/viot.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2019 Linaro + * + * Virtual IOMMU table + */ +#define pr_fmt(fmt)"ACPI: VIOT: " fmt + +#include +#include +#include + +int __init acpi_viot_init(void) +{ + struct acpi_table_viot *viot; + struct acpi_table_header *acpi_header; + acpi_status status; + + status = acpi_get_table(ACPI_SIG_VIOT, 0, _header); + if (ACPI_FAILURE(status)) { + if (status != AE_NOT_FOUND) { + const char *msg = acpi_format_exception(status); + + pr_err("Failed to get table, %s\n", msg); + return -EINVAL; + } + + return 0; + } + + if (acpi_header->length < sizeof(*viot)) { + pr_err("VIOT table overflow, bad table!\n"); + return -EINVAL; + } + + viot = (struct acpi_table_viot *)acpi_header; + if (ACPI_COMPARE_NAMESEG(viot->base_table.signature, ACPI_SIG_IORT)) { + acpi_iort_register_table(>base_table, IORT_SOURCE_VIOT); + return 0; + } + + pr_err("Unknown base table header\n"); + return -EINVAL; +} diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index e3842eabcfdd..e6eb4f238d1a 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -480,6 +480,7 @@ config VIRTIO_IOMMU depends on ARM64 select IOMMU_API select INTERVAL_TREE + select ACPI_VIOT if ACPI help Para-virtualised IOMMU driver with virtio. diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h new file mode 100644 index ..6c282d5eb793 --- /dev/null +++ b/include/linux/acpi_viot.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2019 Linaro + */ + +#ifndef __ACPI_VIOT_H__ +#define __ACPI_VIOT_H__ + +#ifdef CONFIG_ACPI_VIOT + +int acpi_viot_init(void); + +#else /* !CONFIG_ACPI_VIOT */ + +static inline int acpi_viot_init(void) +{} + +#endif /* !CONFIG_ACPI_VIOT */ + +#endif /* __ACPI_VIOT_H__ */ -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 03/13] ACPI/IORT: Allow registration of external tables
Add a function to register an IORT table from an external source. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 22 -- include/linux/acpi_iort.h | 10 ++ 2 files changed, 30 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index d62a9ea26fae..9c6c91e06f8f 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -144,6 +144,7 @@ typedef acpi_status (*iort_find_node_callback) /* Root pointer to the mapped IORT table */ static struct acpi_table_header *iort_table; +static enum iort_table_source iort_table_source; static LIST_HEAD(iort_msi_chip_list); static DEFINE_SPINLOCK(iort_msi_chip_lock); @@ -1617,11 +1618,28 @@ static void __init iort_init_platform_devices(void) } } +void __init acpi_iort_register_table(struct acpi_table_header *table, +enum iort_table_source source) +{ + /* +* Firmware or hypervisor should know better than give us two IORT +* tables. +*/ + if (WARN_ON(iort_table)) + return; + + iort_table = table; + iort_table_source = source; + + iort_init_platform_devices(); +} + void __init acpi_iort_init(void) { acpi_status status; + static struct acpi_table_header *table; - status = acpi_get_table(ACPI_SIG_IORT, 0, _table); + status = acpi_get_table(ACPI_SIG_IORT, 0, ); if (ACPI_FAILURE(status)) { if (status != AE_NOT_FOUND) { const char *msg = acpi_format_exception(status); @@ -1632,5 +1650,5 @@ void __init acpi_iort_init(void) return; } - iort_init_platform_devices(); + acpi_iort_register_table(table, IORT_SOURCE_IORT); } diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h index 8e7e2ec37f1b..f4db5fff07cf 100644 --- a/include/linux/acpi_iort.h +++ b/include/linux/acpi_iort.h @@ -11,6 +11,11 @@ #include #include +enum iort_table_source { + IORT_SOURCE_IORT, /* The Real Thing */ + IORT_SOURCE_VIOT, /* Paravirtual extensions */ +}; + #define IORT_IRQ_MASK(irq) (irq & 0xULL) #define IORT_IRQ_TRIGGER_MASK(irq) ((irq >> 32) & 0xULL) @@ -27,6 +32,8 @@ int iort_register_domain_token(int trans_id, phys_addr_t base, void iort_deregister_domain_token(int trans_id); struct fwnode_handle *iort_find_domain_token(int trans_id); #ifdef CONFIG_ACPI_IORT +void acpi_iort_register_table(struct acpi_table_header *table, + enum iort_table_source source); void acpi_iort_init(void); u32 iort_msi_map_rid(struct device *dev, u32 req_id); struct irq_domain *iort_get_device_domain(struct device *dev, u32 req_id); @@ -37,6 +44,9 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size); const struct iommu_ops *iort_iommu_configure(struct device *dev); int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head); #else +static void acpi_iort_register_table(struct acpi_table_header *table, +enum iort_table_source source) +{ } static inline void acpi_iort_init(void) { } static inline u32 iort_msi_map_rid(struct device *dev, u32 req_id) { return req_id; } -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 10/13] iommu/virtio: Update IORT fwnode
When the virtio-iommu uses the PCI transport and the topology is described with IORT, register the PCI fwnode with IORT. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 8efa368134c0..9847552faecc 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -7,6 +7,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include #include #include #include @@ -989,6 +990,8 @@ static int viommu_set_fwnode(struct viommu_dev *viommu) set_primary_fwnode(dev, fwnode); } + /* Tell IORT about a PCI device's fwnode */ + iort_iommu_update_fwnode(dev, dev->fwnode); iommu_device_set_fwnode(>iommu, dev->fwnode); return 0; } @@ -1000,6 +1003,8 @@ static void viommu_clear_fwnode(struct viommu_dev *viommu) if (!dev->fwnode) return; + iort_iommu_update_fwnode(dev, NULL); + if (is_software_node(dev->fwnode)) { struct fwnode_handle *fwnode = dev->fwnode; -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 01/13] ACPI/IORT: Move IORT to the ACPI folder
IORT can be used (by QEMU) to describe a virtual topology containing an architecture-agnostic paravirtualized device. In order to build IORT for x86 systems, the driver has to be moved outside of arm64/. Since there is nothing specific to arm64 in the driver, it simply requires moving Makefile and Kconfig entries. Signed-off-by: Jean-Philippe Brucker --- MAINTAINERS | 9 + drivers/acpi/Kconfig| 3 +++ drivers/acpi/Makefile | 1 + drivers/acpi/arm64/Kconfig | 3 --- drivers/acpi/arm64/Makefile | 1 - drivers/acpi/{arm64 => }/iort.c | 0 6 files changed, 13 insertions(+), 4 deletions(-) rename drivers/acpi/{arm64 => }/iort.c (100%) diff --git a/MAINTAINERS b/MAINTAINERS index eb19fad370d7..9153d278f67e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -377,6 +377,15 @@ L: platform-driver-...@vger.kernel.org S: Maintained F: drivers/platform/x86/i2c-multi-instantiate.c +ACPI IORT DRIVER +M: Lorenzo Pieralisi +M: Hanjun Guo +M: Sudeep Holla +L: linux-a...@vger.kernel.org +L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers) +S: Maintained +F: drivers/acpi/iort.c + ACPI PMIC DRIVERS M: "Rafael J. Wysocki" M: Len Brown diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index ebe1e9e5fd81..548976c8b2b0 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -576,6 +576,9 @@ config TPS68470_PMIC_OPREGION region, which must be available before any of the devices using this, are probed. +config ACPI_IORT + bool + endif # ACPI config X86_PM_TIMER diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 5d361e4e3405..9d1792165713 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -123,3 +123,4 @@ video-objs += acpi_video.o video_detect.o obj-y += dptf/ obj-$(CONFIG_ARM64)+= arm64/ +obj-$(CONFIG_ACPI_IORT)+= iort.o diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig index 6dba187f4f2e..d0902c85d46e 100644 --- a/drivers/acpi/arm64/Kconfig +++ b/drivers/acpi/arm64/Kconfig @@ -3,8 +3,5 @@ # ACPI Configuration for ARM64 # -config ACPI_IORT - bool - config ACPI_GTDT bool diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 6ff50f4ed947..38771a816caf 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -1,3 +1,2 @@ # SPDX-License-Identifier: GPL-2.0-only -obj-$(CONFIG_ACPI_IORT)+= iort.o obj-$(CONFIG_ACPI_GTDT)+= gtdt.o diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/iort.c similarity index 100% rename from drivers/acpi/arm64/iort.c rename to drivers/acpi/iort.c -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 05/13] ACPI/IORT: Support VIOT virtio-mmio node
Add a new type of node to the IORT driver, that describes a virtio-iommu device based on the virtio-mmio transport. The node is only available when the IORT is a sub-table of the VIOT. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 66 ++--- 1 file changed, 62 insertions(+), 4 deletions(-) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index 1d43fbc0001f..adc5953fffa5 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -43,7 +43,8 @@ static bool iort_type_matches(u8 type, enum iort_node_category category) switch (category) { case IORT_IOMMU_TYPE: return type == ACPI_IORT_NODE_SMMU || - type == ACPI_IORT_NODE_SMMU_V3; + type == ACPI_IORT_NODE_SMMU_V3 || + type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU; case IORT_MSI_TYPE: return type == ACPI_IORT_NODE_ITS_GROUP; default: @@ -868,8 +869,10 @@ static inline bool iort_iommu_driver_enabled(u8 type) return IS_BUILTIN(CONFIG_ARM_SMMU_V3); case ACPI_IORT_NODE_SMMU: return IS_BUILTIN(CONFIG_ARM_SMMU); + case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU: + return IS_ENABLED(CONFIG_VIRTIO_IOMMU); default: - pr_warn("IORT node type %u does not describe an SMMU\n", type); + pr_warn("IORT node type %u does not describe an IOMMU\n", type); return false; } } @@ -1408,6 +1411,46 @@ static int __init arm_smmu_v3_pmcg_add_platdata(struct platform_device *pdev) return platform_device_add_data(pdev, , sizeof(model)); } +static int __init viommu_mmio_count_resources(struct acpi_iort_node *node) +{ + /* Mem + IRQ */ + return 2; +} + +static void __init viommu_mmio_init_resources(struct resource *res, + struct acpi_iort_node *node) +{ + int hw_irq, trigger; + struct acpi_viot_iort_virtio_mmio_iommu *viommu; + + viommu = (struct acpi_viot_iort_virtio_mmio_iommu *)node->node_data; + + res[0].start = viommu->base_address; + res[0].end = viommu->base_address + viommu->span - 1; + res[0].flags = IORESOURCE_MEM; + + hw_irq = IORT_IRQ_MASK(viommu->interrupt); + trigger = IORT_IRQ_TRIGGER_MASK(viommu->interrupt); + acpi_iort_register_irq(hw_irq, "viommu", trigger, res + 1); +} + +static void __init viommu_mmio_dma_configure(struct device *dev, + struct acpi_iort_node *node) +{ + enum dev_dma_attr attr; + struct acpi_viot_iort_virtio_mmio_iommu *viommu; + + viommu = (struct acpi_viot_iort_virtio_mmio_iommu *)node->node_data; + + attr = (viommu->flags & ACPI_VIOT_IORT_VIRTIO_MMIO_IOMMU_CACHE_COHERENT) ? + DEV_DMA_COHERENT : DEV_DMA_NON_COHERENT; + + dev->dma_mask = >coherent_dma_mask; + + /* Configure DMA for the page table walker */ + acpi_dma_configure(dev, attr); +} + struct iort_dev_config { const char *name; int (*dev_init)(struct acpi_iort_node *node); @@ -1443,6 +1486,14 @@ static const struct iort_dev_config iort_arm_smmu_v3_pmcg_cfg __initconst = { .dev_add_platdata = arm_smmu_v3_pmcg_add_platdata, }; +static const struct iort_dev_config iort_viommu_mmio_cfg __initconst = { + /* Probe with the generic virtio-mmio driver */ + .name = "virtio-mmio", + .dev_dma_configure = viommu_mmio_dma_configure, + .dev_count_resources = viommu_mmio_count_resources, + .dev_init_resources = viommu_mmio_init_resources, +}; + static __init const struct iort_dev_config *iort_get_dev_cfg( struct acpi_iort_node *node) { @@ -1453,9 +1504,16 @@ static __init const struct iort_dev_config *iort_get_dev_cfg( return _arm_smmu_cfg; case ACPI_IORT_NODE_PMCG: return _arm_smmu_v3_pmcg_cfg; - default: - return NULL; } + + if (iort_table_source == IORT_SOURCE_VIOT) { + switch (node->type) { + case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU: + return _viommu_mmio_cfg; + } + } + + return NULL; } /** -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 04/13] ACPI/IORT: Add node categories
The current node filtering won't work when introducing node types greater than 63 (such as the virtio-iommu nodes). Add node_type_matches() to filter nodes by category. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/iort.c | 34 -- 1 file changed, 24 insertions(+), 10 deletions(-) diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c index 9c6c91e06f8f..1d43fbc0001f 100644 --- a/drivers/acpi/iort.c +++ b/drivers/acpi/iort.c @@ -18,10 +18,10 @@ #include #include -#define IORT_TYPE_MASK(type) (1 << (type)) -#define IORT_MSI_TYPE (1 << ACPI_IORT_NODE_ITS_GROUP) -#define IORT_IOMMU_TYPE((1 << ACPI_IORT_NODE_SMMU) | \ - (1 << ACPI_IORT_NODE_SMMU_V3)) +enum iort_node_category { + IORT_MSI_TYPE, + IORT_IOMMU_TYPE, +}; struct iort_its_msi_chip { struct list_headlist; @@ -38,6 +38,20 @@ struct iort_fwnode { static LIST_HEAD(iort_fwnode_list); static DEFINE_SPINLOCK(iort_fwnode_lock); +static bool iort_type_matches(u8 type, enum iort_node_category category) +{ + switch (category) { + case IORT_IOMMU_TYPE: + return type == ACPI_IORT_NODE_SMMU || + type == ACPI_IORT_NODE_SMMU_V3; + case IORT_MSI_TYPE: + return type == ACPI_IORT_NODE_ITS_GROUP; + default: + WARN_ON(1); + return false; + } +} + /** * iort_set_fwnode() - Create iort_fwnode and use it to register *iommu data in the iort_fwnode_list @@ -397,7 +411,7 @@ static int iort_get_id_mapping_index(struct acpi_iort_node *node) static struct acpi_iort_node *iort_node_map_id(struct acpi_iort_node *node, u32 id_in, u32 *id_out, - u8 type_mask) + enum iort_node_category category) { u32 id = id_in; @@ -406,7 +420,7 @@ static struct acpi_iort_node *iort_node_map_id(struct acpi_iort_node *node, struct acpi_iort_id_mapping *map; int i, index; - if (IORT_TYPE_MASK(node->type) & type_mask) { + if (iort_type_matches(node->type, category)) { if (id_out) *id_out = id; return node; @@ -458,8 +472,8 @@ static struct acpi_iort_node *iort_node_map_id(struct acpi_iort_node *node, } static struct acpi_iort_node *iort_node_map_platform_id( - struct acpi_iort_node *node, u32 *id_out, u8 type_mask, - int index) + struct acpi_iort_node *node, u32 *id_out, + enum iort_node_category category, int index) { struct acpi_iort_node *parent; u32 id; @@ -475,8 +489,8 @@ static struct acpi_iort_node *iort_node_map_platform_id( * as NC (named component) -> SMMU -> ITS. If the type is matched, * return the initial dev id and its parent pointer directly. */ - if (!(IORT_TYPE_MASK(parent->type) & type_mask)) - parent = iort_node_map_id(parent, id, id_out, type_mask); + if (!iort_type_matches(parent->type, category)) + parent = iort_node_map_id(parent, id, id_out, category); else if (id_out) *id_out = id; -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 02/13] ACPI: Add VIOT definitions
This is temporary, until the VIOT table is published and these definitions added to ACPICA. Signed-off-by: Jean-Philippe Brucker --- include/acpi/actbl2.h | 31 +++ 1 file changed, 31 insertions(+) diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h index e45ced27f4c3..99c1d747e9d8 100644 --- a/include/acpi/actbl2.h +++ b/include/acpi/actbl2.h @@ -25,6 +25,7 @@ * the wrong signature. */ #define ACPI_SIG_IORT "IORT" /* IO Remapping Table */ +#define ACPI_SIG_VIOT "VIOT" /* Virtual I/O Table */ #define ACPI_SIG_IVRS "IVRS" /* I/O Virtualization Reporting Structure */ #define ACPI_SIG_LPIT "LPIT" /* Low Power Idle Table */ #define ACPI_SIG_MADT "APIC" /* Multiple APIC Description Table */ @@ -412,6 +413,36 @@ struct acpi_ivrs_memory { u64 memory_length; }; +/*** + * + * VIOT - Virtual I/O Table + *Version 1 + * + **/ + +struct acpi_table_viot { + struct acpi_table_header header; + u8 reserved[12]; + struct acpi_table_header base_table; +}; + +#define ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU0x80 +#define ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU 0x81 + +struct acpi_viot_iort_virtio_pci_iommu { + u32 devid; +}; + +struct acpi_viot_iort_virtio_mmio_iommu { + u64 base_address; + u64 span; + u64 flags; + u64 interrupt; +}; + +/* FIXME: rename this monstrosity. */ +#define ACPI_VIOT_IORT_VIRTIO_MMIO_IOMMU_CACHE_COHERENT (1<<0) + /*** * * LPIT - Low Power Idle Table -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [RFC 00/13] virtio-iommu on non-devicetree platforms
I'm seeking feedback on multi-platform support for virtio-iommu. At the moment only devicetree (DT) is supported and we don't have a pleasant solution for other platforms. Once we figure out the topology description, x86 support is trivial. Since the IOMMU manages memory accesses from other devices, the guest kernel needs to initialize the IOMMU before endpoints start issuing DMA. It's a solved problem: firmware or hypervisor describes through DT or ACPI tables the device dependencies, and probe of endpoints is deferred until the IOMMU is probed. But: (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD and IORT for Arm). From my point of view IORT is easier to extend, since we just need to introduce a new node type. There are no dependencies to Arm in the Linux IORT driver, so it works well with CONFIG_X86. However, there are concerns about other OS vendors feeling obligated to implement this new node, so Arm proposed introducing another ACPI table, that can wrap any of DMAR, IVRS and IORT to extend it with new virtual nodes. A draft of this VIOT table specification is available at http://jpbrucker.net/virtio-iommu/viot/viot-v5.pdf I'm afraid this could increase fragmentation as guests would need to implement or modify their support for all of DMAR, IVRS and IORT. If we end up doing VIOT, I suggest limiting it to IORT. (2) In addition, there are some concerns about having virtio depend on ACPI or DT. Some hypervisors (Firecracker, QEMU microvm, kvmtool x86 [1]) don't currently implement those methods. It was suggested to embed the topology description into the device. It can work, as demonstrated at the end of this RFC, with the following limitations: - The topology description must be read before any endpoint managed by the IOMMU is probed, and even before the virtio module is loaded. This RFC uses a PCI quirk to manually parse the virtio configuration. It assumes that all endpoints managed by the IOMMU are under this same PCI host. - I don't have a solution for the virtio-mmio transport at the moment, because I haven't had time to modify a host to test it. I think it could either use a notifier on the platform bus, or better, a new 'iommu' command-line argument to the virtio-mmio driver. So the current prototype doesn't work for firecracker and microvm, which rely on virtio-mmio. - For Arm, if the platform has an ITS, the hypervisor needs IORT or DT to describe it anyway. More generally, not using either ACPI or DT might prevent from supporting other features as well. I suspect the above users will have to implement a standard method sooner or later. - Even when reusing as much existing code as possible, guest support is still going to be around a few hundred lines since we can't rely on the normal virtio infrastructure to be loaded at that point. As you can see below, the diffstat for the incomplete topology implementation is already bigger than the exhaustive IORT support, even when jumping through the VIOT hoop. So it's a lightweight solution for very specific use-cases, and we should still support ACPI for the general case. Multi-platform guests such as Linux will then need to support three topology descriptions instead of two. In this RFC I present both solutions, but I'd rather not keep all of it. Please see the individual patches for details: (1) Patches 1, 3-10 add support for virtio-iommu to the Linux IORT driver and patches 2, 11 add the VIOT glue. (2) Patch 12 adds the built-in topology description to the virtio-iommu specification. Patch 13 is a partial implementation for the Linux virtio-iommu driver. It only supports PCI, not platform devices. You can find Linux and QEMU code on my virtio-iommu/devel branches at http://jpbrucker.net/git/linux and http://jpbrucker.net/git/qemu I split the diffstat since there are two independent features. The first one is for patches 1-11, and the second one for patch 13. Jean-Philippe Brucker (11): ACPI/IORT: Move IORT to the ACPI folder ACPI: Add VIOT definitions ACPI/IORT: Allow registration of external tables ACPI/IORT: Add node categories ACPI/IORT: Support VIOT virtio-mmio node ACPI/IORT: Support VIOT virtio-pci node ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode ACPI/IORT: Add callback to update a device's fwnode iommu/virtio: Create fwnode if necessary iommu/virtio: Update IORT fwnode ACPI: Add VIOT table MAINTAINERS | 9 + drivers/acpi/Kconfig| 7 + drivers/acpi/Makefile | 2 + drivers/acpi/arm64/Kconfig | 3 - drivers/acpi/arm64/Makefile | 1 - drivers/acpi/bus.c | 2 + drivers/acpi/{arm64 => }/iort.c | 317 ++-- drivers/acpi/tables.c |
Re: [virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification
On Wed, Nov 20, 2019 at 06:27:34PM +0100, Cornelia Huck wrote: > > +\begin{lstlisting} > > +struct virtio_iommu_config { > > + le64 page_size_mask; > > + struct virtio_iommu_range_64 { > > +le64 start; > > +le64 end; > > + } input_range; > > + struct virtio_iommu_range_32 { > > +le32 start; > > +le32 end; > > + } domain_range; > > + le32 probe_size; > > +}; > > +\end{lstlisting} > > + > > +\drivernormative{\subsubsection}{Device configuration layout}{Device Types > > / IOMMU Device / Device configuration layout} > > + > > +The driver MUST NOT write to device configuration fields. > > + > > +\devicenormative{\subsubsection}{Device configuration layout}{Device Types > > / IOMMU Device / Device configuration layout} > > + > > +The device SHOULD set \field{padding} to zero. > > I don't see any field named 'padding' -- is that a leftover from an > earlier version? Oh right, it is. The previous version had an 1-byte field before probe_size, so the structure needed 3 bytes of padding. Since we replaced that field by domain_range in v4, padding is gone. > If it is, we can probably remove it as a trivial change on top after > this change went in. I can also send a new version (after waiting a few days for other comments), please let me know what you prefer. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification
The IOMMU device allows a guest to manage DMA mappings for physical, emulated and paravirtualized endpoints. Add device description for the virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and UNMAP requests, as well as translation error reporting. Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37 Signed-off-by: Jean-Philippe Brucker --- conformance.tex | 40 ++- content.tex | 1 + virtio-iommu.tex | 816 +++ 3 files changed, 855 insertions(+), 2 deletions(-) create mode 100644 virtio-iommu.tex diff --git a/conformance.tex b/conformance.tex index 0ac58aa..86ccf61 100644 --- a/conformance.tex +++ b/conformance.tex @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets} \begin{itemize} \item Clause \ref{sec:Conformance / Driver Conformance}. \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}. -\item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}. +\item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}. \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}. \end{itemize} \item[Device] A device MUST conform to four conformance clauses: \begin{itemize} \item Clause \ref{sec:Conformance / Device Conformance}. \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}. -\item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}. +\item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device Conformance}. \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}. \end{itemize} \end{description} @@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets} \item \ref{drivernormative:Device Types / Socket Device / Device Operation / Device Events} \end{itemize
[virtio-dev] [PATCH RESEND v4 0/1] Add virtio-iommu specification
Hi, This is the base virtio-iommu specification. There wasn't any comment on v4, but I have been away for a while so I'm resending the patch rebased onto the current master. The driver implementation has been merged in Linux, and the QEMU device should follow soon. The next step is proper multiplatform support, for which I'll send a proposal shortly. Then I'll work more on PASID and PRI support, and some much needed optimizations. I would like to request a vote for https://github.com/oasis-tcs/virtio-spec/issues/37 Thanks, Jean Jean-Philippe Brucker (1): Add virtio-iommu device specification conformance.tex | 40 ++- content.tex | 1 + virtio-iommu.tex | 816 +++ 3 files changed, 855 insertions(+), 2 deletions(-) create mode 100644 virtio-iommu.tex -- 2.24.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH] snd: Add virtio sound device specification
On Tue, Nov 12, 2019 at 02:20:44PM +, Liam Girdwood wrote: > > Other virtio devices rely on the buffer length already provided by > > virtio > > for this. For example see the virtio-net control virtqueue > > (virtio_net_ctrl*), which is extended with new commands from time to > > time. > > Most recently, the "receive-side scaling" feature introduces a large > > structure as payload to virtio_net_ctrl. > > > > Sorry, not following, what does this have to do with combining type and > size into request_code ? Or are you saying the size is encoded > elsewhere in the virtio transport data ? Yes, the buffer size is encoded in the transport data. [...] > This would be a good improvement, it's less copying and would likely > improve user experience, however the buffer ptr still suffers latency > as it's queued (same for stream positions going the other way). Right if the queuing overhead is still too large, then I don't think the current virtqueues can be used. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH v2] snd: Add virtio sound device specification
Hi Mikhail, On Tue, Nov 05, 2019 at 04:06:06PM +0100, Mikhail Golubev wrote: > From: Anton Yakovlev > > Hi everyone, > > the patch proposes an updated version of virtio specification for a new > virtio sound device. The initial PATCH can be found here: [1]. The > device may be useful in case when having audio is required but a device > passthrough or emulation is not an option. > > The virtio sound card supports output and input PCM streams. It reports > stream capabilities through device configuration space and utilizes two > virtqueues: one for control requests and one for I/O requests. Did you send out the implementation prototypes as well? It might help understand how the device works. [...] > +\subsection{Virtqueues}\label{sec:Device Types / Sound Device / Virtqueues} > + > +\begin{description} > +\item[0] controlq > +\item[1] pcmq > +\end{description} > + > +The controlq virtqueue always exists, the pcmq virtqueue only exists if > +the VIRTIO_SND_F_PCM_OUTPUT and/or VIRTIO_SND_F_PCM_INPUT feature is > negotiated. Why not use two queues, one for input and one for output? I think most virtio devices do it that way. It could simplify the implementations, shave 32 bits off the buffers, and allow to quiesce notifications for input and output streams independently. And in the future you can extend the device with more than two streams. > + > +\subsection{Feature bits}\label{sec:Device Types / Sound Device / Feature > bits} > + > +\begin{description} > +\item[VIRTIO_SND_F_PCM_OUTPUT (0)] Output PCM stream support. > +\item[VIRTIO_SND_F_PCM_INPUT (1)] Input PCM stream support. > +\end{description} > + > +\subsection{Device configuration layout}\label{sec:Device Types / Sound > Device / Device configuration layout} > + > +\begin{lstlisting} > +/* supported PCM sample formats */ > +enum { > +VIRTIO_SND_PCM_FMT_S8 = 0, > +VIRTIO_SND_PCM_FMT_U8, > +VIRTIO_SND_PCM_FMT_S16, > +VIRTIO_SND_PCM_FMT_U16, > +VIRTIO_SND_PCM_FMT_S32, > +VIRTIO_SND_PCM_FMT_U32, > +VIRTIO_SND_PCM_FMT_FLOAT, > +VIRTIO_SND_PCM_FMT_FLOAT64 > +}; Is it worth adding a word or two about the encoding used, in particular that this is linear quantization? Or maybe it's standard enough, I don't know much about audio. > + > +/* supported PCM frame rates */ > +enum { > +VIRTIO_SND_PCM_RATE_5512 = 0, > +VIRTIO_SND_PCM_RATE_8000, > +VIRTIO_SND_PCM_RATE_11025, > +VIRTIO_SND_PCM_RATE_16000, > +VIRTIO_SND_PCM_RATE_22050, > +VIRTIO_SND_PCM_RATE_32000, > +VIRTIO_SND_PCM_RATE_44100, > +VIRTIO_SND_PCM_RATE_48000, > +VIRTIO_SND_PCM_RATE_64000, > +VIRTIO_SND_PCM_RATE_88200, > +VIRTIO_SND_PCM_RATE_96000, > +VIRTIO_SND_PCM_RATE_176400, > +VIRTIO_SND_PCM_RATE_192000, > +VIRTIO_SND_PCM_RATE_384000 > +}; > + > +/* a PCM stream configuration */ > +struct virtio_pcm_stream_config { > +u8 channels_min; > +u8 channels_max; > +le16 formats; /* 1 << VIRTIO_SND_PCM_FMT_XXX */ > +le16 rates; /* 1 << VIRTIO_SND_PCM_RATE_XXX */ > +u16 padding; You could add a lot more padding, because the current layout doesn't allow to extend virtio_pcm_stream_config much (an old driver will expect the pcm.input field to start at offset 0x8 of the config space). > +}; > + > +/* a device configuration space */ > +struct virtio_snd_config { > +struct virtio_pcm_config { > +struct virtio_pcm_stream_config output; > +struct virtio_pcm_stream_config input; > +} pcm; > +}; > +\end{lstlisting} > + > +\subsubsection{Device configuration fields} > + > +The \field{pcm.output} and \field{pcm.input} fields contain PCM stream > +configuration: > + > +\begin{description} > +\item[\field{channels_min}] (driver-read-only) is a minimum number of > supported > +channels. > +\item[\field{channels_max}] (driver-read-only) is a maximum number of > supported > +channels. > +\item[\field{formats}] (driver-read-only) is a supported sample format bit > map. > +\item[\field{rates}] (driver-read-only) is a supported frame rate bit map. > +\end{description} > + > +Only interleaved samples are supported. Similar noob remark: I guess this is standard in audio hardware and doesn't need more explanation? If the term is alsa-specific then we'd need a few words about what the format looks like. > + > +\subsection{Device Initialization} > + > +\begin{enumerate} > +\item If the VIRTIO_SND_F_PCM_OUTPUT feature is negotiated, > \field{pcm.output} > +contains the valid output PCM stream configuration. > +\item If the VIRTIO_SND_F_PCM_INPUT feature is negotiated, \field{pcm.input} > +contains the valid input PCM stream configuration. > +\end{enumerate} > + > +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Sound > Device / Device Initialization} > + > +\begin{enumerate} > +\item The device MUST NOT set the not defined format and rate bits. s/not defined/undefined/ (here and elsewhere) > +\item The device MUST initialize padding bytes
Re: [virtio-dev] [PATCH] snd: Add virtio sound device specification
Hi, On Mon, Nov 11, 2019 at 03:20:55PM +, Liam Girdwood wrote: > > > > +struct virtio_snd_ctl_msg { > > > > +/* device-read-only data */ > > > > +le32 request_code; > > > > +u8 request_payload[]; > > > > > > How do I know how much data to read here ? Is size embedded in > > > request_code ? > > > > Yes, you are right. Actual request/response length can be easily > > derived from > > a request_code. > > > > Nak, please use a separate size and type code otherwise we will have > real problems in the future when we come to add new features (with new > codes or bigger sizes). Other virtio devices rely on the buffer length already provided by virtio for this. For example see the virtio-net control virtqueue (virtio_net_ctrl*), which is extended with new commands from time to time. Most recently, the "receive-side scaling" feature introduces a large structure as payload to virtio_net_ctrl. [...] > > > > +\begin{lstlisting} > > > > +struct virtio_snd_pcm_xfer { > > > > +le32 stream; > > > > +u8 data[]; > > > > +le32 status; > > > > +le32 actual_length; > > > > > > Not following this, is actual_length the size of data[]. If so, it > > > must > > > preceed data[]. > > > > No, the actual_length field is supposed to be used by device side to > > report > > actual amount of bytes read from/written to a buffer. By the way the written size is already provided by virtio (field len in the used descriptor), but not the read size. > > In real world > > scenario, > > if an I/O request contains N bytes, a device can play/capture *up to* > > N bytes. > > Thus, it's required to report this length back to a driver. > > > > This is not how PCM audio buffering works. I don't copy extra padding > if I dont need to, I only copy when my buffer is full. See the need for > period size in earlier comments. [...] > > > 2) Zero copy of audio data and stream positions. Essential for any > > > low > > > latency audio use cases (if supported by HW, hypervisor, VM > > > framework). > > > > What do you mean by zero-copy? Shared memory between device and > > driver sides? > > Yes, but I see this is now a separate thread as it used by others too. The virtio transport already permits zero-copy, because one buffer can be described with a chain of descriptors, each pointing to a different area in memory. The driver can split the virtio_snd_pcm_xfer into header and data: virtqueue desc ring | | +---+ | |-> header (copied) +- - - -+ | |-> PCM data (mapped) +---+ | | So it might be a good idea to put all the metadata (stream, status, actual_length) at the beginning of the virtio_snd_pcm_xfer struct, otherwise the driver will need three descriptors instead of two to achieve this. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node
On 30/05/2019 18:45, Michael S. Tsirkin wrote: > On Thu, May 30, 2019 at 06:09:24PM +0100, Jean-Philippe Brucker wrote: >> Some systems implement virtio-iommu as a PCI endpoint. The operating >> system needs to discover the relationship between IOMMU and masters long >> before the PCI endpoint gets probed. Add a PCI child node to describe the >> virtio-iommu device. >> >> The virtio-pci-iommu is conceptually split between a PCI programming >> interface and a translation component on the parent bus. The latter >> doesn't have a node in the device tree. The virtio-pci-iommu node >> describes both, by linking the PCI endpoint to "iommus" property of DMA >> master nodes and to "iommu-map" properties of bus nodes. >> >> Reviewed-by: Rob Herring >> Reviewed-by: Eric Auger >> Signed-off-by: Jean-Philippe Brucker > > So this is just an example right? > We are not defining any new properties or anything like that. Yes it's just an example. The properties already exist but it's good to describe how to put them together for this particular case, because there isn't a precedent describing the topology for an IOMMU that appears on the PCI bus. > I think down the road for non dt platforms we want to put this > info in the config space of the device. I do not think ACPI > is the best option for this since not all systems have it. > But that can wait. There is the probe order problem - PCI needs this info before starting to probe devices on the bus. Maybe we could store the info in a separate memory region, that is referenced on the command-line and that the guest can read early. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v8 4/7] PCI: OF: Initialize dev->fwnode appropriately
For PCI devices that have an OF node, set the fwnode as well. This way drivers that rely on fwnode don't need the special case described by commit f94277af03ea ("of/platform: Initialise dev->fwnode appropriately"). Acked-by: Bjorn Helgaas Signed-off-by: Jean-Philippe Brucker --- drivers/pci/of.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/pci/of.c b/drivers/pci/of.c index 73d5adec0a28..c4f1b5507b40 100644 --- a/drivers/pci/of.c +++ b/drivers/pci/of.c @@ -22,12 +22,15 @@ void pci_set_of_node(struct pci_dev *dev) return; dev->dev.of_node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn); + if (dev->dev.of_node) + dev->dev.fwnode = >dev.of_node->fwnode; } void pci_release_of_node(struct pci_dev *dev) { of_node_put(dev->dev.of_node); dev->dev.of_node = NULL; + dev->dev.fwnode = NULL; } void pci_set_bus_of_node(struct pci_bus *bus) @@ -42,12 +45,15 @@ void pci_set_bus_of_node(struct pci_bus *bus) bus->self->untrusted = true; } bus->dev.of_node = node; + if (node) + bus->dev.fwnode = >fwnode; } void pci_release_bus_of_node(struct pci_bus *bus) { of_node_put(bus->dev.of_node); bus->dev.of_node = NULL; + bus->dev.fwnode = NULL; } struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus) -- 2.21.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v8 6/7] iommu/virtio: Add probe request
When the device offers the probe feature, send a probe request for each device managed by the IOMMU. Extract RESV_MEM information. When we encounter a MSI doorbell region, set it up as a IOMMU_RESV_MSI region. This will tell other subsystems that there is no need to map the MSI doorbell in the virtio-iommu, because MSIs bypass it. Acked-by: Joerg Roedel Reviewed-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 157 -- include/uapi/linux/virtio_iommu.h | 36 +++ 2 files changed, 187 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index b2719a87c3c5..5d4947c47420 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -49,6 +49,7 @@ struct viommu_dev { u32 last_domain; /* Supported MAP flags */ u32 map_flags; + u32 probe_size; }; struct viommu_mapping { @@ -71,8 +72,10 @@ struct viommu_domain { }; struct viommu_endpoint { + struct device *dev; struct viommu_dev *viommu; struct viommu_domain*vdomain; + struct list_headresv_regions; }; struct viommu_request { @@ -125,6 +128,9 @@ static off_t viommu_get_write_desc_offset(struct viommu_dev *viommu, { size_t tail_size = sizeof(struct virtio_iommu_req_tail); + if (req->type == VIRTIO_IOMMU_T_PROBE) + return len - viommu->probe_size - tail_size; + return len - tail_size; } @@ -399,6 +405,110 @@ static int viommu_replay_mappings(struct viommu_domain *vdomain) return ret; } +static int viommu_add_resv_mem(struct viommu_endpoint *vdev, + struct virtio_iommu_probe_resv_mem *mem, + size_t len) +{ + size_t size; + u64 start64, end64; + phys_addr_t start, end; + struct iommu_resv_region *region = NULL; + unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO; + + start = start64 = le64_to_cpu(mem->start); + end = end64 = le64_to_cpu(mem->end); + size = end64 - start64 + 1; + + /* Catch any overflow, including the unlikely end64 - start64 + 1 = 0 */ + if (start != start64 || end != end64 || size < end64 - start64) + return -EOVERFLOW; + + if (len < sizeof(*mem)) + return -EINVAL; + + switch (mem->subtype) { + default: + dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n", +mem->subtype); + /* Fall-through */ + case VIRTIO_IOMMU_RESV_MEM_T_RESERVED: + region = iommu_alloc_resv_region(start, size, 0, +IOMMU_RESV_RESERVED); + break; + case VIRTIO_IOMMU_RESV_MEM_T_MSI: + region = iommu_alloc_resv_region(start, size, prot, +IOMMU_RESV_MSI); + break; + } + if (!region) + return -ENOMEM; + + list_add(>resv_regions, >list); + return 0; +} + +static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev) +{ + int ret; + u16 type, len; + size_t cur = 0; + size_t probe_len; + struct virtio_iommu_req_probe *probe; + struct virtio_iommu_probe_property *prop; + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); + struct viommu_endpoint *vdev = fwspec->iommu_priv; + + if (!fwspec->num_ids) + return -EINVAL; + + probe_len = sizeof(*probe) + viommu->probe_size + + sizeof(struct virtio_iommu_req_tail); + probe = kzalloc(probe_len, GFP_KERNEL); + if (!probe) + return -ENOMEM; + + probe->head.type = VIRTIO_IOMMU_T_PROBE; + /* +* For now, assume that properties of an endpoint that outputs multiple +* IDs are consistent. Only probe the first one. +*/ + probe->endpoint = cpu_to_le32(fwspec->ids[0]); + + ret = viommu_send_req_sync(viommu, probe, probe_len); + if (ret) + goto out_free; + + prop = (void *)probe->properties; + type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK; + + while (type != VIRTIO_IOMMU_PROBE_T_NONE && + cur < viommu->probe_size) { + len = le16_to_cpu(prop->length) + sizeof(*prop); + + switch (type) { + case VIRTIO_IOMMU_PROBE_T_RESV_MEM: + ret = viommu_add_resv_mem(vdev, (void *)prop, len); + break; + default: + dev_err(dev, "unknown viommu prop 0x%x\n", type); +
[virtio-dev] [PATCH v8 7/7] iommu/virtio: Add event queue
The event queue offers a way for the device to report access faults from endpoints. It is implemented on virtqueue #1. Whenever the host needs to signal a fault, it fills one of the buffers offered by the guest and interrupts it. Acked-by: Joerg Roedel Reviewed-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/virtio-iommu.c | 115 +++--- include/uapi/linux/virtio_iommu.h | 19 + 2 files changed, 125 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c index 5d4947c47420..2688cdcac6e5 100644 --- a/drivers/iommu/virtio-iommu.c +++ b/drivers/iommu/virtio-iommu.c @@ -29,7 +29,8 @@ #define MSI_IOVA_LENGTH0x10 #define VIOMMU_REQUEST_VQ 0 -#define VIOMMU_NR_VQS 1 +#define VIOMMU_EVENT_VQ1 +#define VIOMMU_NR_VQS 2 struct viommu_dev { struct iommu_device iommu; @@ -41,6 +42,7 @@ struct viommu_dev { struct virtqueue*vqs[VIOMMU_NR_VQS]; spinlock_t request_lock; struct list_headrequests; + void*evts; /* Device configuration */ struct iommu_domain_geometrygeometry; @@ -86,6 +88,15 @@ struct viommu_request { charbuf[]; }; +#define VIOMMU_FAULT_RESV_MASK 0xff00 + +struct viommu_event { + union { + u32 head; + struct virtio_iommu_fault fault; + }; +}; + #define to_viommu_domain(domain) \ container_of(domain, struct viommu_domain, domain) @@ -509,6 +520,68 @@ static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev) return ret; } +static int viommu_fault_handler(struct viommu_dev *viommu, + struct virtio_iommu_fault *fault) +{ + char *reason_str; + + u8 reason = fault->reason; + u32 flags = le32_to_cpu(fault->flags); + u32 endpoint= le32_to_cpu(fault->endpoint); + u64 address = le64_to_cpu(fault->address); + + switch (reason) { + case VIRTIO_IOMMU_FAULT_R_DOMAIN: + reason_str = "domain"; + break; + case VIRTIO_IOMMU_FAULT_R_MAPPING: + reason_str = "page"; + break; + case VIRTIO_IOMMU_FAULT_R_UNKNOWN: + default: + reason_str = "unknown"; + break; + } + + /* TODO: find EP by ID and report_iommu_fault */ + if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS) + dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx [%s%s%s]\n", + reason_str, endpoint, address, + flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : "", + flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : "", + flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : ""); + else + dev_err_ratelimited(viommu->dev, "%s fault from EP %u\n", + reason_str, endpoint); + return 0; +} + +static void viommu_event_handler(struct virtqueue *vq) +{ + int ret; + unsigned int len; + struct scatterlist sg[1]; + struct viommu_event *evt; + struct viommu_dev *viommu = vq->vdev->priv; + + while ((evt = virtqueue_get_buf(vq, )) != NULL) { + if (len > sizeof(*evt)) { + dev_err(viommu->dev, + "invalid event buffer (len %u != %zu)\n", + len, sizeof(*evt)); + } else if (!(evt->head & VIOMMU_FAULT_RESV_MASK)) { + viommu_fault_handler(viommu, >fault); + } + + sg_init_one(sg, evt, sizeof(*evt)); + ret = virtqueue_add_inbuf(vq, sg, 1, evt, GFP_ATOMIC); + if (ret) + dev_err(viommu->dev, "could not add event buffer\n"); + } + + virtqueue_kick(vq); +} + /* IOMMU API */ static struct iommu_domain *viommu_domain_alloc(unsigned type) @@ -895,16 +968,35 @@ static struct iommu_ops viommu_ops = { static int viommu_init_vqs(struct viommu_dev *viommu) { struct virtio_device *vdev = dev_to_virtio(viommu->dev); - const char *name = "request"; - void *ret; + const char *names[] = { "request", "event" }; + vq_callback_t *callbacks[] = { + NULL, /* No async requests */ + viommu_event_handler, + }; - ret = virtio_find_single_vq(vde
[virtio-dev] [PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node
Some systems implement virtio-iommu as a PCI endpoint. The operating system needs to discover the relationship between IOMMU and masters long before the PCI endpoint gets probed. Add a PCI child node to describe the virtio-iommu device. The virtio-pci-iommu is conceptually split between a PCI programming interface and a translation component on the parent bus. The latter doesn't have a node in the device tree. The virtio-pci-iommu node describes both, by linking the PCI endpoint to "iommus" property of DMA master nodes and to "iommu-map" properties of bus nodes. Reviewed-by: Rob Herring Reviewed-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- .../devicetree/bindings/virtio/iommu.txt | 66 +++ 1 file changed, 66 insertions(+) create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt diff --git a/Documentation/devicetree/bindings/virtio/iommu.txt b/Documentation/devicetree/bindings/virtio/iommu.txt new file mode 100644 index ..2407fea0651c --- /dev/null +++ b/Documentation/devicetree/bindings/virtio/iommu.txt @@ -0,0 +1,66 @@ +* virtio IOMMU PCI device + +When virtio-iommu uses the PCI transport, its programming interface is +discovered dynamically by the PCI probing infrastructure. However the +device tree statically describes the relation between IOMMU and DMA +masters. Therefore, the PCI root complex that hosts the virtio-iommu +contains a child node representing the IOMMU device explicitly. + +Required properties: + +- compatible: Should be "virtio,pci-iommu" +- reg: PCI address of the IOMMU. As defined in the PCI Bus + Binding reference [1], the reg property is a five-cell + address encoded as (phys.hi phys.mid phys.lo size.hi + size.lo). phys.hi should contain the device's BDF as + 0b dfff . The other cells + should be zero. +- #iommu-cells:Each platform DMA master managed by the IOMMU is assigned + an endpoint ID, described by the "iommus" property [2]. + For virtio-iommu, #iommu-cells must be 1. + +Notes: + +- DMA from the IOMMU device isn't managed by another IOMMU. Therefore the + virtio-iommu node doesn't have an "iommus" property, and is omitted from + the iommu-map property of the root complex. + +Example: + +pcie@1000 { + compatible = "pci-host-ecam-generic"; + ... + + /* The IOMMU programming interface uses slot 00:01.0 */ + iommu0: iommu@0008 { + compatible = "virtio,pci-iommu"; + reg = <0x0800 0 0 0 0>; + #iommu-cells = <1>; + }; + + /* +* The IOMMU manages all functions in this PCI domain except +* itself. Omit BDF 00:01.0. +*/ + iommu-map = <0x0 0x0 0x8> + <0x9 0x9 0xfff7>; +}; + +pcie@2000 { + compatible = "pci-host-ecam-generic"; + ... + /* +* The IOMMU also manages all functions from this domain, +* with endpoint IDs 0x1 - 0x1 +*/ + iommu-map = <0x0 0x1 0x1>; +}; + +ethernet@fe001000 { + ... + /* The IOMMU manages this platform device with endpoint ID 0x2 */ + iommus = < 0x2>; +}; + +[1] Documentation/devicetree/bindings/pci/pci.txt +[2] Documentation/devicetree/bindings/iommu/iommu.txt -- 2.21.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v8 5/7] iommu: Add virtio-iommu driver
The virtio IOMMU is a para-virtualized device, allowing to send IOMMU requests such as map/unmap over virtio transport without emulating page tables. This implementation handles ATTACH, DETACH, MAP and UNMAP requests. The bulk of the code transforms calls coming from the IOMMU API into corresponding virtio requests. Mappings are kept in an interval tree instead of page tables. A little more work is required for modular and x86 support, so for the moment the driver depends on CONFIG_VIRTIO=y and CONFIG_ARM64. Acked-by: Joerg Roedel Signed-off-by: Jean-Philippe Brucker --- MAINTAINERS | 7 + drivers/iommu/Kconfig | 11 + drivers/iommu/Makefile| 1 + drivers/iommu/virtio-iommu.c | 934 ++ include/uapi/linux/virtio_ids.h | 1 + include/uapi/linux/virtio_iommu.h | 110 6 files changed, 1064 insertions(+) create mode 100644 drivers/iommu/virtio-iommu.c create mode 100644 include/uapi/linux/virtio_iommu.h diff --git a/MAINTAINERS b/MAINTAINERS index 429c6c624861..62bd1834d95a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -16807,6 +16807,13 @@ S: Maintained F: drivers/virtio/virtio_input.c F: include/uapi/linux/virtio_input.h +VIRTIO IOMMU DRIVER +M: Jean-Philippe Brucker +L: virtualizat...@lists.linux-foundation.org +S: Maintained +F: drivers/iommu/virtio-iommu.c +F: include/uapi/linux/virtio_iommu.h + VIRTUAL BOX GUEST DEVICE DRIVER M: Hans de Goede M: Arnd Bergmann diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 83664db5221d..e15cdcd8cb3c 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -473,4 +473,15 @@ config HYPERV_IOMMU Stub IOMMU driver to handle IRQs as to allow Hyper-V Linux guests to run with x2APIC mode enabled. +config VIRTIO_IOMMU + bool "Virtio IOMMU driver" + depends on VIRTIO=y + depends on ARM64 + select IOMMU_API + select INTERVAL_TREE + help + Para-virtualised IOMMU driver with virtio. + + Say Y here if you intend to run this kernel as a guest. + endif # IOMMU_SUPPORT diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 8c71a15e986b..f13f36ae1af6 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -33,3 +33,4 @@ obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o +obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c new file mode 100644 index ..b2719a87c3c5 --- /dev/null +++ b/drivers/iommu/virtio-iommu.c @@ -0,0 +1,934 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Virtio driver for the paravirtualized IOMMU + * + * Copyright (C) 2019 Arm Limited + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define MSI_IOVA_BASE 0x800 +#define MSI_IOVA_LENGTH0x10 + +#define VIOMMU_REQUEST_VQ 0 +#define VIOMMU_NR_VQS 1 + +struct viommu_dev { + struct iommu_device iommu; + struct device *dev; + struct virtio_device*vdev; + + struct ida domain_ids; + + struct virtqueue*vqs[VIOMMU_NR_VQS]; + spinlock_t request_lock; + struct list_headrequests; + + /* Device configuration */ + struct iommu_domain_geometrygeometry; + u64 pgsize_bitmap; + u32 first_domain; + u32 last_domain; + /* Supported MAP flags */ + u32 map_flags; +}; + +struct viommu_mapping { + phys_addr_t paddr; + struct interval_tree_node iova; + u32 flags; +}; + +struct viommu_domain { + struct iommu_domain domain; + struct viommu_dev *viommu; + struct mutexmutex; /* protects viommu pointer */ + unsigned intid; + u32 map_flags; + + spinlock_t mappings_lock; + struct rb_root_cached mappings; + + unsigned long nr_endpoints; +}; + +struct viommu_endpoint { + struct viommu_dev *viommu; + struct viommu_domain*vdomain; +}; + +struct viommu_request { + struct list_headlist; + void*writeback; + unsigned int
[virtio-dev] [PATCH v8 1/7] dt-bindings: virtio-mmio: Add IOMMU description
The nature of a virtio-mmio node is discovered by the virtio driver at probe time. However the DMA relation between devices must be described statically. When a virtio-mmio node is a virtio-iommu device, it needs an "#iommu-cells" property as specified by bindings/iommu/iommu.txt. Otherwise, the virtio-mmio device may perform DMA through an IOMMU, which requires an "iommus" property. Describe these requirements in the device-tree bindings documentation. Reviewed-by: Rob Herring Reviewed-by: Eric Auger Signed-off-by: Jean-Philippe Brucker --- .../devicetree/bindings/virtio/mmio.txt | 30 +++ 1 file changed, 30 insertions(+) diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt index 5069c1b8e193..21af30fbb81f 100644 --- a/Documentation/devicetree/bindings/virtio/mmio.txt +++ b/Documentation/devicetree/bindings/virtio/mmio.txt @@ -8,10 +8,40 @@ Required properties: - reg: control registers base address and size including configuration space - interrupts: interrupt generated by the device +Required properties for virtio-iommu: + +- #iommu-cells:When the node corresponds to a virtio-iommu device, it is + linked to DMA masters using the "iommus" or "iommu-map" + properties [1][2]. #iommu-cells specifies the size of the + "iommus" property. For virtio-iommu #iommu-cells must be + 1, each cell describing a single endpoint ID. + +Optional properties: + +- iommus: If the device accesses memory through an IOMMU, it should + have an "iommus" property [1]. Since virtio-iommu itself + does not access memory through an IOMMU, the "virtio,mmio" + node cannot have both an "#iommu-cells" and an "iommus" + property. + Example: virtio_block@3000 { compatible = "virtio,mmio"; reg = <0x3000 0x100>; interrupts = <41>; + + /* Device has endpoint ID 23 */ + iommus = < 23> } + + viommu: iommu@3100 { + compatible = "virtio,mmio"; + reg = <0x3100 0x100>; + interrupts = <42>; + + #iommu-cells = <1> + } + +[1] Documentation/devicetree/bindings/iommu/iommu.txt +[2] Documentation/devicetree/bindings/pci/pci-iommu.txt -- 2.21.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v8 0/7] Add virtio-iommu driver
Implement the virtio-iommu driver, following specification v0.12 [1]. Since last version [2] we've worked on improving the specification, which resulted in the following changes to the interface: * Remove the EXEC flag. * Add feature bit for the MMIO flag. * Change domain_bits to domain_range. Given that there were small changes to patch 5/7, I removed the review and test tags. Please find the code at [3]. [1] Virtio-iommu specification v0.12, sources and pdf git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.12 http://jpbrucker.net/virtio-iommu/spec/v0.12/virtio-iommu-v0.12.pdf http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-dev-diff-v0.11-v0.12.pdf [2] [PATCH v7 0/7] Add virtio-iommu driver https://lore.kernel.org/linux-pci/0ba215f5-e856-bf31-8dd9-a85710714...@arm.com/T/ [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.12 git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.12 Jean-Philippe Brucker (7): dt-bindings: virtio-mmio: Add IOMMU description dt-bindings: virtio: Add virtio-pci-iommu node of: Allow the iommu-map property to omit untranslated devices PCI: OF: Initialize dev->fwnode appropriately iommu: Add virtio-iommu driver iommu/virtio: Add probe request iommu/virtio: Add event queue .../devicetree/bindings/virtio/iommu.txt | 66 + .../devicetree/bindings/virtio/mmio.txt | 30 + MAINTAINERS |7 + drivers/iommu/Kconfig | 11 + drivers/iommu/Makefile|1 + drivers/iommu/virtio-iommu.c | 1176 + drivers/of/base.c | 10 +- drivers/pci/of.c |6 + include/uapi/linux/virtio_ids.h |1 + include/uapi/linux/virtio_iommu.h | 165 +++ 10 files changed, 1470 insertions(+), 3 deletions(-) create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt create mode 100644 drivers/iommu/virtio-iommu.c create mode 100644 include/uapi/linux/virtio_iommu.h -- 2.21.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v8 3/7] of: Allow the iommu-map property to omit untranslated devices
In PCI root complex nodes, the iommu-map property describes the IOMMU that translates each endpoint. On some platforms, the IOMMU itself is presented as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). This isn't supported by the current OF driver, which expects all endpoints to have an IOMMU. Allow the iommu-map property to have gaps. Relaxing of_map_rid() also allows the msi-map property to have gaps, which is invalid since MSIs always reach an MSI controller. In that case pci_msi_setup_msi_irqs() will return an error when attempting to find the device's MSI domain. Reviewed-by: Rob Herring Signed-off-by: Jean-Philippe Brucker --- drivers/of/base.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/of/base.c b/drivers/of/base.c index 20e0e7ee4edf..55e7f5bb0549 100644 --- a/drivers/of/base.c +++ b/drivers/of/base.c @@ -2294,8 +2294,12 @@ int of_map_rid(struct device_node *np, u32 rid, return 0; } - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", - np, map_name, rid, target && *target ? *target : NULL); - return -EFAULT; + pr_info("%pOF: no %s translation for rid 0x%x on %pOF\n", np, map_name, + rid, target && *target ? *target : NULL); + + /* Bypasses translation */ + if (id_out) + *id_out = rid; + return 0; } EXPORT_SYMBOL_GPL(of_map_rid); -- 2.21.0 - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
[virtio-dev] [PATCH v4] Add virtio-iommu device specification
The IOMMU device allows a guest to manage DMA mappings for physical, emulated and paravirtualized endpoints. Add device description for the virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and UNMAP requests, as well as translation error reporting. Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37 Signed-off-by: Jean-Philippe Brucker --- This version addresses comments on v3 [1]: * remove VIRTIO_IOMMU_MAP_F_EXEC for now, * add VIRTIO_IOMMU_F_MMIO feature bit, * change domain_bits -> domain_range, * and countless wording improvements. To help with the review, there is a diff since last version at [2] and the current version at [3]. I'll send an updated version of the driver to the list this week. [1] https://lists.oasis-open.org/archives/virtio-comment/201904/msg8.html [2] http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-dev-diff-v0.11-v0.12.pdf [3] http://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.12-draft.pdf --- conformance.tex | 40 ++- content.tex | 1 + virtio-iommu.tex | 816 +++ 3 files changed, 855 insertions(+), 2 deletions(-) create mode 100644 virtio-iommu.tex diff --git a/conformance.tex b/conformance.tex index 42f702a..79a3e7d 100644 --- a/conformance.tex +++ b/conformance.tex @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets} \begin{itemize} \item Clause \ref{sec:Conformance / Driver Conformance}. \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver Conformance}. -\item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance}. +\item One of clauses \ref{sec:Conformance / Driver Conformance / Network Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver Conformance}. \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device and Transitional Driver Conformance}. \end{itemize} \item[Device] A device MUST conform to four conformance clauses: \begin{itemize} \item Clause \ref{sec:Conformance / Device Conformance}. \item One of clauses \ref{sec:Conformance / Device Conformance / PCI Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device Conformance}. -\item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device Conformance}. +\item One of clauses \ref{sec:Conformance / Device Conformance / Network Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device Conformance}, \ref{sec:Conformance / Device Conformance / Console Device Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI Host Device Conformance}, \ref{sec:Conformance / Dev
[virtio-dev] Re: [PATCH v7 0/7] Add virtio-iommu driver
On 27/05/2019 16:15, Michael S. Tsirkin wrote: > On Mon, May 27, 2019 at 11:26:04AM +0200, Joerg Roedel wrote: >> On Sun, May 12, 2019 at 12:31:59PM -0400, Michael S. Tsirkin wrote: >>> OK this has been in next for a while. >>> >>> Last time IOMMU maintainers objected. Are objections >>> still in force? >>> >>> If not could we get acks please? >> >> No objections against the code, I only hesitated because the Spec was >> not yet official. >> >> So for the code: >> >> Acked-by: Joerg Roedel > > Last spec patch had a bunch of comments not yet addressed. > But I do not remember whether comments are just about wording > or about the host/guest interface as well. > Jean-Philippe could you remind me please? It's mostly wording, but there is a small change in the config space layout and two new feature bits. I'll send a new version of the driver when possible. Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
On 15/05/2019 06:30, Tian, Kevin wrote: >> From: Tian, Kevin >> Sent: Wednesday, May 15, 2019 1:08 PM >> To: Jean-Philippe Brucker ; Michael S. >> Tsirkin >>> >>> >>>>>>> +\begin{description} >>>>>>> + \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)] >>>>>>> +Accesses to virtual addresses in this region have undefined >> behavior. >>>>>> >>>>>> undefined? how can platforms with such a property coexist with >>> untrusted >>>>>> devices? >>>>> >>>>> It is "undefined" because I can't enumerate all cases (don't know about >>>>> all of them). This would contain for example the RMRR regions of VT-d >>>>> (the BIOS reserving some DMA regions for itself), regions allocated for >>>>> host-managed MSIs on arm platforms, and some PCI bridge windows >>>>> (although they removed those from the Linux resv regions recently, not >>>>> certain why). Ideally we wouldn't have any, but some special cases make >>>>> it necessary. > > +Alex. > > Actually I don't think RMRR is right example here (sorry if it's my input in > our previous discussion). There was concern on RMRR providing a shared > memory region between device and platform controller, thus may cause > isolation issue: > https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf > > So the basic policy for devices with RMRR is, excluding them from device > assignment, except for Intel GPU and USB. The latter two has well-understood > behavior on RMRR, thus they can be safely assigned with RMRR ignored > (i.e. RMRR region can be reused as valid IOVA region). Ah I see, seems like a good policy. > As I commented below, I think it's safe to describe reserved region behavior > as either succeed with undefined behavior (e.g. in ARM MSI doorbell case, > where a spurious interrupt on allocated doorbell for this device is triggered > instead of desired DMA access), or cause vIOMMU unrecoverable fault due > to invalid mappings. An untrusted device can never use reserved region to > escape since physically IOMMU only contains verified mapping to granted > resource (MSI) or nothing. Yes, I will add that as a requirement in next version: reserved regions are okay as long as they don't allow a device to escape isolation. Finding the right wording without being too vague is a bit tough for this one. >>>>> >>>>> In general having reserved regions is incompatible with untrusted >>>>> devices, and in some cases incompatible with untrusted guests. But >>>>> choosing the policy about which devices to present to guest is up to the >>>>> platform. The host knows what the resv regions are used for, and if they >>>>> are safe enough. The guest just need to knows what to avoid. >>>> >>>> Yes but ... if you trust driver and device then what's the >>>> point of the iommu? >>> >>> Improving memory allocation. The guest can do scatter-gather for large >>> buffers, instead of allocating big contiguous chunks of guest-physical >>> addresses. And with device pass-through, any memory used for DMA has to >>> be pinned (since devices generally don't support page faults). So when >>> assigning an endpoint to a guest, without a vIOMMU all of the guest >>> memory has to be pinned upfront. With an IOMMU only the memory that >> will >>> actually be used for DMA is pinned. >> >> I'd not vote using current vIOMMU just for avoiding pin instead of for >> isolation purpose... the map/unmap overhead is high except you only >> talking about static mapping e.g. in guest user space driver (e.g. DPDK) >> (but then it's actually for isolation purpose between user and kernel). >> btw we are working on a lighter-weight approach (aim <5% perf drop >> for most cases) while allowing host to fully introspect guest DMA activities. >> Likely we'll introduce a new virtio-iommu capability for this purpose, >> instead of using current MAP/UNMAP primitives. Stay tuned and we'll >> send out RFC soon once getting good result. :-) Sounds great, I'd like to hear more about this. I also have some drafts for performance improvement and SVA, but haven't found time to progress on that in a while. For SVA however, introspection seems nearly impossible (on Arm the host may not even see invalidation commands from the guest). Thanks, Jean >> >> And I don't think having reserved regions is incompatible with untru
Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
On 10/05/2019 17:52, Michael S. Tsirkin wrote: +/* Flags are: */ +#define VIRTIO_IOMMU_MAP_F_READ (1 << 0) +#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1) +#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2) >>> >>> what is exec? pls add some comments near all flags. >> >> Exec means instruction fetch. Some systems have different transaction >> flags for read and exec (PCI only supports that in the PASID TLP prefix) >> and some page tables have a "no exec" bit. > > So let's say I don't set EXEC on such a system. Is EXEC allowed or not? > > Maybe we can ask user to always set EXEC is exec is needed, > but then say we can not promise instruction fetch will fail > if exec is not set. So the semantics could be: * device sets EXEC feature bit - driver sets EXEC flag in MAP, instruction fetch succeeds - driver doesn't set EXEC flag in MAP, instructions fetch will fail * device doesn't set EXEC feature bit, then driver cannot set EXEC flag but instruction fetch may not fail. But I'm tempted to just remove the EXEC flag for now, I don't think anyone needs it at the moment (and VFIO doesn't support it). I'd add it back as NO_EXEC which is closer to what IOMMUs offer. +\begin{description} + \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)] +Accesses to virtual addresses in this region have undefined behavior. >>> >>> undefined? how can platforms with such a property coexist with untrusted >>> devices? >> >> It is "undefined" because I can't enumerate all cases (don't know about >> all of them). This would contain for example the RMRR regions of VT-d >> (the BIOS reserving some DMA regions for itself), regions allocated for >> host-managed MSIs on arm platforms, and some PCI bridge windows >> (although they removed those from the Linux resv regions recently, not >> certain why). Ideally we wouldn't have any, but some special cases make >> it necessary. >> >> In general having reserved regions is incompatible with untrusted >> devices, and in some cases incompatible with untrusted guests. But >> choosing the policy about which devices to present to guest is up to the >> platform. The host knows what the resv regions are used for, and if they >> are safe enough. The guest just need to knows what to avoid. > > Yes but ... if you trust driver and device then what's the > point of the iommu? Improving memory allocation. The guest can do scatter-gather for large buffers, instead of allocating big contiguous chunks of guest-physical addresses. And with device pass-through, any memory used for DMA has to be pinned (since devices generally don't support page faults). So when assigning an endpoint to a guest, without a vIOMMU all of the guest memory has to be pinned upfront. With an IOMMU only the memory that will actually be used for DMA is pinned. > RMRR access basically fails with translation right? > Not undefined. >From section 3.15 of VT-d rev3.0, it looks like they are translated "Requests to these reserved regions may either occur as a result of operations performed by the system software driver (for example in the case of DMA from unified memory access (UMA) graphics controllers to graphics reserved memory), or may be initiated by non system software (for example in case of DMA performed by a USB controller under BIOS SMM control for legacy keyboard emulation)." > What happens with MSIs on ARM? On Arm platforms it's the IRQ chip rather than the IOMMU that performs MSI isolation - ensure that an endpoint doesn't trigger MSIs for another endpoint. The SMMU doesn't differentiate an MSI from a normal write, unlike x86 which has a special address range. MSI-X tables for a pass-through device are managed by the host, and a virtual MSI-X table is presented to the guest. So the host allocates a VA range and use it to map the MSI doorbell in the SMMU. It then reports that VA range as reserved to the guest. If the endpoint did write to the region, nothing bad would happen, it would simply trigger an MSI as if it had triggered via the MSI-X table. But the guest can't use that region for normal memory mapping. or never even reach it. >>> >>> that's ok >> >> We could be losing isolation if some bridge is intercepting accesses to >> a range of addresses. > > hmm sorry not sure I understand. After reading a bit more about this, I don't think this is relevant here. This was about lack of ACS isolation allowing endpoints to do p2p by writing to some specific MMIO region, but we already deal with this in the ATTACH section - either endpoints are properly isolated with ACS, or the platform describes them as being in the same IOMMU group and they cannot be isolated from each other. We don't need to care about reserved MMIO ranges on top of that. RESV_MEM is used for MSI and RMRR-style regions for now, so I'll reformulate this paragraph. I'll also require that accesses to RESV regions don't affect anything else than the endpoint and the SW that owns it. +The driver SHOULD treat
Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification
Thanks for taking the time to review it! For brevity I only replied to questions and the bits I don't completely agree with, I will change the rest. On 03/05/2019 18:05, Michael S. Tsirkin wrote: > On Tue, Apr 30, 2019 at 02:56:48PM +0100, Jean-Philippe Brucker wrote: >> diff --git a/virtio-iommu.tex b/virtio-iommu.tex [...] >> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)] >> + The number of domains supported is described in \field{domain_bits} > > I would rename this to domain_range or something like this. > E.g. there's an option of 0 which is special. I don't see a use for it right now, but I can make this a range. >> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device >> / Feature bits} >> + >> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE, >> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and >> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device. > > Why does it have to accept MAP_UNMAP if it does not want to use it? I was thinking a future version that adds another mode would remove this statement. I can remove it, it's clear enough that currently the driver won't get anything if it doesn't accept the bit. >> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can >> +access guest-physical addresses ("bypass mode"). > > I think this means actually "all unattached endpoints"? > > > So how does this interact with e.g. encrypted memory? > I think it's actually weaker. This bit IMHO just means that > all accesses are allowed by the iommu and all addresses > are translated 1:1. Yes that's what it means, I'll try to clarify the description. I haven't looked at memory encryption yet, I guess it's disabled until you attach a domain, and then you need to pass keys in an extended MAP request. > Also what about other endpoints? With bypass > does an endpoint not in any domain get access to all other > endpoints? Note that it's guest-physical addresses, not physical memory. If other devices are memory-mapped, then with bypass an endpoint can access them. >> If the feature is not >> +negotiated, then any memory access from endpoints will fault. > > fault -> fail? > > So all accesses are disallowed, right? Yes (Minus reserved regions in some cases. More below) I rewrote the paragraph as: "If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from unattached endpoints are allowed and translated by the IOMMU using the identity function. If the feature is not negotiated, any memory access from an unattached endpoint will fail. Upon attaching an endpoint in bypass mode to a new domain, any memory access from the endpoint fails, since the domain does not contain any mapping." >> + >> +\begin{itemize} >> +\item \field{page_size_mask} contains the bitmask of all page sizes that >> + can be mapped. The least significant bit set defines the page >> + granularity of IOMMU mappings. > > so this refers to map/unmap requests only really? Yes >> Other bits in the mask are hints >> + describing page sizes that the IOMMU can merge into a single mapping >> + (page blocks). > > what does this merging refer to? I don't find it anywhere in the spec. It doesn't refer to anything else in the spec, the current wording isn't great. When the host stores mappings in page tables, it's useful for the guest to know which huge page sizes are available. For example if the physical IOMMU can map 2M and 1G blocks in addition to 4k pages, then the mask is 0x40201000. The hints allow a guest to optimize DMA allocation (iommu_dma_alloc() in Linux), in order to minimize the number of IOTLB entries used in hardware. I now have the following: "Other bits in \field{page_size_mask} are hints and describe larger page sizes that the IOMMU device handles efficiently. For example, when the device stores mappings using a page table tree, it may be able to describe large mappings using a few leaf entries in intermediate tables, rather than using lots of entries in the last level of the tree. Creating mappings aligned on large page size can improve performance since they require fewer page tables and TLB entries." >> + The smallest page granularity supported by the IOMMU is one byte. It is >> + legal for the driver to map one byte at a time if bit 0 of >> + \field{page_size_mask} is set. > > With single byte mappings, can't guest quickly exhaust > device memory? what guidance can we provide for handling > this kind of issue? how do device and driver avoid it? The advantage of a single-byte granularity is arbitrary mapping sizes, it doesn't mean that the driver will use any single-byte mapping. If the host uses page tables to store mappings then the granularity is generally 4kB.
Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification
On 03/05/2019 20:47, Michael S. Tsirkin wrote: > On Tue, Apr 30, 2019 at 03:42:14PM +0100, Jean-Philippe Brucker wrote: >> On 30/04/2019 14:56, Jean-Philippe Brucker wrote: >> > The IOMMU device allows a guest to manage DMA mappings for physical, >> > emulated and paravirtualized endpoints. Add device description for the >> > virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and >> > UNMAP requests, as well as translation error reporting. >> > >> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37 >> > Signed-off-by: Jean-Philippe Brucker >> >> I'd like to request a vote for this version (the github issue is now up >> to date and the patch applies cleanly) >> >> Thanks, >> Jean > > I sent some comments that might be worth addressing before the vote. If > you don't agree I'm fine with starting voting now if you like and we'll > let the TC decide, pls let me know. Thanks for the review, I agree with most comments. I'll reply this week and then send a new version, so you can hold off the vote Thanks, Jean - To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification
On 30/04/2019 14:56, Jean-Philippe Brucker wrote: > The IOMMU device allows a guest to manage DMA mappings for physical, > emulated and paravirtualized endpoints. Add device description for the > virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and > UNMAP requests, as well as translation error reporting. > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37 > Signed-off-by: Jean-Philippe Brucker I'd like to request a vote for this version (the github issue is now up to date and the patch applies cleanly) Thanks, Jean > --- > Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in > conformance.tex and using the new \conformance command. > > A PDF version is available at > https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf > --- > conformance.tex | 40 ++- > content.tex | 1 + > virtio-iommu.tex | 850 +++ > 3 files changed, 889 insertions(+), 2 deletions(-) > create mode 100644 virtio-iommu.tex > > diff --git a/conformance.tex b/conformance.tex > index 42f702a..79a3e7d 100644 > --- a/conformance.tex > +++ b/conformance.tex > @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / > Conformance Targets} >\begin{itemize} > \item Clause \ref{sec:Conformance / Driver Conformance}. > \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver > Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O > Driver Conformance}. > -\item One of clauses \ref{sec:Conformance / Driver Conformance / Network > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory > Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI > Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto > Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket > Driver Conformance}. > +\item One of clauses \ref{sec:Conformance / Driver Conformance / Network > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver > Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory > Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI > Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto > Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket > Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU > Driver Conformance}. > \item Clause \ref{sec:Conformance / Legacy Interface: Transitional > Device and Transitional Driver Conformance}. >\end{itemize} > \item[Device] A device MUST conform to four conformance clauses: >\begin{itemize} > \item Clause \ref{sec:Conformance / Device Conformance}. > \item One of clauses \ref{sec:Conformance / Device Conformance / PCI > Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device > Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O > Device Conformance}. > -\item One of clauses \ref{sec:Conformance / Device Conformance / Network > Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device > Conformance}, \ref{sec:Conformance / Device Conformance / Console Device > Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device > Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory > Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI > Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input > Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto > Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket > Device Conformance}. > +\item One of clauses \ref{sec:Conformance / Device Conformance / Network > Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device > Conformance}, \ref{sec:Conformance / Device Conformance / Console Device > Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device > Conformance}, \ref{sec:Conforma