[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior

2023-08-14 Thread Jean-Philippe Brucker
On Sat, Aug 12, 2023 at 03:25:10PM +0900, Akihiko Odaki wrote:
> > > By the way, crosvm's logic to detach endpoint on removal looks incorrect 
> > > for
> > > me. A domain may have several endpoints attached, but the code looks like
> > > it's always destroying a domain whether there are other endpoints attached
> > > to the domain. I'm adding Zide Chen, who wrote the code according to git
> > > blame, and crosvm-...@chromium.org to CC.
> > 
> > Link to this thread for more context:
> > https://lore.kernel.org/virtio-dev/20230803153238.541803-5-jean-phili...@linaro.org/
> > 
> > I thought crosvm rejected attaching multiple endpoints to one domain but I
> > think I misread. Rejecting multiple attach would be a straightforward fix
> > (it's allowed by the spec), though it would prevent assigning endpoints
> > that cannot be isolated from each others by the hardware (the driver won't
> > attach those to different domains, if it's made aware that they should be
> > in the same IOMMU group, for example if they are on a conventional PCI
> > bus).
> 
> Now we figured out an endpoint should not be detached from a domain without
> a request from the driver anyway so the code to detach an endpoint can be
> simply removed.

Yes, but I think the other detach path, when handling ATTACH or DETACH
requests, doesn't support domains with multiple endpoints attached either:

// Currently, we only support detaching an endpoint if it is the only 
endpoint attached
// to its domain.

But the ATTACH handler seems to accept attaching multiple endpoints to the
same domain?

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior

2023-08-11 Thread Jean-Philippe Brucker
On Fri, Aug 11, 2023 at 07:21:29AM +0900, Akihiko Odaki wrote:
> On 2023/08/11 0:10, Jean-Philippe Brucker wrote:
> > On Fri, Aug 04, 2023 at 03:19:27PM +0900, Akihiko Odaki wrote:
> > > On 2023/08/04 0:32, Jean-Philippe Brucker wrote:
> > > > Since it is not obvious what the virtio-iommu device or driver should do
> > > > when an endpoint is removed [1], add some guidance in the specification.
> > > > Follow the same principle as other (hardware) IOMMU devices: clearing
> > > > endpoint ID state on endpoint removal is the responsibility of the
> > > > driver, not the IOMMU device.
> > > > 
> > > > On most hardware IOMMU devices, the endpoint ID state is kept in tables
> > > > managed by the driver, so intuitively the driver should be updating the
> > > > tables on hot-unplug, so that a new endpoint plugged later with the same
> > > > ID does not inherit stale translations. Besides, hardware IOMMUs have no
> > > > knowledge of endpoint removal. It is less obvious for virtio-iommu
> > > > because endpoint states are managed with ATTACH/DETACH requests, and a
> > > > virtual platform could in theory update the endpoint state when it is
> > > > removed. Existing VMMs like QEMU don't do it, and rightly expect the
> > > > driver to manage endpoint ID state like with other IOMMUs. It is not
> > > > invalid for a VMM to clean up state on endpoint removal, but a driver
> > > > shouldn't count on it.
> > > 
> > > I think it's better to say it's invalid to detach on endpoint removal.
> > 
> > It's certainly not a clear cut choice and I'm still hesitating whether we
> > should allow, deny or let the VMM choose what to do.
> > 
> > However, it looks like crosvm already detaches the domain when an endpoint
> > is unplugged:
> > 
> > https://github.com/google/crosvm/blob/e8b2cd080e8174948563567d395c2a42416f2807/devices/src/virtio/iommu/sys/unix.rs#L68
> > 
> > If I understood correctly, this function is called on PCIe hot-unplug, and
> > endpoint_map represents the domain attached to an endpoint ID. So the
> > domain is detached on endpoint removal. If a DETACH request is sent
> > afterwards, it will still succeed (detach_endpoint() in
> > devices/src/virtio/iommu.rs returns detached=true, which causes DETACH to
> > succeed, even if the endpoint is not in endpoint_map). But I could be
> > wrong as I'm not familiar with crosvm or rust.
> > 
> > That does make the decision a little easier, because if we did
> > retroactively specify that detaching on removal is invalid, this code
> > would become a bug. But in my opinion crosvm isn't doing anything wrong
> > here. It feels valid and even good practice to clean up all state
> > associated to an endpoint when it is removed, to avoid leaks and stale
> > objects.
> 
> In this case I think it's creating use-after-free hazard. The guest believes
> it is managing domains and assume a domain is available until it makes a
> DETACH request for the last endpoint attached to the domain.
> 
> However, if the host automatically detaches an endpoint from a domain and if
> the domain has no other endpoints, the domain will be gone and the domain ID
> may be reused for another domain. If the guest makes a request with the
> stale domain ID before it becomes aware that the device is unplugged, the
> request will be performed on some arbitrary domain and may break things.

That's an excellent point, there is a race:

VMMGUEST
 Remove ep 1
   detach domain 1
   destroy domain 1
ATTACH ep 2 to domain 1
 Handle ATTACH
   create domain 1

 Notify that ep 1 is gone
Handle notification, DETACH ep 1

The guest tries to attach ep 2 to domain 1, which succeeds. So the
guest expects that ep 2 is now able to access the mappings that were on
domain 1. But instead it's a new empty domain.

I think that's a good justification for adding a device normative
statement, something like:

  The device SHOULD only detach endpoints from domains when requested by
  the driver; either when handling ATTACH and DETACH requests or when
  performing a device reset.

> 
> By the way, crosvm's logic to detach endpoint on removal looks incorrect for
> me. A domain may have several endpoints attached, but the code looks like
> it's always destroying a domain whether there are other endpoints attached
> to the domain. I'm adding Zide Chen, who wrote the code according to git
> blame, and crosvm-...@chromium.org to CC.

Link to this thread for more context:
https://lore.kernel.org/vi

[virtio-dev] Re: [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior

2023-08-10 Thread Jean-Philippe Brucker
On Fri, Aug 04, 2023 at 03:19:27PM +0900, Akihiko Odaki wrote:
> On 2023/08/04 0:32, Jean-Philippe Brucker wrote:
> > Since it is not obvious what the virtio-iommu device or driver should do
> > when an endpoint is removed [1], add some guidance in the specification.
> > Follow the same principle as other (hardware) IOMMU devices: clearing
> > endpoint ID state on endpoint removal is the responsibility of the
> > driver, not the IOMMU device.
> > 
> > On most hardware IOMMU devices, the endpoint ID state is kept in tables
> > managed by the driver, so intuitively the driver should be updating the
> > tables on hot-unplug, so that a new endpoint plugged later with the same
> > ID does not inherit stale translations. Besides, hardware IOMMUs have no
> > knowledge of endpoint removal. It is less obvious for virtio-iommu
> > because endpoint states are managed with ATTACH/DETACH requests, and a
> > virtual platform could in theory update the endpoint state when it is
> > removed. Existing VMMs like QEMU don't do it, and rightly expect the
> > driver to manage endpoint ID state like with other IOMMUs. It is not
> > invalid for a VMM to clean up state on endpoint removal, but a driver
> > shouldn't count on it.
> 
> I think it's better to say it's invalid to detach on endpoint removal.

It's certainly not a clear cut choice and I'm still hesitating whether we
should allow, deny or let the VMM choose what to do.

However, it looks like crosvm already detaches the domain when an endpoint
is unplugged:

https://github.com/google/crosvm/blob/e8b2cd080e8174948563567d395c2a42416f2807/devices/src/virtio/iommu/sys/unix.rs#L68

If I understood correctly, this function is called on PCIe hot-unplug, and
endpoint_map represents the domain attached to an endpoint ID. So the
domain is detached on endpoint removal. If a DETACH request is sent
afterwards, it will still succeed (detach_endpoint() in
devices/src/virtio/iommu.rs returns detached=true, which causes DETACH to
succeed, even if the endpoint is not in endpoint_map). But I could be
wrong as I'm not familiar with crosvm or rust.

That does make the decision a little easier, because if we did
retroactively specify that detaching on removal is invalid, this code
would become a bug. But in my opinion crosvm isn't doing anything wrong
here. It feels valid and even good practice to clean up all state
associated to an endpoint when it is removed, to avoid leaks and stale
objects.

> Otherwise, the DETACH request made by the driver on endpoint removal will
> result in an error, which may make the driver emit an spurious warning. In
> fact, the Linux driver emits a warning if a DETACH request fails*
> 
> * 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/iommu/virtio-iommu.c?id=809d0810e3520da669d231303608cdf5fe5c1a70
> 

I agree that this WARN_ON is problematic, we may have to demote it to
dev_warn() or remove it entirely.

PCIe supports async removal, where a device is physically removed without
the user first pressing a button to notify the OS that the device is going
away. And I'm currently playing with a PCIe thunderbolt drive where I just
yank the device out like any USB device, without first warning the OS
(apart from unmounting partitions). If a VMM wanted to emulate that, I
guess it would remove the device, possibly detach its IOMMU domain, and
notify the OS afterwards. So the driver needs to accept a failed DETACH
request more quietly.

Thanks,
Jean

> > 
> > [1] 
> > https://lore.kernel.org/all/15bf1b00-3aa0-973a-3a86-3fa5c4d41...@daynix.com/
> > 
> > Reported-by: Akihiko Odaki 
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> >   device-types/iommu/description.tex | 5 +
> >   1 file changed, 5 insertions(+)
> > 
> > diff --git a/device-types/iommu/description.tex 
> > b/device-types/iommu/description.tex
> > index fb1b840..63ccd41 100644
> > --- a/device-types/iommu/description.tex
> > +++ b/device-types/iommu/description.tex
> > @@ -407,6 +407,11 @@ \subsubsection{DETACH request}
> >   After all endpoints have been successfully detached from a domain, it
> >   ceases to exist and its ID can be reused by the driver for another domain.
> > +When an endpoint is removed from the platform (for example
> > +unplugged, or disabled with PCIe SR-IOV), the device is not
> > +required to detach the endpoint ID from its domain. It is the
> > +driver's responsibility to detach the endpoint before removal.
> > +
> >   \drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device 
> > / Device operations / DETACH request}
> >   The driver SHOULD set \field{reserved} to zero.

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 4/4] virtio-iommu: Clarify hot-unplug behavior

2023-08-03 Thread Jean-Philippe Brucker
Since it is not obvious what the virtio-iommu device or driver should do
when an endpoint is removed [1], add some guidance in the specification.
Follow the same principle as other (hardware) IOMMU devices: clearing
endpoint ID state on endpoint removal is the responsibility of the
driver, not the IOMMU device.

On most hardware IOMMU devices, the endpoint ID state is kept in tables
managed by the driver, so intuitively the driver should be updating the
tables on hot-unplug, so that a new endpoint plugged later with the same
ID does not inherit stale translations. Besides, hardware IOMMUs have no
knowledge of endpoint removal. It is less obvious for virtio-iommu
because endpoint states are managed with ATTACH/DETACH requests, and a
virtual platform could in theory update the endpoint state when it is
removed. Existing VMMs like QEMU don't do it, and rightly expect the
driver to manage endpoint ID state like with other IOMMUs. It is not
invalid for a VMM to clean up state on endpoint removal, but a driver
shouldn't count on it.

[1] https://lore.kernel.org/all/15bf1b00-3aa0-973a-3a86-3fa5c4d41...@daynix.com/

Reported-by: Akihiko Odaki 
Signed-off-by: Jean-Philippe Brucker 
---
 device-types/iommu/description.tex | 5 +
 1 file changed, 5 insertions(+)

diff --git a/device-types/iommu/description.tex 
b/device-types/iommu/description.tex
index fb1b840..63ccd41 100644
--- a/device-types/iommu/description.tex
+++ b/device-types/iommu/description.tex
@@ -407,6 +407,11 @@ \subsubsection{DETACH request}
 After all endpoints have been successfully detached from a domain, it
 ceases to exist and its ID can be reused by the driver for another domain.
 
+When an endpoint is removed from the platform (for example
+unplugged, or disabled with PCIe SR-IOV), the device is not
+required to detach the endpoint ID from its domain. It is the
+driver's responsibility to detach the endpoint before removal.
+
 \drivernormative{\paragraph}{DETACH request}{Device Types / IOMMU Device / 
Device operations / DETACH request}
 
 The driver SHOULD set \field{reserved} to zero.
-- 
2.41.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 3/4] virtio-iommu: Fix contradictory and unnecessary statements

2023-08-03 Thread Jean-Philippe Brucker
In the device-normative statement for the ATTACH request, we say:

  If another endpoint is already attached to the domain identified by
  \field{domain}, then the device MAY attach the endpoint identified by
  \field{endpoint} to the domain.

and later:

  If properties of the endpoint (obtained with a PROBE request) are
  compatible with properties of other endpoints already attached to the
  requested domain, then the device SHOULD attach the endpoint.

These statements contradict each others. Attaching multiple endpoints to
one domain is a convenience that allows using fewer page table
resources. The device is always allowed to reject a second attach
request, in which case the driver simply creates a separate domain for
the second endpoint. Remove the "SHOULD" statement and keep the "MAY"
one.

Also remove the following statement which doesn't add anything useful.

Signed-off-by: Jean-Philippe Brucker 
---
 device-types/iommu/description.tex | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/device-types/iommu/description.tex 
b/device-types/iommu/description.tex
index c9b43db..fb1b840 100644
--- a/device-types/iommu/description.tex
+++ b/device-types/iommu/description.tex
@@ -383,12 +383,9 @@ \subsubsection{ATTACH request}\label{sec:Device Types / 
IOMMU Device / Device op
 VIRTIO_IOMMU_S_UNSUPP.
 
 If properties of the endpoint (obtained with a PROBE request) are
-compatible with properties of other endpoints already attached to the
-requested domain, then the device SHOULD attach the endpoint. Otherwise
-the device SHOULD reject the request and set \field{status} to
-VIRTIO_IOMMU_S_UNSUPP.
-
-A device that does not reject the request MUST attach the endpoint.
+not compatible with properties of other endpoints already
+attached to the domain, the device SHOULD reject the request and
+set \field{status} to VIRTIO_IOMMU_S_UNSUPP.
 
 \subsubsection{DETACH request}
 
-- 
2.41.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 1/4] virtio-iommu: Fix typo in label

2023-08-03 Thread Jean-Philippe Brucker
Use the 'sec:' prefix like all other labels

Signed-off-by: Jean-Philippe Brucker 
---
 device-types/iommu/description.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/device-types/iommu/description.tex 
b/device-types/iommu/description.tex
index f8cbe88..d13a880 100644
--- a/device-types/iommu/description.tex
+++ b/device-types/iommu/description.tex
@@ -796,7 +796,7 @@ \subsubsection{PROBE properties}\label{sec:Device Types / 
IOMMU Device / Device
 The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions
 to affect any other component than the endpoint and the driver.
 
-\subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device 
operations / Fault reporting}
+\subsubsection{Fault reporting}\label{sec:Device Types / IOMMU Device / Device 
operations / Fault reporting}
 
 The device can report translation faults and other significant
 asynchronous events on the event virtqueue. The driver initially populates
-- 
2.41.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 0/4] virtio-iommu: Minor specification fixes

2023-08-03 Thread Jean-Philippe Brucker
Patches 1 and 2 fix typos in the IOMMU device type. Patch 3 removes a
contradictory statement, and a useless one. Patch 4 adds some
information about removal of endpoints managed by the IOMMU, following a
recent fix in the Linux driver.

Jean-Philippe Brucker (4):
  virtio-iommu: Fix typo in label
  virtio-iommu: Fix RESV_MEM typo
  virtio-iommu: Fix contradictory and unnecessary statements
  virtio-iommu: Clarify hot-unplug behavior

 device-types/iommu/description.tex | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

-- 
2.41.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 2/4] virtio-iommu: Fix RESV_MEM typo

2023-08-03 Thread Jean-Philippe Brucker
Remove an extra word in the RESV_MEM description.

Signed-off-by: Jean-Philippe Brucker 
---
 device-types/iommu/description.tex | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/device-types/iommu/description.tex 
b/device-types/iommu/description.tex
index d13a880..c9b43db 100644
--- a/device-types/iommu/description.tex
+++ b/device-types/iommu/description.tex
@@ -758,8 +758,8 @@ \subsubsection{PROBE properties}\label{sec:Device Types / 
IOMMU Device / Device
 \begin{description}
   \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
 These virtual addresses cannot be used in a MAP requests. The region
-is be reserved by the device, for example, if the platform needs to
-setup DMA mappings of its own.
+is reserved by the device, for example if the platform needs
+to setup DMA mappings of its own.
 
   \item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
 This region is a doorbell for Message Signaled Interrupts (MSIs). It
-- 
2.41.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: virtio-iommu hotplug issue

2023-04-14 Thread Jean-Philippe Brucker
On Thu, Apr 13, 2023 at 08:01:54PM +0900, Akihiko Odaki wrote:
> Yes, that's right. The guest can dynamically create and delete VFs. The
> device is emulated by QEMU: igb, an Intel NIC recently added to QEMU and
> projected to be released as part of QEMU 8.0.

Ah great, that's really useful, I'll add it to my tests

> > Yes, I think this is an issue in the virtio-iommu driver, which should be
> > sending a DETACH request when the VF is disabled, likely from
> > viommu_release_device(). I'll work on a fix unless you would like to do it
> 
> It will be nice if you prepare a fix. I will test your patch with my
> workload if you share it with me.

I sent a fix:
https://lore.kernel.org/linux-iommu/20230414150744.562456-1-jean-phili...@linaro.org/

Thank you for reporting this, it must have been annoying to debug

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: virtio-iommu hotplug issue

2023-04-13 Thread Jean-Philippe Brucker
Hello,

On Thu, Apr 13, 2023 at 01:49:43PM +0900, Akihiko Odaki wrote:
> Hi,
> 
> Recently I encountered a problem with the combination of Linux's
> virtio-iommu driver and QEMU when a SR-IOV virtual function gets disabled.
> I'd like to ask you what kind of solution is appropriate here and implement
> the solution if possible.
> 
> A PCIe device implementing the SR-IOV specification exports a virtual
> function, and the guest can enable or disable it at runtime by writing to a
> configuration register. This effectively looks like a PCI device is
> hotplugged for the guest.

Just so I understand this better: the guest gets a whole PCIe device PF
that implements SR-IOV, and so the guest can dynamically create VFs?  Out
of curiosity, is that a hardware device assigned to the guest with VFIO,
or a device emulated by QEMU?

> In such a case, the kernel assumes the endpoint is
> detached from the virtio-iommu domain, but QEMU actually does not detach it.
> 
> This inconsistent view of the removed device sometimes prevents the VM from
> correctly performing the following procedure, for example:
> 1. Enable a VF.
> 2. Disable the VF.
> 3. Open a vfio container.
> 4. Open the group which the PF belongs to.
> 5. Add the group to the vfio container.
> 6. Map some memory region.
> 7. Close the group.
> 8. Close the vfio container.
> 9. Repeat 3-8
> 
> When the VF gets disabled, the kernel assumes the endpoint is detached from
> the IOMMU domain, but QEMU actually doesn't detach it. Later, the domain
> will be reused in step 3-8.
> 
> In step 7, the PF will be detached, and the kernel thinks there is no
> endpoint attached and the mapping the domain holds is cleared, but the VF
> endpoint is still attached and the mapping is kept intact.
> 
> In step 9, the same domain will be reused again, and the kernel requests to
> create a new mapping, but it will conflict with the existing mapping and
> result in -EINVAL.
> 
> This problem can be fixed by either of:
> - requesting the detachment of the endpoint from the guest when the PCI
> device is unplugged (the VF is disabled)

Yes, I think this is an issue in the virtio-iommu driver, which should be
sending a DETACH request when the VF is disabled, likely from
viommu_release_device(). I'll work on a fix unless you would like to do it

> - detecting that the PCI device is gone and automatically detach it on
> QEMU-side.
> 
> It is not completely clear for me which solution is more appropriate as the
> virtio-iommu specification is written in a way independent of the endpoint
> mechanism and does not say what should be done when a PCI device is
> unplugged.

Yes, I'm not sure it's in scope for the specification, it's more about
software guidance

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [virtio-comment] Re: [virtio] [PATCH] virtio: clarify feature negotiation

2022-01-26 Thread Jean-Philippe Brucker
On Mon, Jan 24, 2022 at 04:42:35PM -0500, Michael S. Tsirkin wrote:
> So from that point of view, we are actually relaxing the requirements,
> and yes we'll need to carefully go over each instance of
> "offered".
> I thought I did, but now I did a quick grep again and I found instances
> in virtio-iommu.tex and virtio-gpio.tex
> Both of these are new, so I think we can just fix them to "negotiated".
> 
> Jean-Philippe, Viresh, what do you think? Would it be a problem to
> replace offered with negotiated, and restrict iommu and gpio drivers to
> access config space after FEATURES_OK?

No problem for iommu. The config space fields domain_range and input_range
are not useful to the driver before DRIVER_OK. Reading field bypass
earlier could be useful for debugging, but in that case going a few more
steps into device initialization wouldn't hurt.

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability

2022-01-24 Thread Jean-Philippe Brucker
On Wed, Jan 19, 2022 at 06:53:17PM -0500, Michael S. Tsirkin wrote:
> > I think we should be explicit about this particular case because someone
> > implementing this extension might get it wrong. MSIs should not have a
> > PASID because IOMMUs don't support it:
> > 
> > * VT-d treats Requests-with-PASID to the special range 0xfeex as
> >   normal DMA (3.14 Handling Requests to Interrupt Address Range)
> > * AMD IOMMU reports an error for MSIs with PASID (2.2.7.7 PCIe® TLP PASID
> >   Prefix).
> 
> Ouch. I didn't realize. Really weird, of the "what were they thinking"
> variety. The natural thing would be to have PASID==VM and scope
> both DMA and MSI within PASID. I guess that is precluded on these
> platforms then.
> 

I think the problem is mainly that MSIs appear as normal memory writes
requests. If PCI had given them a special type, IOMMU/IRQ remapping
hardware could route them more easily, but as is they only have the
transaction's address to work with. So vendors made different choices for
supporting MSIs with their IRQ remapping hardware. Intel and AMD define
reserved address ranges 0xfeex, where any memory transaction is
treated as MSI.

This is fine when the kernel is in charge of allocating the I/O address
space for DMA, it just needs to make sure IOVAs are not allocated within
the address region reserved for MSIs. But it's more complicated with SVA.

One of the use of PASID is assigning a partition of device (here a group
of virtqueues) to a process. With this approach (SVA) the PASID accesses
the process' page tables, so the virtual address space is shared between
CPU and IOMMU, IOVA==VA. How would we deal with MSIs in this case, if they
had a PASID?  On x86, if the IOMMU performed IRQ remapping instead of
address translation, on a memory write with PASID to address 0xfeex,
the process couldn't use any address in that range for DMA. The process
would need to filter addresses returned by malloc() and treat some of them
as non-DMA'able, which defeats the purpose of SVA. To avoid having to do
this Intel treats all DMA with PASID as normal memory access, and the
process can't accidentally trigger MSIs by using the wrong address. I
think AMD behaves the same way but am not entirely sure - the wording in
one part of the spec (2.2.7.7) seems to imply that PASID access to the MSI
range would trigger an IOMMU error report, in which case the range would
need carving out anyway.

> > * Arm requires creating mappings to the MSI controller in the SMMU. This
> >   has implications for SVA where the PASID accesses the whole process
> >   address space: if MSI transactions have a PASID prefix, that requires
> >   mapping the MSI controller into the process address space on Arm, which
> >   isn't convenient.
> 
> I'd like to understand this part a bit better. Can you explain?
> We are talking about the IOMMU address space right?
> What is wrong with mapping the MSI controller there?
> Isn't convenient in what sense?

On Arm the IRQ remapping device is separate from the IOMMU, so there is no
special address range as far as the IOMMU is concerned. An MSI goes
through IOMMU address translation like any other memory write, the IOVA
gets translated into a PA that corresponds to the doorbell (MMIO) address
of the IRQ remapping device. The IOVA of the MSI is chosen by the OS and
can have any value. Now if MSIs had a PASID, then the IOMMU would
translate MSI addresses with the page tables corresponding to that PASID,
which with SVA correspond to the process page tables. So the kernel would
need to create a mapping inside the process' address space, to the PA of
the IRQ remapping device. That's the part that is inconvenient (would not
be a security issue because accesses to the doorbell from CPUs are ignored
according to Arm's platform spec, but it would certainly be awkward to set
up).

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability

2022-01-19 Thread Jean-Philippe Brucker
On Mon, Jan 17, 2022 at 01:57:47PM +0800, Jason Wang wrote:
> How about something like:
> 
> "
> Except for the MSI transactions, the device MUST use PASID TLP prefix
> for the following memory transactions initiated by the virtqueue that
> belong to a virtqueue group if a valid PASID is assigned and PASID is
> enabled in the PASID extended capability:
> 
> \begin{itemize}
> \item Memory Requests (including AtomicOp Requests) with Untranslated
>   Addresses
> \item Address Translation Requests
> \item Page Request Messages
> \end{itemize}
> 
> The device MUST NOT use PASID TLP prefix for the MSI memory
> transactions.
> 

Sure, that looks good. Maybe splitting it would make it easier to explain.
If we force the driver to enable the PASID cap before using PASID:

"The driver MUST NOT assign a PASID to a virtqueue group if the PASID
extended capability is not enabled."

Then the device description is not concerned about the PASID cap:

"If a valid PASID is assigned to a virtqueue group, the device MUST use a
PASID TLP prefix for memory transactions of the following types initiated
by any virtqueue in the group:
  ...

The device MUST NOT use a PASID TLP prefix for MSI memory transactions.
"

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Some thoughts on zero-copy between VM domains for discussion

2022-01-14 Thread Jean-Philippe Brucker
On Fri, Jan 14, 2022 at 02:28:11PM +, Alex Bennée wrote:
> 
> Stefan Hajnoczi  writes:
> 
> > [[PGP Signed Part:Undecided]]
> > On Thu, Jan 06, 2022 at 05:03:38PM +, Alex Bennée wrote:
> >> 
> >> Hi,
> >> 
> >> To start the new year I thought would dump some of my thoughts on
> >> zero-copy between VM domains. For project Stratos we've gamely avoided
> >> thinking too hard about this while we've been concentrating on solving
> >> more tractable problems. However we can't put it off forever so lets
> >> work through the problem.
> >> 
> >> Memory Sharing
> 
> >> Buffer Allocation
> 
> >> Transparent fallback and scaling
> 
> >> 
> >>  - what other constraints we need to take into account?
> >>  - can we leverage existing sub-systems to build this support?
> >> 
> >> I look forward to your thoughts ;-)
> >
> > (Side note: Shared Virtual Addressing (https://lwn.net/Articles/747230/)
> > is an interesting IOMMU feature. It would be neat to have a CPU
> > equivalent where loads and stores from/to another address space could be
> > done cheaply with CPU support. I don't think this is possible today and
> > that's why software IOMMUs are slow for per-transaction page protection.
> > In other words, a virtio-net TX VM would set up a page table allowing
> > read access only to the TX buffers and vring and the virtual network
> > switch VM would have the capability to access the vring and buffers
> > through the TX VM's dedicated address space.)
> 
> Does binding a device to an address space mean the driver allocations
> will be automatically done from the address space or do the drivers need
> modifying to take advantage of that? Jean-Phillipe?

The drivers do need modifying and the APIs are very different: with SVA
you're assigning partitions of devices (for example virtqueues) to
different applications. So the kernel driver only does management and
userspace takes care of the data path. A device partition accesses the
whole address space of the process it is assigned to, so there is no
explicit DMA allocation.

Note that it also requires special features from the device (multiple
address spaces with PCIe PASID, recoverable DMA page faults with PRI).
And the concept doesn't necessarily fit nicely with all device classes -
you probably don't want to load a full network stack in any application
that needs to send a couple of packets. One that demonstrates the concept
well in my opinion is the crypto/compression accelerators that use SVA in
Linux [1]: any application that needs fast compression or encryption on
some of its memory can open a queue in the accelerator, submit jobs with
pointers to input and output buffers and wait for completion while doing
something else. It is several orders of magnitude faster than letting the
CPU do this work and only the core mm deals with memory management.

Anyway this seems mostly off-topic. As Stefan said what would be useful
for our problem is help from the CPUs to share bits of address space
without disclosing the whole VM memory. At the moment any vIOMMU mapping
needs to be shadowed by the hypervisor somehow. Unless we use static
shared memory regions, frontend VM needs to tell the hypervisor which
pages it chooses to share with backend VM, and the hypervisor to map then
unmap that into backend VM's stage-2. I think that operation requires two
context switches to the hypervisor any way we look at it.

Thanks,
Jean

[1] 
https://lore.kernel.org/linux-iommu/1579097568-17542-1-git-send-email-zhangfei@linaro.org/

> 
> > Some storage and networking applications use buffered I/O where the
> > guest kernel owns the DMA buffer while others use zero-copy I/O where
> > guest userspace pages are pinned for DMA. I think both cases need to be
> > considered.
> >
> > Are guest userspace-visible API changes allowed (e.g. making the
> > userspace application aware at buffer allocation time)?
> 
> I assume you mean enhanced rather than breaking APIs here? I don't see
> why not. Certainly for the vhost-user backends we are writing we aren't
> beholden to sticking to an old API.
> 
> > Ideally the
> > requirement would be that zero-copy must work for unmodified
> > applications, but I'm not sure if that's realistic.
> >
> > By the way, VIRTIO 1.2 introduces Shared Memory Regions. These are
> > memory regions (e.g. PCI BAR ranges) that the guest driver can map. If
> > the host pages must come from a special pool then Shared Memory Regions
> > would be one way to configure the guest to use this special zero-copy
> > area for data buffers and even vrings. New VIRTIO feature bits and
> > transport-specific information may need to be added to do this.
> 
> Are these fixed sizes or could be accommodate a growing/shrinking region?
> 
> Thanks for the pointers.
> 
> -- 
> Alex Bennée

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: 

Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability

2022-01-14 Thread Jean-Philippe Brucker
On Fri, Jan 14, 2022 at 11:15:35AM +0800, Jason Wang wrote:
> 
> 在 2022/1/13 下午11:17, Michael S. Tsirkin 写道:
> > On Thu, Jan 13, 2022 at 02:53:36PM +, Stefan Hajnoczi wrote:
> > > On Thu, Jan 13, 2022 at 05:45:11AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 13, 2022 at 10:32:11AM +, Stefan Hajnoczi wrote:
> > > > > > So my understanding is that instead of delaying the response to read
> > > > > > it's better to simply present the previous PASID value if there're 
> > > > > > any
> > > > > > pending transactions of previous PASID value. The driver can then
> > > > > > choose to poll or using timers etc.
> > > > Don't we limit these changes to before DRIVER_OK?
> > > > If we do then no previous transactions can exist.
> > > Not sure if that's possible for the vDPA virtqueue assignment use case
> > > where one large VIRTIO device provides virtqueues for many smaller vDPA
> > > devices?
> 
> 
> That could be the case but usually a smaller vDPA device requires a bunch of
> hardware resources more than just virtqueues.
> 
> Another example is assign a queue directly to userspace with PASID for SVA.
> 
> 
> > > 
> > > Stefan
> > Then I guess queue must be stopped?
> 
> 
> Yes, since we have virtqueue reset, I can limit the assigning before
> DRIVER_OK or queue(s) are reset. Does this work?

Yes, I was wondering how this would work for the SVA use-case, where
processes can request queues from the virtio driver at runtime:

* Process opens a fd to the queue
* virtio driver selects a disabled queue, gets a PASID for that
  task and assigns it to the queue's group 
* process mmaps its queue and works with it. When it is done, it closes
  the fd.
* virtio driver now makes sure that the virtqueue does not issue any
  more DMA with this PASID, it disables all the queues in the group by
  writing 1 to their queue_reset. After reset completes, it clears the
  group_pasid field.
* The driver releases the PASID (iommu_sva_unbind_device()).
  Both PASID and queue can now safely be assigned to other processes.

So making group_pasid field writeable only if all queues in the group are
disabled should work

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH V2 2/2] virtio-pci: add PASID configuration extended capability

2022-01-14 Thread Jean-Philippe Brucker
Hi Jason,

On Thu, Jan 13, 2022 at 09:24:16AM +0800, Jason Wang wrote:
> The device MUST use PASID TLP prefix for all memory transactions
> initiated by the virtqueue that belong to a virtqueue group ...
> 
> > vring reads/writes and data buffer reads/writes come
> > to mind. Virtqueue MSI-X messages are probably not included? Anything
> > else?
> 
> According to the PCIe spec and the semantic, the PASID TLP prefix
> should be used for
> 
> Memory Requests (including AtomicOp Requests) with Untranslated Addresses.
> Address Translation Requests, ATS Invalidation Messages, Page Request
> Messages, and PRG Response Messages
> 
> So we probably don't need to differentiate MSI-X messages with DMA
> here. That's why I think a general "memory transaction" should be
> sufficient here. If you don't agree, I can clarify it as what PCIe
> spec did.

I think we should be explicit about this particular case because someone
implementing this extension might get it wrong. MSIs should not have a
PASID because IOMMUs don't support it:

* VT-d treats Requests-with-PASID to the special range 0xfeex as
  normal DMA (3.14 Handling Requests to Interrupt Address Range)
* AMD IOMMU reports an error for MSIs with PASID (2.2.7.7 PCIe® TLP PASID
  Prefix).
* Arm requires creating mappings to the MSI controller in the SMMU. This
  has implications for SVA where the PASID accesses the whole process
  address space: if MSI transactions have a PASID prefix, that requires
  mapping the MSI controller into the process address space on Arm, which
  isn't convenient.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH v3] virtio-iommu: Rework the bypass feature

2021-11-11 Thread Jean-Philippe Brucker
Hi Halil,

On Thu, Nov 11, 2021 at 11:26:36AM +0100, Halil Pasic wrote:
> On Fri, 22 Oct 2021 13:12:20 +0100
> Jean-Philippe Brucker  wrote:
> 
> > The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
> > Although it is implemented by QEMU, it is not supported by any driver as
> > far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
> > feature.
> 
> Hi Jean-Philippe!
> 
> Is somebody working on the implementation of this? I would like to be
> Cc-ed with the implementation patches if possible.

Yes I posted the Linux support:
https://lore.kernel.org/linux-iommu/20211013121052.518113-1-jean-phili...@linaro.org/

And QEMU, which depends on the Linux header:
https://lore.kernel.org/qemu-devel/20210930185050.262759-1-jean-phili...@linaro.org/

I plan to resend after the merge window and will Cc you.

> The interaction of s390x and IOMMU is sometimes a little tricky. I would
> like to make sure nothing breaks for us. :)

Bypass should be constrained within the virtio-iommu device and driver,
and so shouldn't break s390 as far as I know. I haven't thought much about
supporting virtio-iommu for archs other than arm and x86 for the moment
(in general it should be as simple as adding it to ACPI tables or DT).
I believe power already has a PV IOMMU interface and doesn't need a
dedicated vIOMMU device, but I'm not sure where s390 stands

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3] virtio-iommu: Rework the bypass feature

2021-10-22 Thread Jean-Philippe Brucker
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
Although it is implemented by QEMU, it is not supported by any driver as
far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
feature.

Two features are missing from virtio-iommu:

* The ability for an hypervisor to start the device in bypass mode. The
  wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
  the moment, because it only specifies the behavior after feature
  negotiation.

* The ability for a guest to set individual endpoints in bypass mode
  when bypass is globally disabled. An OS should have the ability to
  allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
  disabled for endpoints it isn't even aware of. At the moment this can
  only be emulated by creating identity mappings.

The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
that allows to enable and disable bypass globally. It also adds a new
flag for the ATTACH request.

* The hypervisor can start the VM with bypass enabled or, if it knows
  that the software stack supports it, disabled. The 'bypass' config
  fields is initialized to 0 or 1. It is sticky and isn't affected by
  device reset.

* Generally the firmware won't have an IOMMU driver and will need to be
  started in bypass mode, so the bootloader and kernel can be loaded
  from storage endpoint.

  For more security, the firmware could implement a minimal virtio-iommu
  driver that reuses existing virtio support and only touches the config
  space. It could enable PCI bus mastering in bridges only for the
  endpoints that need it, enable global IOMMU bypass by flipping a bit,
  then tear everything down before handing control over to the OS. This
  prevents vulnerability windows where a malicious endpoint reprograms
  the IOMMU while the OS is configuring it [1].

  The isolation provided by vIOMMUs has mainly been used for securely
  assigning endpoints to untrusted applications so far, while kernel DMA
  bypasses the IOMMU. But we can expect boot security to become as
  important in virtualization as it presently is on bare-metal systems,
  where some devices are untrusted and must never be able to access
  memory that wasn't assigned to them.

* The OS can enable and disable bypass globally. It can then enable
  bypass for individual endpoints by attaching them to bypass domains,
  using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass
  by attaching them to normal domains.

[1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept
Morgan, B., Alata, É., Nicomette, V. et al.
https://link.springer.com/article/10.1186/s13173-017-0066-7

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119
Signed-off-by: Jean-Philippe Brucker 
---

The virtio-iommu spec with colored diff is available at
https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v3-diff.pdf

v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07817.html
v2: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07898.html
v3:
* Normative statement about device reset vs. system reset - the bypass
  bit is sticky across device reset to avoid the vulnerability described
  above, but restored on system reset (a term also used by the virtio
  memory device).
* Explain that the field bypass is in effect as long as the new feature
  is offered, even when not accepted by the driver
* Another clarification about the state of an endpoint after detach.
---
 conformance.tex  |   1 -
 virtio-iommu.tex | 111 ++-
 2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/conformance.tex b/conformance.tex
index c2d7959..5c10ab9 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -504,7 +504,6 @@ \section{Conformance Targets}\label{sec:Conformance / 
Conformance Targets}
 \begin{itemize}
 \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration 
layout}
-\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / 
ATTACH request}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / 
DETACH request}
diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..f8cbe88 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -59,14 +59,21 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
   VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
 
 \item[VIRTIO_IOMMU_F_BYPASS (3)]
-  When not attached to a domain, endpoints downstream of the IOMMU
-  can access the guest-physical address space.
+  Endpoints that are not attached to a domain are in bypass mode.
 
 \item[VIRTIO_IOMMU_F_PROBE (4)]
   The PROBE request is available.
 
 \item

Re: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature

2021-10-15 Thread Jean-Philippe Brucker
On Thu, Oct 14, 2021 at 02:05:30AM +, Tian, Kevin wrote:
> > > >   For more security, the firmware could implement a minimal virtio-
> > iommu
> > > >   driver that reuses existing virtio support and only touches the config
> > > >   space. It could enable PCI bus mastering in bridges only for the
> > > >   endpoints that need it, enable global IOMMU bypass by flipping a bit,
> > >
> > > then untrusted devices can do DMA after global bypass is enabled?
> > 
> > Not necessarily, I think firmware could configure PCI switch ports
> > upstream of untrusted devices, by keeping Bus Master Enable disabled in
> > their config space which will block Memory and I/O Requests.
> 
> guest firmware only manages virtual switch ports. Physically they are
> supposed to be already enabled in the process of device passthrough. 

Right I was considering virtual devices and virtual switches. For physical
devices, the switch emulation would need to disable the IOMMU mappings
when bus mastering is disabled in the switch. In either case, though, it's
probably more complexity than hypervisors are willing to put in. QEMU
seems to ignore Bus Master Enable on root ports, for example.

> Given a device is untrusted, it may ignore Bus Master Enable bit 
> inadvertently or deliberately to issue DMA. I feel concept-wise it's
> not good for the firmware to open this window when boot-bypass is
> disabled...
> 
> > 
> > > suppose
> > > here just needs attach storage device to a normal domain, leaving other
> > > devices still blocked from doing DMA
> > 
> > Yes that's the safest route, but also more complicated to implement.
> > If firmware can setup the PCI topology as above, its virtio-iommu driver
> > just needs to poke the config space for feature negotiation and enabling
> > bypass, no need to setup virtqueues.
> 
> I see the value of this simplified approach. But if it cannot isolate 
> untrusted devices then we may have to follow the complicated route

Agreed

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature

2021-10-13 Thread Jean-Philippe Brucker
On Wed, Oct 13, 2021 at 08:37:51AM +, Tian, Kevin wrote:
> > * The ability for an hypervisor to start the device in bypass mode. The
> >   wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
> >   the moment.
> 
> unclear because it's worded around driver negotiation thus the behavior
> is undefined before the driver is loaded?

Yes exactly. The spec doesn't spell out what the behavior is before
feature negotiation, even though the straightforward interpretation is
that DMA doesn't bypass (which prevents an hypervisor from implementing
boot-bypass at the moment).


> > 
> > * The ability for a guest to set individual endpoints in bypass mode
> >   when bypass is globally disabled. An OS should have the ability to
> >   allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
> >   disabled for endpoints it isn't even aware of. At the moment this can
> >   only be emulated by creating identity mappings.
> 
> Is driver required to first disable global bypass before it can attach
> endpoints to normal domains?

No

> 
> > 
> > The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
> > that allows to enable and disable bypass globally. It also adds a new
> > flag for the ATTACH request.
> > 
> > * The hypervisor can start the VM with bypass enabled or, if it knows
> >   that the software stack supports it, disabled. The 'bypass' config
> >   fields resets to 0 or 1.
> 
> is sticky and unchanged by reset
> 
> > 
> > * Generally the firmware won't have an IOMMU driver and will need to be
> >   started in bypass mode, so the bootloader and kernel can be loaded
> >   from storage endpoint.
> 
> How does the firmware know whether storage load is allowed w/o looking
> at the bypass info in virtio-iommu? 

Either the firmware can't drive virtio-iommu, and must be booted in
bypass, or it can and then it sets/clears bypass. That's why we're going
to default qemu to boot-bypass [1], because in general the firmware won't
support anything else. An admin that has more knowledge in the software
stack, and knows that the firmware is able to drive the IOMMU, can then
disable boot-bypass for more security.

[1] 
https://lore.kernel.org/qemu-devel/20210930185050.262759-1-jean-phili...@linaro.org/

> 
> > 
> >   For more security, the firmware could implement a minimal virtio-iommu
> >   driver that reuses existing virtio support and only touches the config
> >   space. It could enable PCI bus mastering in bridges only for the
> >   endpoints that need it, enable global IOMMU bypass by flipping a bit,
> 
> then untrusted devices can do DMA after global bypass is enabled?

Not necessarily, I think firmware could configure PCI switch ports
upstream of untrusted devices, by keeping Bus Master Enable disabled in
their config space which will block Memory and I/O Requests.

> suppose
> here just needs attach storage device to a normal domain, leaving other
> devices still blocked from doing DMA

Yes that's the safest route, but also more complicated to implement.
If firmware can setup the PCI topology as above, its virtio-iommu driver
just needs to poke the config space for feature negotiation and enabling
bypass, no need to setup virtqueues.

> 
> >   then tear everything down before handing control over to the OS. This
> >   prevents vulnerability windows where a malicious endpoint reprograms
> >   the IOMMU while the OS is configuring it [1].

[...]
> > +An endpoint is in bypass mode if:
> > +\begin{itemize}
> > +  \item the VIRTIO_IOMMU_F_BYPASS_CONFIG feature is offered and:
> > +\begin{itemize}
> > +  \item config field \field{bypass} is 1 and the endpoint is not
> > +attached to a domain, or
> > +  \item the endpoint is attached to a domain with
> > +VIRTIO_IOMMU_ATTACH_F_BYPASS.
> 
> should we mention that BYPASS_CONFIG effect doesn't depend on
> negotiation?

That's the distinction between "offered" vs "negotiated", but I can say it
explicitly


[...]
> >  Detach an endpoint from a domain. When this request completes, the
> > -endpoint cannot access any mapping from that domain anymore. If feature
> > -VIRTIO_IOMMU_F_BYPASS has been negotiated, then once this request
> > -completes all accesses from the endpoint are allowed and translated by the
> > -IOMMU using the identity function.
> > +endpoint cannot access any mapping from that domain anymore.
> 
> Related to my earlier question, what about global bypass is enabled? If
> domain attach is allowed in that configuration, then above statement
> is not accurate.

The statement only refers to domain mappings, so when global bypass is
enabled, after detach the endpoint can access everything. Eric also
commented on this, I could add:

However the endpoint may then be in bypass mode and access the
guest-physical address space.

So the reader can refer to the definition of bypass mode and see what
happens to unattached endpoints.

Thanks,
Jean



[virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature

2021-10-11 Thread Jean-Philippe Brucker
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
Although it is implemented by QEMU, it is not supported by any driver as
far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
feature.

Two features are missing from virtio-iommu:

* The ability for an hypervisor to start the device in bypass mode. The
  wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
  the moment.

* The ability for a guest to set individual endpoints in bypass mode
  when bypass is globally disabled. An OS should have the ability to
  allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
  disabled for endpoints it isn't even aware of. At the moment this can
  only be emulated by creating identity mappings.

The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
that allows to enable and disable bypass globally. It also adds a new
flag for the ATTACH request.

* The hypervisor can start the VM with bypass enabled or, if it knows
  that the software stack supports it, disabled. The 'bypass' config
  fields resets to 0 or 1.

* Generally the firmware won't have an IOMMU driver and will need to be
  started in bypass mode, so the bootloader and kernel can be loaded
  from storage endpoint.

  For more security, the firmware could implement a minimal virtio-iommu
  driver that reuses existing virtio support and only touches the config
  space. It could enable PCI bus mastering in bridges only for the
  endpoints that need it, enable global IOMMU bypass by flipping a bit,
  then tear everything down before handing control over to the OS. This
  prevents vulnerability windows where a malicious endpoint reprograms
  the IOMMU while the OS is configuring it [1].

  The isolation provided by vIOMMUs has mainly been used for securely
  assigning endpoints to untrusted applications so far, while kernel DMA
  bypasses the IOMMU. But we can expect boot security to become as
  important in virtualization as it presently is on bare-metal systems,
  where some devices are untrusted and must never be able to access
  memory that wasn't assigned to them.

* The OS can enable and disable bypass globally. It can then enable
  bypass for individual endpoints by attaching them to bypass domains,
  using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass
  by attaching them to normal domains.

[1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept
Morgan, B., Alata, É., Nicomette, V. et al.
https://link.springer.com/article/10.1186/s13173-017-0066-7

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119
Signed-off-by: Jean-Philippe Brucker 
---

The virtio-iommu spec with colored diff is available at
https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v2-diff.pdf

v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07817.html
v2:
* Keep description of VIRTIO_IOMMU_F_BYPASS, moved to Device operations
  and merged with VIRTIO_IOMMU_F_BYPASS_CONFIG.
  Added normative statements about which feature devices should offer.
* Describe initial and reset bypass value with normative statements.
* Addressed the other comments.
---
 conformance.tex  |   1 -
 virtio-iommu.tex | 102 +--
 2 files changed, 82 insertions(+), 21 deletions(-)

diff --git a/conformance.tex b/conformance.tex
index c52f1a4..7f70fc1 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -503,7 +503,6 @@ \section{Conformance Targets}\label{sec:Conformance / 
Conformance Targets}
 \begin{itemize}
 \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration 
layout}
-\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / 
ATTACH request}
 \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / 
DETACH request}
diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..fcda138 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -59,14 +59,21 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
   VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
 
 \item[VIRTIO_IOMMU_F_BYPASS (3)]
-  When not attached to a domain, endpoints downstream of the IOMMU
-  can access the guest-physical address space.
+  Endpoints that are not attached to a domain are in bypass mode.
 
 \item[VIRTIO_IOMMU_F_PROBE (4)]
   The PROBE request is available.
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)]
+  Field \field{bypass} of struct virtio_iommu_config determines
+  whether endpoints that are not attached to a domain are in
+  bypass mode. Flag VIRTIO_IOMMU_ATTACH_F_BYPASS determines
+  whether endpoints

Re: [virtio-dev] [PATCH] virtio-iommu: Rework the bypass feature

2021-10-07 Thread Jean-Philippe Brucker
Hi Eric,

On Wed, Oct 06, 2021 at 02:53:50PM +0200, Eric Auger wrote:
> > * The hypervisor can start the VM with bypass enabled or, if it knows
> >   that the software stack supports it, disabled. The 'bypass' config
> >   fields resets to 0 or 1.
> I don't see the reset value documented in the actual spec while it looks
> quite important for our boot bypass functionality. Or did I miss it?

No you're right, I'll add this. It's important to say that the device may
initialize the field to 1, but also that a reset (writing a 0 to device
status) doesn't re-initialize the field to its boot value (because that
may introduce a vulnerability during the FW->OS transition)

  If the device offers the VIRTIO_IOMMU_F_BYPASS_CONFIG feature, it MAY
  initialize the \field{bypass} field to 1. When the driver resets the
  device, the \field{bypass} field SHOULD NOT change.

[...]
> > diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> > index efa735b..a2c90ea 100644
> > --- a/virtio-iommu.tex
> > +++ b/virtio-iommu.tex
> > @@ -59,14 +59,19 @@ \subsection{Feature bits}\label{sec:Device Types / 
> > IOMMU Device / Feature bits}
> >VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
> >  
> >  \item[VIRTIO_IOMMU_F_BYPASS (3)]
> > -  When not attached to a domain, endpoints downstream of the IOMMU
> > -  can access the guest-physical address space.
> > +  This feature is deprecated.
> Why are you obliged to deprecate the feature.

We could leave the feature, but it adds complexity in the spec, and in
drivers if they have to support both. Maybe we can leave the feature,
merge its description with F_BYPASS_CONFIG and state something like:

  The VIRTIO_IOMMU_F_BYPASS_CONFIG feature supersedes
  VIRTIO_IOMMU_F_BYPASS. New devices SHOULD NOT offer
  VIRTIO_IOMMU_F_BYPASS.

> Can't you leave it and
> impose that either
> 
> VIRTIO_IOMMU_F_BYPASS or VIRTIO_IOMMU_F_BYPASS_CONFIG gets set?

Just to be clear, you mean devices must not implement both features at the
same time, right?  Not that devices must implement at least one of the
feature? (I don't think we can do that)

> 
> >  
> >  \item[VIRTIO_IOMMU_F_PROBE (4)]
> >The PROBE request is available.
> >  
> >  \item[VIRTIO_IOMMU_F_MMIO (5)]
> >The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
> > +
> > +\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)]
> > +  The global bypass state is described in \field{bypass} of
> > +  struct virtio_iommu_config. The domain bypass state is
> > +  described by VIRTIO_IOMMU_ATTACH_F_BYPASS.
> I would re-emphasize the global bypass state relates to bypass state of
> unattached devices whereas domain bypass state controls bypass for
> devices attached to a given domain.

Right, though I think I'll drop "bypass state" and describe everything
using "endpoint in bypass mode", so we can keep a single definition in one
place.

[...]
> >  \subsection{Device initialization}\label{sec:Device Types / IOMMU Device / 
> > Device initialization}
> >  
> >  When the device is reset, endpoints are not attached to any domain.
> >  
> > -If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
> > -unattached endpoints are allowed and translated by the IOMMU using the
> > -identity function. If the feature is not negotiated, any memory access
> > -from an unattached endpoint fails. Upon attaching an endpoint in
> > -bypass mode to a new domain, any memory access from the endpoint fails,
> > -since the domain does not contain any mapping.
> > +An endpoint is in bypass mode if the VIRTIO_IOMMU_F_BYPASS_CONFIG
> > +feature is offered and:
> > +\begin{itemize}
> > +  \item config field \field{bypass} is 1 and the endpoint is not
> > +attached to a domain, or
> > +  \item the endpoint is attached to a domain with
> > +VIRTIO_IOMMU_ATTACH_F_BYPASS.
> > +\end{itemize}
> > +
> > +All accesses from an endpoint in bypass mode are allowed and
> > +translated by the IOMMU using the identity function.

I think I'll move this description to the Device operations section, not
sure why it was here in the first place.

> >  
> >  Future devices might support more modes of operation besides MAP/UNMAP.
> >  Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail
> > @@ -136,8 +156,8 @@ \subsection{Device initialization}\label{sec:Device 
> > Types / IOMMU Device / Devic
> >  
> >  \devicenormative{\subsubsection}{Device Initialization}{Device Types / 
> > IOMMU Device / Device Initialization}
> >  
> > -If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
> > -device SHOULD NOT let endpoints access the guest-physical address space.
> > +The device SHOULD NOT let unattached endpoints that are not in
> > +bypass mode access the guest-physical address space.
> SHOULD NOT or MUST NOT. Can you remind me of the difference?

They are defined in rfc2119 https://datatracker.ietf.org/doc/html/rfc2119

SHOULD NOT
  This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist
  valid reasons in particular circumstances when the particular behavior
 

[virtio-dev] [PATCH v2] content: Remove duplicate paragraph

2021-10-06 Thread Jean-Philippe Brucker
It looks like commit 356aeeb40d7a ("content: add vendor specific cfg
type") had a merge issue. It includes the device normative paragraph for
Shared memory capability, which was already added right above it by
commit 855ad7af2bd6 ("shared memory: Define PCI capability").

The two paragraphs differ, and the first paragraph is correct. It refers
to struct virtio_pci_cap64 which embeds struct virtio_pci_cap:

  struct virtio_pci_cap64 {
struct virtio_pci_cap {
  ...
  le32 offset;
  le32 length;
} cap;
u32 offset_hi;
u32 length_hi;
  }

Merge the two paragraph while keeping the best of both. Drop the spaces
after \field to stay consistent with the rest of the document.

Acked-by: Michael S. Tsirkin 
Signed-off-by: Jean-Philippe Brucker 
---
v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07815.html
---
 content.tex | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/content.tex b/content.tex
index 288e62f..39fe12b 100644
--- a/content.tex
+++ b/content.tex
@@ -1143,17 +1143,10 @@ \subsubsection{Shared memory 
capability}\label{sec:Virtio Transport Options / Vi
 
 \devicenormative{\paragraph}{Shared memory capability}{Virtio Transport 
Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability}
 
-The region defined by the combination of \field{cap.offset},
-\field{offset_hi}, and \field {cap.length}, \field{length_hi}
-MUST be contained within the BAR specified by \field{cap.bar}.
-
-The \field{cap.id} MUST be unique for any one device instance.
-
-\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport 
Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability}
-
-The region defined by the combination of the \field {cap.offset},
-\field {cap.offset_hi}, and \field {cap.length}, \field
-{cap.length_hi} fields MUST be contained within the declared bar.
+The region defined by the combination of the \field{cap.offset},
+\field{offset_hi}, and \field{cap.length}, \field{length_hi}
+fields MUST be contained within the BAR specified by
+\field{cap.bar}.
 
 The \field{cap.id} MUST be unique for any one device instance.
 
-- 
2.33.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH] content: Remove duplicate paragraph

2021-10-06 Thread Jean-Philippe Brucker
On Mon, Oct 04, 2021 at 12:56:48PM +0200, Cornelia Huck wrote:
> On Thu, Sep 30 2021, Jean-Philippe Brucker  wrote:
> 
> > It looks like commit 356aeeb40d7a ("content: add vendor specific cfg
> > type") had a merge issue. It includes the device normative paragraph for
> > Shared memory capability, which was already added right above it by
> > commit 855ad7af2bd6 ("shared memory: Define PCI capability"). Remove the
> > duplicate paragraph.
> >
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> > This applies onto "Fix copy/paste bug in PCI transport paragraph", that
> > fixes both paragraph titles.
> > https://lists.oasis-open.org/archives/virtio-comment/202108/msg00097.html
> > ---
> >  content.tex | 8 
> >  1 file changed, 8 deletions(-)
> >
> > diff --git a/content.tex b/content.tex
> > index 4418ac8..54f7441 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -1149,14 +1149,6 @@ \subsubsection{Shared memory 
> > capability}\label{sec:Virtio Transport Options / Vi
> >  
> >  The \field{cap.id} MUST be unique for any one device instance.
> >  
> > -\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport 
> > Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory 
> > capability}
> > -
> > -The region defined by the combination of the \field {cap.offset},
> > -\field {cap.offset_hi}, and \field {cap.length}, \field
> > -{cap.length_hi} fields MUST be contained within the declared bar.
> > -
> > -The \field{cap.id} MUST be unique for any one device instance.
> > -
> >  \subsubsection{Vendor data capability}\label{sec:Virtio
> >  Transport Options / Virtio Over PCI Bus / PCI Device Layout /
> >  Vendor data capability}
> 
> Hm. This statement is indeed included twice, but the two are not exactly
> the same. For the first paragraph, do we want to use a combination of
> the two? IOW, something like
> 
> "The region defined by the combination of the \field {cap.offset},
> \field {cap.offset_hi}, and \field {cap.length}, \field
> {cap.length_hi} fields MUST be contained within the BAR specified by
> \field{cap.bar}."
> 

Makes sense, although thinking more about this it was the first paragraph
that was correct, because it refers to:

  struct virtio_pci_cap64 {
struct virtio_pci_cap {
  ...
  le32 offset;
  le32 length;
} cap;
u32 offset_hi;
u32 length_hi;
  }

So it would be \field{cap.offset} and \field{offset_hi}.
I'll send a v2

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH] virtio-iommu: Rework the bypass feature

2021-09-30 Thread Jean-Philippe Brucker
The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
Although it is implemented by QEMU, it is not supported by any driver as
far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
feature. The old feature bit is deprecated and will be recycled once
versions of QEMU that implement it are not distributed anymore.

Two features are missing from virtio-iommu:

* The ability for an hypervisor to start the device in bypass mode. The
  wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
  the moment.

* The ability for a guest to set individual endpoints in bypass mode
  when bypass is globally disabled. An OS should have the ability to
  allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
  disabled for endpoints it isn't even aware of. At the moment this can
  only be emulated by creating identity mappings.

The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
that allows to enable and disable bypass globally. It also adds a new
flag for the ATTACH request.

* The hypervisor can start the VM with bypass enabled or, if it knows
  that the software stack supports it, disabled. The 'bypass' config
  fields resets to 0 or 1.

* Generally the firmware won't have an IOMMU driver and will need to be
  started in bypass mode, so the bootloader and kernel can be loaded
  from storage endpoint.

  For more security, the firmware could implement a minimal virtio-iommu
  driver that reuses existing virtio support and only touches the config
  space. It could enable PCI bus mastering in bridges only for the
  endpoints that need it, enable global IOMMU bypass by flipping a bit,
  then tear everything down before handing control over to the OS. This
  prevents vulnerability windows where a malicious endpoint reprograms
  the IOMMU while the OS is configuring it [1].

  The isolation provided by vIOMMUs has mainly been used for securely
  assigning endpoints to untrusted applications so far, while kernel DMA
  bypasses the IOMMU. But we can expect boot security to become as
  important in virtualization as it presently is on bare-metal systems,
  where some devices are untrusted and must never be able to access
  memory that wasn't assigned to them.

* The OS can enable and disable bypass globally. It can then enable
  bypass for individual endpoints by attaching them to bypass domains,
  using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable bypass
  by attaching them to normal domains.

[1] IOMMU protection against I/O attacks: a vulnerability and a proof of concept
Morgan, B., Alata, É., Nicomette, V. et al.
https://link.springer.com/article/10.1186/s13173-017-0066-7

Signed-off-by: Jean-Philippe Brucker 
---
The virtio-iommu spec with colored diff is available at
https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-config-v1-diff.pdf

Apologies for the poorly thought out VIRTIO_IOMMU_F_BYPASS. I didn't
spend enough time on it and ignored valuable suggestions.
---
 virtio-iommu.tex | 69 +---
 1 file changed, 53 insertions(+), 16 deletions(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index efa735b..a2c90ea 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -59,14 +59,19 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
   VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
 
 \item[VIRTIO_IOMMU_F_BYPASS (3)]
-  When not attached to a domain, endpoints downstream of the IOMMU
-  can access the guest-physical address space.
+  This feature is deprecated.
 
 \item[VIRTIO_IOMMU_F_PROBE (4)]
   The PROBE request is available.
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_BYPASS_CONFIG (6)]
+  The global bypass state is described in \field{bypass} of
+  struct virtio_iommu_config. The domain bypass state is
+  described by VIRTIO_IOMMU_ATTACH_F_BYPASS.
+
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -97,12 +102,19 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 le32 end;
   } domain_range;
   le32 probe_size;
+  u8 bypass;
+  u8 reserved[3];
 };
 \end{lstlisting}
 
 \drivernormative{\subsubsection}{Device configuration layout}{Device Types / 
IOMMU Device / Device configuration layout}
 
-The driver MUST NOT write to device configuration fields.
+When the VIRTIO_IOMMU_F_BYPASS_CONFIG feature is negotiated, the
+driver MAY write to \field{bypass}. The driver MUST NOT write to
+any other device configuration field.
+
+If field \field{bypass} contains a value different than 0 or 1,
+the driver SHOULD treat it as 0.
 
 \devicenormative{\subsubsection}{Device configuration layout}{Device Types / 
IOMMU Device / Device configuration layout}
 
@@ -110,16 +122,24 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 the page granularity

[virtio-dev] [PATCH] content: Remove duplicate paragraph

2021-09-30 Thread Jean-Philippe Brucker
It looks like commit 356aeeb40d7a ("content: add vendor specific cfg
type") had a merge issue. It includes the device normative paragraph for
Shared memory capability, which was already added right above it by
commit 855ad7af2bd6 ("shared memory: Define PCI capability"). Remove the
duplicate paragraph.

Signed-off-by: Jean-Philippe Brucker 
---
This applies onto "Fix copy/paste bug in PCI transport paragraph", that
fixes both paragraph titles.
https://lists.oasis-open.org/archives/virtio-comment/202108/msg00097.html
---
 content.tex | 8 
 1 file changed, 8 deletions(-)

diff --git a/content.tex b/content.tex
index 4418ac8..54f7441 100644
--- a/content.tex
+++ b/content.tex
@@ -1149,14 +1149,6 @@ \subsubsection{Shared memory 
capability}\label{sec:Virtio Transport Options / Vi
 
 The \field{cap.id} MUST be unique for any one device instance.
 
-\devicenormative{\paragraph}{Shared memory capability}{Virtio Transport 
Options / Virtio Over PCI Bus / PCI Device Layout / Shared memory capability}
-
-The region defined by the combination of the \field {cap.offset},
-\field {cap.offset_hi}, and \field {cap.length}, \field
-{cap.length_hi} fields MUST be contained within the declared bar.
-
-The \field{cap.id} MUST be unique for any one device instance.
-
 \subsubsection{Vendor data capability}\label{sec:Virtio
 Transport Options / Virtio Over PCI Bus / PCI Device Layout /
 Vendor data capability}
-- 
2.33.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH] virtio-iommu: Add VIRTIO_IOMMU_F_BOOT_BYPASS

2021-02-26 Thread Jean-Philippe Brucker
On Thu, Feb 25, 2021 at 02:11:17PM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 25, 2021 at 06:53:15PM +0100, Jean-Philippe Brucker wrote:
> > Specify the behavior of the device before feature negotiation.
> > Implementations that allow DMA to bypass the IOMMU during boot inform
> > the driver by setting the VIRTIO_IOMMU_F_BOOT_BYPASS feature.
> > Negotiating the feature doesn't have any effect.
> 
> from spec text it kind of looks like it does, after
> FEATURES_OK devices are disallowed access?

Before FEATURES_OK the BOOT_BYPASS feature defines the policy chosen by
the device implementation.  After FEATURES_OK the driver overrides this
policy using the BYPASS feature.

Thinking more about this, we can't redefine F_BYPASS now (QEMU offers it),
but I'm tempted to deprecate it and replace it with a new feature bit that
indicates presence of a bypass field in config space. Device sets the byte
to 0 or 1 to declare its default bypass policy, and driver can override
this by writing 0 or 1 (currently done by accepting or refusing F_BYPASS).
It would be a lot cleaner than this.

Or just state that the boot-bypass behavior is up to the implementation
and leave it at that.

> > Clarify the description for VIRTIO_IOMMU_F_BYPASS while we're at it,
> > because "downstream of the IOMMU" is confusing.
> > 
> > Signed-off-by: Jean-Philippe Brucker 
> 
> are the two bypass features dependend on each other then?

No the device can offer one without the other.

Thanks,
Jean

> 
> > ---
> >  virtio-iommu.tex | 28 
> >  1 file changed, 20 insertions(+), 8 deletions(-)
> > 
> > diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> > index 08b358a..4f34a14 100644
> > --- a/virtio-iommu.tex
> > +++ b/virtio-iommu.tex
> > @@ -59,7 +59,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
> > Device / Feature bits}
> >VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
> >  
> >  \item[VIRTIO_IOMMU_F_BYPASS (3)]
> > -  When not attached to a domain, endpoints downstream of the IOMMU
> > +  When not attached to a domain, endpoints managed by the IOMMU
> >can access the guest-physical address space.
> >  
> >  \item[VIRTIO_IOMMU_F_PROBE (4)]
> > @@ -67,6 +67,10 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
> > Device / Feature bits}
> >  
> >  \item[VIRTIO_IOMMU_F_MMIO (5)]
> >The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
> > +
> > +\item[VIRTIO_IOMMU_F_BOOT_BYPASS (6)]
> > +  Before feature negotiation, endpoints managed by the IOMMU
> > +  can access the guest-physical address space.
> >  \end{description}
> >  
> >  \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device 
> > / Feature bits}
> > @@ -114,12 +118,15 @@ \subsection{Device initialization}\label{sec:Device 
> > Types / IOMMU Device / Devic
> >  
> >  When the device is reset, endpoints are not attached to any domain.
> >  
> > -If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
> > -unattached endpoints are allowed and translated by the IOMMU using the
> > -identity function. If the feature is not negotiated, any memory access
> > -from an unattached endpoint fails. Upon attaching an endpoint in
> > -bypass mode to a new domain, any memory access from the endpoint fails,
> > -since the domain does not contain any mapping.
> > +Memory accesses from an endpoint bypass the IOMMU, that is all
> > +accesses are allowed and translated using the identity function,
> > +in the following cases:
> > +\begin{itemize}
> > +\item If the VIRTIO_IOMMU_F_BOOT_BYPASS feature is offered and
> > +  the FEATURES_OK status bit is not set.
> 
> confused. so this feature *only* has effect before FEATURES_OK?
> 
> 
> > +\item If the VIRTIO_IOMMU_F_BYPASS feature is negotiated and the
> > +  endpoint is not attached to a domain.
> > +\end{itemize}
> >  
> >  Future devices might support more modes of operation besides MAP/UNMAP.
> >  Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail
> > @@ -136,8 +143,13 @@ \subsection{Device initialization}\label{sec:Device 
> > Types / IOMMU Device / Devic
> >  
> >  \devicenormative{\subsubsection}{Device Initialization}{Device Types / 
> > IOMMU Device / Device Initialization}
> >  
> > +If the device does not offer the VIRTIO_IOMMU_F_BOOT_BYPASS
> > +feature, it SHOULD NOT let endpoints access the guest-physical
> > +address space before feature negotiation is complete.
> > +
> >  If the driver does not accept the VI

[virtio-dev] [PATCH] virtio-iommu: Add VIRTIO_IOMMU_F_BOOT_BYPASS

2021-02-25 Thread Jean-Philippe Brucker
Specify the behavior of the device before feature negotiation.
Implementations that allow DMA to bypass the IOMMU during boot inform
the driver by setting the VIRTIO_IOMMU_F_BOOT_BYPASS feature.
Negotiating the feature doesn't have any effect.

Clarify the description for VIRTIO_IOMMU_F_BYPASS while we're at it,
because "downstream of the IOMMU" is confusing.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..4f34a14 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -59,7 +59,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
   VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
 
 \item[VIRTIO_IOMMU_F_BYPASS (3)]
-  When not attached to a domain, endpoints downstream of the IOMMU
+  When not attached to a domain, endpoints managed by the IOMMU
   can access the guest-physical address space.
 
 \item[VIRTIO_IOMMU_F_PROBE (4)]
@@ -67,6 +67,10 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_BOOT_BYPASS (6)]
+  Before feature negotiation, endpoints managed by the IOMMU
+  can access the guest-physical address space.
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -114,12 +118,15 @@ \subsection{Device initialization}\label{sec:Device Types 
/ IOMMU Device / Devic
 
 When the device is reset, endpoints are not attached to any domain.
 
-If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
-unattached endpoints are allowed and translated by the IOMMU using the
-identity function. If the feature is not negotiated, any memory access
-from an unattached endpoint fails. Upon attaching an endpoint in
-bypass mode to a new domain, any memory access from the endpoint fails,
-since the domain does not contain any mapping.
+Memory accesses from an endpoint bypass the IOMMU, that is all
+accesses are allowed and translated using the identity function,
+in the following cases:
+\begin{itemize}
+\item If the VIRTIO_IOMMU_F_BOOT_BYPASS feature is offered and
+  the FEATURES_OK status bit is not set.
+\item If the VIRTIO_IOMMU_F_BYPASS feature is negotiated and the
+  endpoint is not attached to a domain.
+\end{itemize}
 
 Future devices might support more modes of operation besides MAP/UNMAP.
 Drivers verify that devices set VIRTIO_IOMMU_F_MAP_UNMAP and fail
@@ -136,8 +143,13 @@ \subsection{Device initialization}\label{sec:Device Types 
/ IOMMU Device / Devic
 
 \devicenormative{\subsubsection}{Device Initialization}{Device Types / IOMMU 
Device / Device Initialization}
 
+If the device does not offer the VIRTIO_IOMMU_F_BOOT_BYPASS
+feature, it SHOULD NOT let endpoints access the guest-physical
+address space before feature negotiation is complete.
+
 If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
-device SHOULD NOT let endpoints access the guest-physical address space.
+device SHOULD NOT let endpoints access the guest-physical address space
+after feature negotiation is complete.
 
 \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device 
operations}
 
-- 
2.30.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 4/4] virtio-iommu: Add Arm 64-bit page tables

2021-02-05 Thread Jean-Philippe Brucker
Allow a driver to detect support for Arm page tables and attach a set of
tables.

When the host IOMMU is an Arm SMMUv2, a single address space per
endpoint is supported. The driver detects the Arm LPAE page table format
in a PROBE request and sends an ATTACH_TABLE request with a pointer to
the page directory.

When the host IOMMU is an Arm SMMUv3, an intermediate "Context
Descriptor" (CD) table has to be attached. That table is indexed by
PASID and contains multiple pointers to page directories:

 CD table  Page directory
+---+   ,--->+---+
:   :   |:   :
+---+   |+---+
 PASID->|  CD   |---'VA->|  PTE  |--> GPA
+---++---+
:   ::   :
+---++---+

The driver detects support for both table formats in a PROBE request and
sends an ATTACH_TABLE request with a pointer to the context descriptor
table.

These features are voluntarily omitted:
* Stall, which cannot be supported without recoverable page faults. I do
  plan to implement this once we get to the I/O Page fault extensions.
* Big endian page tables.
* AArch32 LPAE format and per-armv8 short descriptor format.
* Page-based Hardware Attributes (PBHA), Auxiliary Memory Attributes
  (AMAIR)
* Memory Partioning And Monitoring (MPAM)

Support for any of these could easily be added later, by adding flags to
the PROBE property and fields in the ATTACH_TABLE config. However, most
of these extensions might never be needed.

Signed-off-by: Jean-Philippe Brucker 
---
 introduction.tex |   6 ++
 virtio-iommu.tex | 180 +++
 2 files changed, 186 insertions(+)

diff --git a/introduction.tex b/introduction.tex
index bc0217f..1603726 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -76,6 +76,12 @@ \section{Normative References}\label{sec:Normative 
References}
\phantomsection\label{intro:I2C}\textbf{[I2C]} &
I2C-bus specification and user manual,
\newline\url{https://www.nxp.com/docs/en/user-guide/UM10204.pdf}\\
+   \phantomsection\label{intro:SMMUv3}\textbf{[SMMUv3]} &
+   Arm System Memory Management Unit version 3
+   \newline\url{https://developer.arm.com/documentation/ihi0070/latest} \\
+   \phantomsection\label{intro:Armv8-A}\textbf{[ARMv8-A]} &
+   Armv8-A Architecture Reference Manual
+   \newline\url{https://developer.arm.com/documentation/ddi0487/latest} \\
 
 \end{longtable}
 
diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index e14bd67..bc18342 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -1176,3 +1176,183 @@ \subsubsection{Fault reporting}\label{sec:Device Types 
/ IOMMU Device / Device o
 \footnotetext{This would happen for example if the device implements a
 more recent version of this specification, whose fault report contains
 additional fields.}
+
+\subsection{Table formats}\label{sec:Device Types / IOMMU Device / Table 
Formats}
+
+Supported table formats in PROBE properties and ATTACH_TABLE
+requests are:
+\begin{description}
+  \item[VIRTIO_IOMMU_PST_ARM_SMMU3 (1)] Arm SMMUv3 Context
+Descriptor Tables.
+  \item[VIRTIO_IOMMU_PGT_ARM64 (2)] Arm VMSAv8-64 page tables.
+\end{description}
+
+\subsubsection{Arm SMMUv3 Context Descriptor table}\label{sec:Device Types / 
IOMMU Device / Table Formats / Arm SMMUv3 Context Descriptor}
+
+Attach Context Descriptor tables (PASID tables) in the format
+described in the \hyperref[intro:SMMUv3]{Arm System Memory
+Management Unit v3 specification}. The table contains context
+descriptors indexed by PASID, each pointing to a page directory.
+
+\paragraph{PROBE properties for Arm SMMUv3 Context Descriptor 
tables}\label{sec:Device Types / IOMMU Device / Table Formats / Arm SMMUv3 
Context Descriptor / PROBE}
+
+\begin{lstlisting}
+struct virtio_iommu_probe_pst_arm_smmu3 {
+  struct virtio_iommu_probe_property head;
+  le16  format;
+  u8reserved[2];
+  le64  flags;
+};
+
+#define VIRTIO_IOMMU_PST_ARM_SMMU3_F_BTM(1ULL << 0)
+\end{lstlisting}
+
+Supported flags are:
+\begin{description}
+  \item[VIRTIO_IOMMU_PST_ARM_SMMU3_F_BTM] Broadcast TLB
+maintenance is supported. INVALIDATE requests for
+\field{caches} VIRTIO_IOMMU_INVAL_C_TLB can be replaced by
+broadcast TLBI instructions. INVALIDATE requests for
+\field{caches} VIRTIO_IOMMU_INVAL_C_EP_TLB, if the endpoint
+has PCIe ATS enabled, are still necessary.
+\end{description}
+
+\paragraph{ATTACH_TABLE request for Arm SMMUv3 Context Descriptor 
tables}\label{sec:Device Types / IOMMU Device / Table Formats / Arm SMMUv3 
Context Descriptor / ATTACH_TABLE}
+
+\begin{lstlisting}
+struct virtio_iommu_req_attach_pst_arm_smmu3 {
+  struct virtio_iommu_req_head head;
+  le32  domain;
+  le32  endpoint;
+  le16  format;
+  u8s1fmt;
+  u8s1dss;
+  le6

[virtio-dev] [PATCH 3/4] virtio-iommu: Add PROBE properties for tables

2021-02-05 Thread Jean-Philippe Brucker
Add several PROBE properties required for attaching tables to a domain.
The MAP_UNMAP interface is simple enough that the same global properties
can be used for all endpoints managed by the device. This won't be the
case for the table interface. Hardware and emulated endpoints have
difference properties, which have to be described in PROBE requests.

* PAGE_SIZE_MASK: supported mapping granularity
* INPUT_RANGE: input address size (IOVA)
* OUTPUT_SIZE: output address size (GPA)
* PASID_SIZE: Process Address Space ID size. To know the maximum size of
  a PASID table.
* TABLE_FORMAT: PASID or page table format. These will be described in
  subsequent patches.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 159 ++-
 1 file changed, 158 insertions(+), 1 deletion(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 1ca2638..e14bd67 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -871,7 +871,12 @@ \subsubsection{PROBE request}\label{sec:Device Types / 
IOMMU Device / Device ope
 \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties}
 
 \begin{lstlisting}
-#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
+#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
+#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK 2
+#define VIRTIO_IOMMU_PROBE_T_INPUT_RANGE3
+#define VIRTIO_IOMMU_PROBE_T_OUTPUT_SIZE4
+#define VIRTIO_IOMMU_PROBE_T_PASID_SIZE 5
+#define VIRTIO_IOMMU_PROBE_T_TABLE_FORMAT   6
 \end{lstlisting}
 
 \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device 
operations / PROBE properties / RESVMEM}
@@ -934,6 +939,158 @@ \subsubsection{PROBE properties}\label{sec:Device Types / 
IOMMU Device / Device
 The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions
 to affect any other component than the endpoint and the driver.
 
+
+\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties / PAGE_SIZE_MASK}
+
+\begin{lstlisting}
+struct virtio_iommu_probe_page_size_mask {
+  struct virtio_iommu_probe_property head;
+  u8reserved[4];
+  le64  page_size_mask;
+};
+\end{lstlisting}
+
+The PAGE_SIZE_MASK property overrides the global
+\field{page_size_mask} configuration for an endpoint.
+
+The \field{page_size_mask} field behaves in the same way as the
+global \field{page_size_mask} field, described in \ref{sec:Device
+Types / IOMMU Device / Device operations}. The least significant
+bit describes the mapping granularity, while additional bits are
+hints.
+
+\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGE_SIZE_MASK}
+
+The driver MUST ignore \field{reserved}.
+
+\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGE_SIZE_MASK}
+
+The device SHOULD set \field{reserved} to zero.
+
+The device MAY present multiple PAGE_SIZE_MASK property per
+endpoint.
+
+
+\paragraph{Property INPUT_RANGE}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties / INPUT_RANGE}
+
+\begin{lstlisting}
+struct virtio_iommu_probe_input_range {
+  struct virtio_iommu_probe_property head;
+  u8reserved[4];
+  le64  start;
+  le64  end;
+};
+\end{lstlisting}
+
+The INPUT_RANGE property overrides the global \field{input_range}
+configuration for an endpoint.
+
+Fields \field{start} and \field{end} behave in the same way as
+the global \field{input_range.start} and \field{input_range.end}
+fields, described in \ref{sec:Device Types / IOMMU Device /
+Device operations}.
+
+\drivernormative{\subparagraph}{Property INPUT_RANGE}{Device Types / IOMMU 
Device / Device operations / PROBE properties / INPUT_RANGE}
+
+The driver MUST ignore \field{reserved}.
+
+\devicenormative{\subparagraph}{Property INPUT_RANGE}{Device Types / IOMMU 
Device / Device operations / PROBE properties / INPUT_RANGE}
+
+The device SHOULD set \field{reserved} to zero.
+
+The device SHOULD NOT present more than one INPUT_RANGE property
+per endpoint.
+
+The device MAY present an INPUT_RANGE property even if it does
+not offer the VIRTIO_IOMMU_F_INPUT_RANGE feature.
+
+
+\paragraph{Property OUTPUT_SIZE}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties / OUTPUT_SIZE}
+
+\begin{lstlisting}
+struct virtio_iommu_probe_output_size {
+  struct virtio_iommu_probe_property head;
+  u8   bits_count;
+  u8   reserved[3];
+};
+\end{lstlisting}
+
+The OUTPUT_SIZE property describes the maximum guest-physical
+address that the device supports for this endpoint.
+\field{bits_count} is the number of bits that an address can use.
+
+\drivernormative{\subparagraph}{Property OUTPUT_SIZE}{Device Types / IOMMU 
Device / Device operations / PROBE properties / OUTPUT_SIZE}
+
+The driver MUST ignore \field{reserved}.
+
+The driver SHOULD NOT use guest-physical addresses larger

[virtio-dev] [PATCH 2/4] virtio-iommu: Add INVALIDATE request

2021-02-05 Thread Jean-Philippe Brucker
Add a request that invalidates mappings of a domain attached with table.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 151 +++
 1 file changed, 151 insertions(+)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 840f897..1ca2638 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -169,6 +169,7 @@ \subsection{Device operations}\label{sec:Device Types / 
IOMMU Device / Device op
 #define VIRTIO_IOMMU_T_UNMAP4
 #define VIRTIO_IOMMU_T_PROBE5
 #define VIRTIO_IOMMU_T_ATTACH_TABLE 6
+#define VIRTIO_IOMMU_T_INVALIDATE   7
 \end{lstlisting}
 
 A few general-purpose status codes are defined here.
@@ -612,6 +613,156 @@ \subsubsection{UNMAP request}\label{sec:Device Types / 
IOMMU Device / Device ope
 mapping, then the device SHOULD remove all mappings affected by the range
 and set the request \field{status} to VIRTIO_IOMMU_S_OK.
 
+\subsubsection{INVALIDATE request}\label{sec:Device Types / IOMMU Device / 
Device operations / INVALIDATE request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_invalidate {
+  struct virtio_iommu_req_head head;
+  u8scope;
+  u8caches;
+  le16  flags;
+  le32  domain;
+  le32  pasid;
+  le64  id;
+  le64  virt_start;
+  le64  nr_pages;
+  u8page_size;
+  u8reserved[19];
+  struct virtio_iommu_req_tail tail;
+};
+
+#define VIRTIO_IOMMU_INVAL_S_DOMAIN   1
+#define VIRTIO_IOMMU_INVAL_S_PASID2
+#define VIRTIO_IOMMU_INVAL_S_ADDRESS  3
+
+#define VIRTIO_IOMMU_INVAL_C_PASID(1 << 0)
+#define VIRTIO_IOMMU_INVAL_C_TLB  (1 << 1)
+#define VIRTIO_IOMMU_INVAL_C_EP_TLB   (1 << 2)
+
+#define VIRTIO_IOMMU_INVAL_F_LEAF (1 << 0)
+#define VIRTIO_IOMMU_INVAL_F_PASID(1 << 1)
+#define VIRTIO_IOMMU_INVAL_F_ID   (1 << 2)
+
+\end{lstlisting}
+
+Invalidate a mapping or configuration. When using PASID or page
+tables, the driver sends an INVALIDATE request to signal changes
+to table elements that could have been cached by the device, such
+as a mapping removal.
+
+Field \field{scope} specifies which entries to invalidate:
+\begin{description}
+  \item[VIRTIO_IOMMU_INVAL_S_DOMAIN] invalidates all cached
+entries for the given \field{domain}.
+  \item[VIRTIO_IOMMU_INVAL_S_PASID] invalidates all cached
+entries for the given \field{domain} and \field{pasid}.
+  \item[VIRTIO_IOMMU_INVAL_S_ADDRESS] invalidates all cached
+entries for the given \field{domain}, \field{pasid} and
+virtual address range.
+\end{description}
+
+Field \field{caches} specifies which caches to invalidate:
+\begin{description}
+  \item[VIRTIO_IOMMU_INVAL_C_PASID] Invalidate entries cached
+from the PASID table.
+  \item[VIRTIO_IOMMU_INVAL_C_TLB] Invalidate entries from the
+Translation Lookaside Buffer (TLB).
+  \item[VIRTIO_IOMMU_INVAL_C_EP_TLB] Invalidate entries from the
+TLB of endpoints attached to the domain. For example when a
+PCIe endpoint has an Address Translation Cache (ATC),
+discovered and enabled through the PCIe ATS capability.
+\end{description}
+
+For example, when the driver removes a mapping from a table
+attached to the \field{domain}, it sends an INVALIDATE request
+with \field{scope} VIRTIO_IOMMU_INVAL_S_ADDRESS, \field{caches}
+VIRTIO_IOMMU_INVAL_C_TLB and, if necessary,
+VIRTIO_IOMMU_INVAL_C_EP_TLB. To invalidate the whole address
+space the driver sets the \field{scope} to
+VIRTIO_IOMMU_INVAL_S_DOMAIN, or to VIRTIO_IOMMU_INVAL_S_PASID
+when using a PASID table. When removing a page table directory
+from a PASID table entry, the driver sends an INVALIDATE request
+with \field{scope} VIRTIO_IOMMU_INVAL_S_PASID, and all
+\field{caches} flags set.
+
+Field \field{flags} specifies additional information:
+\begin{description}
+  \item[VIRTIO_IOMMU_INVAL_F_LEAF] Only invalidate TLB entries
+cached from leaf table entries.
+  \item[VIRTIO_IOMMU_INVAL_F_PASID] The \field{pasid} field is
+valid.
+  \item[VIRTIO_IOMMU_INVAL_F_ID] The \field{id} field is valid.
+\end{description}
+
+Use of field \field{id} is specific to the table format. Some
+formats use only the PASID to identify address spaces within a
+domain, others use a separate ID for TLB entries.
+
+For a \field{scope} of VIRTIO_IOMMU_INVAL_S_ADDRESS, the
+invalidation affects the range of virtual addresses of size
+$\field{nr\_pages} \times 2^\field{page\_size}$, starting at
+\field{virt_start}. Specifying the range size this way allows the
+device to efficiently remove TLB entries. For example a page
+table format could allow a 2MiB range to be mapped with either
+512 4KiB pages, or a single 2MiB block. In the latter case a
+single TLB entry is used, so the driver specifies a
+\field{page_size} of 21 and the device does not need to iterate
+over all 4KiB multiples to invalidate the range.
+
+The following table describes the restricted set of valid
+\field{caches}, \field{flags} and fields for each \field{scope}:
+
+\begin{tabular}{|l|c|c|c|}
+\hline
+  \fiel

[virtio-dev] [PATCH 1/4] virtio-iommu: Add ATTACH_TABLE request

2021-02-05 Thread Jean-Philippe Brucker
Add a way to attach page tables to a domain. Only introduces the feature
bit and the ATTACH_TABLE request. We'll also need new PROBE properties
and INVALIDATE requests.

I chose ATTACH_TABLE request rather than ATTACH followed by SET_TABLE
because it leads to fewer domain states. With ATTACH_TABLE a domain can
be free, attached-with-map or attached-with-pgtable. With
ATTACH+SET_TABLE, a domain can be free, attached, attached-with-map or
attached-with-pgtable. I believe managing state transitions for
ATTACH_TABLE is simpler.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 74 ++--
 1 file changed, 65 insertions(+), 9 deletions(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index efa735b..840f897 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -54,9 +54,7 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
   The number of domains supported is described in \field{domain_range}.
 
 \item[VIRTIO_IOMMU_F_MAP_UNMAP (2)]
-  Map and unmap requests are available.\footnote{Future extensions may add
-  different modes of operations. At the moment, only
-  VIRTIO_IOMMU_F_MAP_UNMAP is supported.}
+  Map and unmap requests are available.
 
 \item[VIRTIO_IOMMU_F_BYPASS (3)]
   When not attached to a domain, endpoints downstream of the IOMMU
@@ -67,6 +65,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_ATTACH_TABLE (6)]
+  The ATTACH_TABLE and INVALIDATE requests are available.
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -77,7 +78,8 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \devicenormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
 
-The device SHOULD offer feature bit VIRTIO_IOMMU_F_MAP_UNMAP.
+The device MUST offer at least one of the
+VIRTIO_IOMMU_F_MAP_UNMAP or VIRTIO_IOMMU_F_ATTACH_TABLE features.
 
 \subsection{Device configuration layout}\label{sec:Device Types / IOMMU Device 
/ Device configuration layout}
 
@@ -161,11 +163,12 @@ \subsection{Device operations}\label{sec:Device Types / 
IOMMU Device / Device op
 Type may be one of:
 
 \begin{lstlisting}
-#define VIRTIO_IOMMU_T_ATTACH 1
-#define VIRTIO_IOMMU_T_DETACH 2
-#define VIRTIO_IOMMU_T_MAP3
-#define VIRTIO_IOMMU_T_UNMAP  4
-#define VIRTIO_IOMMU_T_PROBE  5
+#define VIRTIO_IOMMU_T_ATTACH   1
+#define VIRTIO_IOMMU_T_DETACH   2
+#define VIRTIO_IOMMU_T_MAP  3
+#define VIRTIO_IOMMU_T_UNMAP4
+#define VIRTIO_IOMMU_T_PROBE5
+#define VIRTIO_IOMMU_T_ATTACH_TABLE 6
 \end{lstlisting}
 
 A few general-purpose status codes are defined here.
@@ -301,6 +304,10 @@ \subsubsection{ATTACH request}\label{sec:Device Types / 
IOMMU Device / Device op
 The driver SHOULD ensure that endpoints that cannot be isolated from each
 other are attached to the same domain.
 
+The driver SHOULD NOT create the \field{domain} with an ATTACH
+request if the VIRTIO_IOMMU_F_MAP_UNMAP feature was not
+negotiated. In this case the ATTACH_TABLE request is used.
+
 \devicenormative{\paragraph}{ATTACH request}{Device Types / IOMMU Device / 
Device operations / ATTACH request}
 
 If the \field{reserved} field of an ATTACH request is not zero, the device
@@ -330,6 +337,49 @@ \subsubsection{ATTACH request}\label{sec:Device Types / 
IOMMU Device / Device op
 
 A device that does not reject the request MUST attach the endpoint.
 
+\subsubsection{ATTACH_TABLE request}\label{ref:Device Types / IOMMU Device / 
Device operations / ATTACH_TABLE request}
+
+\begin{lstlisting}
+struct virtio_iommu_req_attach_table {
+   struct virtio_iommu_req_head head;
+   le32 domain;
+   le32 endpoint;
+   le16 format;
+   u8   descriptor[62];
+   struct virtio_iommu_req_tail tail;
+};
+\end{lstlisting}
+
+Attach an endpoint to a domain, in the same way as an ATTACH
+request. In addition, provide a pointer to a table describing the
+mappings. Instead of using MAP and UNMAP requests, the driver
+creates and removes mappings by writing into the tables. When a
+mapping that may have been cached in a TLB is removed, the driver
+sends an INVALIDATE request.
+
+The driver discovers with a PROBE request the table formats
+supported by the device. The device can support multiple page
+tables and PASID table formats. A PASID table contains pointers
+to one or more page tables. The driver chooses a format it
+recognises, and sends the ATTACH_TABLE request.
+
+Apart from \field{domain}, \field{endpoint} and \field{format},
+each table format uses different parameters in field
+\field{descriptor}. Table formats and their associated parameters
+are described in section \ref{sec:Device Types / IOMMU Device /
+Table Formats}.
+
+\drivernormative{\paragraph}{ATTACH_TABLE

[virtio-dev] [PATCH 0/4] virtio-iommu: Support page tables

2021-02-05 Thread Jean-Philippe Brucker
Add page table support to the virtio-iommu specification, to enable
acceleration and later Shared Virtual Memory.

The interface currently offered by virtio-iommu is simple to implement
and sufficiently efficient for solutions based on static mappings, such
as DPDK running in guest userspace. While that's an essential use-case
for virtio-iommu, it is not the only DMA remapping usage, and the
MAP/UNMAP interface performs rather poorly for dynamic mappings.
Unsurprisingly a given vIOMMU model cannot be simple, efficient and
secure for all possible use-cases. It's always a compromise.

This extension adds:

* New PROBE properties that describe table formats available for each
  endpoints.
* An ATTACH_TABLE request to be used in place of an ATTACH, that
  provides a guest-physical address to a table.
* An INVALIDATE request to signal destructive changes.

Those tables may be:

(1) Page tables that are read directly by the hardware IOMMU, for
  hardware endpoints. An example for Arm SMMU is given in patch 4. The
  driver writes mappings into the table, and sends invalidations when
  removing mappings that could have been cached in an IOTLB. This
  divides the number of context switches to the host by at least two
  (much more with lazy invalidation), and with a vhost IOMMU the
  invalidation itself should be relatively cheap compared to an emulated
  viommu.

  I don't expect a large proliferation of table formats because they are
  relatively stable. A few extensible formats is more likely: two tables
  for Arm until they run out of PTE bits, one or two page tables for
  x86, and a couple for other architectures.

(2) Tables that are read by a software component performing DMA, for
  emulated and paravirtual endpoints. These could be your garden-variety
  page tables from (1) but being paravirtual we have the opportunity to
  design more efficient solutions. For example a virtio/vhost endpoint
  accesses rings often and leaf buffers once. If the virtio-iommu device
  avoids caching leaf buffers, the driver does not need to send
  invalidations and we get almost no context switches for vIOMMU
  operations. A purposely designed page table can provide a handshake
  between device and driver to avoid sending invalidations for uncached
  mappings.

(3) Tables that complement the MAP_UNMAP interface to help reduce
  context switches. The coIOMMU model [1] provides a DMA tracking table
  for co-operative pinning.


An example driver implementation was submitted by Vivek Gautam [2],
based on an older version of this extension (0.10-dev03). Since then
I've reworked some of the structures and improved the wording. To ease
review I uploaded a PDF with these changes, plus one with diff colors,
here:

https://jpbrucker.net/virtio-iommu/spec/virtio-iommu-v0.13.pdf
https://jpbrucker.net/virtio-iommu/spec/virtio-iommu-diff-v0.12-v0.13.pdf


[1] https://www.usenix.org/system/files/atc20-tian.pdf
[2] 
https://lore.kernel.org/linux-iommu/20210115121342.15093-1-vivek.gau...@arm.com/

Jean-Philippe Brucker (4):
  virtio-iommu: Add ATTACH_TABLE request
  virtio-iommu: Add INVALIDATE request
  virtio-iommu: Add PROBE properties for tables
  virtio-iommu: Add Arm 64-bit page tables

 introduction.tex |   6 +
 virtio-iommu.tex | 564 ++-
 2 files changed, 560 insertions(+), 10 deletions(-)

-- 
2.30.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-11-04 Thread Jean-Philippe Brucker
Hi Al,

On Tue, Nov 03, 2020 at 01:09:04PM -0700, Al Stone wrote:
> So, there are some questions about the VIOT definition and I just
> don't know enough to be able to answer them.  One of the ASWG members
> is trying to understand the semantics behind the subtables.

Thanks for the update. We dropped subtables a few versions ago, though, do
you have the latest v8?
https://jpbrucker.net/virtio-iommu/viot/viot-v8.pdf

> Is there a particular set of people, or mailing lists, that I can
> point to to get the questions answered?  Ideally it would be one
> of the public lists where it has already been discussed, but an
> individual would be fine, too.  No changes have been proposed, just
> some questions asked.

For a public list, I suggest io...@lists.linux-foundation.org if we should
pick only one (otherwise add virtualizat...@lists.linux-foundation.org and
virtio-dev@lists.oasis-open.org). I'm happy to answer any question, and
the folks on here are a good set to Cc:

eric.au...@redhat.com
jean-phili...@linaro.org
j...@8bytes.org
kevin.t...@intel.com
lorenzo.pieral...@arm.com
m...@redhat.com
sebastien.bo...@intel.com

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-09-25 Thread Jean-Philippe Brucker
On Fri, Sep 25, 2020 at 06:22:57AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 25, 2020 at 10:48:06AM +0200, Jean-Philippe Brucker wrote:
> > On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote:
> > > Add a topology description to the virtio-iommu driver and enable x86
> > > platforms.
> > > 
> > > Since [v2] we have made some progress on adding ACPI support for
> > > virtio-iommu, which is the preferred boot method on x86. It will be a
> > > new vendor-agnostic table describing para-virtual topologies in a
> > > minimal format. However some platforms don't use either ACPI or DT for
> > > booting (for example microvm), and will need the alternative topology
> > > description method proposed here. In addition, since the process to get
> > > a new ACPI table will take a long time, this provides a boot method even
> > > to ACPI-based platforms, if only temporarily for testing and
> > > development.
> > > 
> > > v3:
> > > * Add patch 1 that moves virtio-iommu to a subfolder.
> > > * Split the rest:
> > >   * Patch 2 adds topology-helper.c, which will be shared with the ACPI
> > > support.
> > >   * Patch 4 adds definitions.
> > >   * Patch 5 adds parser in topology.c.
> > > * Address other comments.
> > > 
> > > Linux and QEMU patches available at:
> > > https://jpbrucker.net/git/linux virtio-iommu/devel
> > > https://jpbrucker.net/git/qemu virtio-iommu/devel
> > 
> > I'm parking this work again, until we make progress on the ACPI table, or
> > until a platform without ACPI and DT needs it. Until then, I've pushed v4
> > to my virtio-iommu/topo branch and will keep it rebased on master.
> > 
> > Thanks,
> > Jean
> 
> I think you guys need to work on virtio spec too, not too much left to
> do there ...

I know it's ready and I'd really like to move on with this, but I'd rather
not commit it to the spec until we know it's going to be used at all. As
Gerd pointed out the one example we had, microvm, now supports ACPI. Since
we've kicked off the ACPI work anyway it isn't clear that the built-in
topology will be useful.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-09-25 Thread Jean-Philippe Brucker
On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote:
> Add a topology description to the virtio-iommu driver and enable x86
> platforms.
> 
> Since [v2] we have made some progress on adding ACPI support for
> virtio-iommu, which is the preferred boot method on x86. It will be a
> new vendor-agnostic table describing para-virtual topologies in a
> minimal format. However some platforms don't use either ACPI or DT for
> booting (for example microvm), and will need the alternative topology
> description method proposed here. In addition, since the process to get
> a new ACPI table will take a long time, this provides a boot method even
> to ACPI-based platforms, if only temporarily for testing and
> development.
> 
> v3:
> * Add patch 1 that moves virtio-iommu to a subfolder.
> * Split the rest:
>   * Patch 2 adds topology-helper.c, which will be shared with the ACPI
> support.
>   * Patch 4 adds definitions.
>   * Patch 5 adds parser in topology.c.
> * Address other comments.
> 
> Linux and QEMU patches available at:
> https://jpbrucker.net/git/linux virtio-iommu/devel
> https://jpbrucker.net/git/qemu virtio-iommu/devel

I'm parking this work again, until we make progress on the ACPI table, or
until a platform without ACPI and DT needs it. Until then, I've pushed v4
to my virtio-iommu/topo branch and will keep it rebased on master.

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 5/6] iommu/virtio: Support topology description in config space

2020-09-25 Thread Jean-Philippe Brucker
On Thu, Sep 24, 2020 at 10:22:03AM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 21, 2020 at 03:15:39PM +0200, Jean-Philippe Brucker wrote:
> > Platforms without device-tree nor ACPI can provide a topology
> > description embedded into the virtio config space. Parse it.
> > 
> > Use PCI FIXUP to probe the config space early, because we need to
> > discover the topology before any DMA configuration takes place, and the
> > virtio driver may be loaded much later. Since we discover the topology
> > description when probing the PCI hierarchy, the virtual IOMMU cannot
> > manage other platform devices discovered earlier.
> 
> > +struct viommu_cap_config {
> > +   u8 bar;
> > +   u32 length; /* structure size */
> > +   u32 offset; /* structure offset within the bar */
> 
> s/the bar/the BAR/ (to match comment below).
> 
> > +static void viommu_pci_parse_topology(struct pci_dev *dev)
> > +{
> > +   int ret;
> > +   u32 features;
> > +   void __iomem *regs, *common_regs;
> > +   struct viommu_cap_config cap = {0};
> > +   struct virtio_pci_common_cfg __iomem *common_cfg;
> > +
> > +   /*
> > +* The virtio infrastructure might not be loaded at this point. We need
> > +* to access the BARs ourselves.
> > +*/
> > +   ret = viommu_pci_find_capability(dev, VIRTIO_PCI_CAP_COMMON_CFG, );
> > +   if (!ret) {
> > +   pci_warn(dev, "common capability not found\n");
> 
> Is the lack of this capability really an error, i.e., is this
> pci_warn() or pci_info()?  The "device doesn't have topology
> description" below is only pci_dbg(), which suggests that we can live
> without this.

At this point we know that this is a (modern) virtio-pci device which,
according to the virtio 1.0 specification, must have this capability. So
this is definitely an error, but the topology description is an optional
feature.

> 
> Maybe a hint about what "common capability" means?

Yes, "virtio-pci common configuration capability" would be more
appropriate

> 
> > +   return;
> > +   }
> > +
> > +   if (pci_enable_device_mem(dev))
> > +   return;
> > +
> > +   common_regs = pci_iomap(dev, cap.bar, 0);
> > +   if (!common_regs)
> > +   return;
> > +
> > +   common_cfg = common_regs + cap.offset;
> > +
> > +   /* Perform the init sequence before we can read the config */
> > +   ret = viommu_pci_reset(common_cfg);
> 
> I guess this is some special device-specific reset, not any kind of
> standard PCI reset?

Yes it's the virtio reset - writing 0 to the status register in the BAR.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-09-24 Thread Jean-Philippe Brucker
On Thu, Sep 24, 2020 at 12:02:55PM +0200, Joerg Roedel wrote:
> On Thu, Sep 24, 2020 at 05:38:13AM -0400, Michael S. Tsirkin wrote:
> > On Thu, Sep 24, 2020 at 11:21:29AM +0200, Joerg Roedel wrote:
> > > On Thu, Sep 24, 2020 at 05:00:35AM -0400, Michael S. Tsirkin wrote:
> > > > OK so this looks good. Can you pls repost with the minor tweak
> > > > suggested and all acks included, and I will queue this?
> > > 
> > > My NACK still stands, as long as a few questions are open:
> > > 
> > >   1) The format used here will be the same as in the ACPI table? I
> > >  think the answer to this questions must be Yes, so this leads
> > >  to the real question:
> > 
> > I am not sure it's a must.
> 
> It is, having only one parser for the ACPI and MMIO descriptions was one
> of the selling points for MMIO in past discussions and I think it makes
> sense to keep them in sync.

It's not possible to use exactly the same code for parsing. The access
methods are different (need to deal with port-IO for built-in description
on PCI, for example) and more importantly, the structure is different as
well. The ACPI table needs nodes for virtio-iommu while the built-in
description is contained in the virtio-iommu itself. So the endpoint nodes
point to virtio-iommu node on ACPI, while they don't need a pointer on the
built-in desc. I kept as much as possible common in structures and
implementation, but in the end we still need about 200 unique lines on
each side.

Thanks,
Jean

> 
> > We can always tweak the parser if there are slight differences
> > between ACPI and virtio formats.
> 
> There is no guarantee that there only need to be "tweaks" until the
> ACPI table format is stablized.
> 
> Regards,
> 
>   Joerg
> 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 5/6] iommu/virtio: Support topology description in config space

2020-09-24 Thread Jean-Philippe Brucker
On Fri, Sep 04, 2020 at 06:05:35PM +0200, Auger Eric wrote:
> > +static inline int viommu_pci_find_capability(struct pci_dev *dev, u8 
> > cfg_type,
> > +struct viommu_cap_config *cap)
> not sure the inline is useful here

No, I'll drop it

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 2/6] iommu/virtio: Add topology helpers

2020-09-24 Thread Jean-Philippe Brucker
On Fri, Sep 04, 2020 at 06:22:12PM +0200, Auger Eric wrote:
> > +/**
> > + * virt_dma_configure - Configure DMA of virtualized devices
> > + * @dev: the endpoint
> > + *
> > + * Setup the DMA and IOMMU ops of a virtual device, for platforms without 
> > DT or
> > + * ACPI.
> > + *
> > + * Return: -EPROBE_DEFER if the device is managed by an IOMMU that hasn't 
> > been
> > + *   probed yet, 0 otherwise
> > + */
> > +int virt_dma_configure(struct device *dev)
> > +{
> > +   const struct iommu_ops *iommu_ops;
> > +
> > +   iommu_ops = virt_iommu_setup(dev);
> > +   if (IS_ERR_OR_NULL(iommu_ops)) {
> > +   int ret = PTR_ERR(iommu_ops);
> > +
> > +   if (ret == -EPROBE_DEFER || ret == 0)
> > +   return ret;
> > +   dev_err(dev, "error %d while setting up virt IOMMU\n", ret);
> > +   return 0;
> why do we return 0 here?

So we can fall back to another method (ACPI or DT) if available

> Besides
> 
> Reviewed-by: Eric Auger 

Thanks!
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-08-27 Thread Jean-Philippe Brucker
On Wed, Aug 26, 2020 at 09:26:02AM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 21, 2020 at 03:15:34PM +0200, Jean-Philippe Brucker wrote:
> > Add a topology description to the virtio-iommu driver and enable x86
> > platforms.
> > 
> > Since [v2] we have made some progress on adding ACPI support for
> > virtio-iommu, which is the preferred boot method on x86. It will be a
> > new vendor-agnostic table describing para-virtual topologies in a
> > minimal format. However some platforms don't use either ACPI or DT for
> > booting (for example microvm), and will need the alternative topology
> > description method proposed here. In addition, since the process to get
> > a new ACPI table will take a long time, this provides a boot method even
> > to ACPI-based platforms, if only temporarily for testing and
> > development.
> 
> OK should I park this in next now? Seems appropriate ...

Yes that sounds like a good idea. It could uncover new bugs since there is
more automated testing happening for x86.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 4/6] iommu/virtio: Add topology definitions

2020-08-21 Thread Jean-Philippe Brucker
Add struct definitions for describing endpoints managed by the
virtio-iommu. When VIRTIO_IOMMU_F_TOPOLOGY is offered, an array of
virtio_iommu_topo_* structures in config space describes the endpoints,
identified either by their PCI BDF or their physical MMIO address.

Signed-off-by: Jean-Philippe Brucker 
---
 include/uapi/linux/virtio_iommu.h | 44 +++
 1 file changed, 44 insertions(+)

diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..70cba30644d5 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -16,6 +16,7 @@
 #define VIRTIO_IOMMU_F_BYPASS  3
 #define VIRTIO_IOMMU_F_PROBE   4
 #define VIRTIO_IOMMU_F_MMIO5
+#define VIRTIO_IOMMU_F_TOPOLOGY6
 
 struct virtio_iommu_range_64 {
__le64  start;
@@ -27,6 +28,17 @@ struct virtio_iommu_range_32 {
__le32  end;
 };
 
+struct virtio_iommu_topo_config {
+   /* Number of topology description structures */
+   __le16  count;
+   /*
+* Offset to the first topology description structure
+* (virtio_iommu_topo_*) from the start of the virtio_iommu config
+* space. Aligned on 8 bytes.
+*/
+   __le16  offset;
+};
+
 struct virtio_iommu_config {
/* Supported page sizes */
__le64  page_size_mask;
@@ -36,6 +48,38 @@ struct virtio_iommu_config {
struct virtio_iommu_range_32domain_range;
/* Probe buffer size */
__le32  probe_size;
+   struct virtio_iommu_topo_config topo_config;
+};
+
+#define VIRTIO_IOMMU_TOPO_PCI_RANGE0x1
+#define VIRTIO_IOMMU_TOPO_MMIO 0x2
+
+struct virtio_iommu_topo_pci_range {
+   /* VIRTIO_IOMMU_TOPO_PCI_RANGE */
+   __u8type;
+   __u8reserved;
+   /* Length of this structure */
+   __le16  length;
+   /* First endpoint ID in the range */
+   __le32  endpoint_start;
+   /* PCI domain number */
+   __le16  segment;
+   /* PCI Bus:Device.Function range */
+   __le16  bdf_start;
+   __le16  bdf_end;
+   __le16  padding;
+};
+
+struct virtio_iommu_topo_mmio {
+   /* VIRTIO_IOMMU_TOPO_MMIO */
+   __u8type;
+   __u8reserved;
+   /* Length of this structure */
+   __le16  length;
+   /* Endpoint ID */
+   __le32  endpoint;
+   /* Address of the first MMIO region */
+   __le64  address;
 };
 
 /* Request types */
-- 
2.28.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 1/6] iommu/virtio: Move to drivers/iommu/virtio/

2020-08-21 Thread Jean-Philippe Brucker
Before adding new files to the virtio-iommu driver, move it to its own
subfolder, similarly to other IOMMU drivers.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Makefile| 3 +--
 drivers/iommu/virtio/Makefile | 2 ++
 drivers/iommu/{ => virtio}/virtio-iommu.c | 0
 MAINTAINERS   | 2 +-
 4 files changed, 4 insertions(+), 3 deletions(-)
 create mode 100644 drivers/iommu/virtio/Makefile
 rename drivers/iommu/{ => virtio}/virtio-iommu.c (100%)

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 11f1771104f3..fc7523042512 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ virtio/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
@@ -26,4 +26,3 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
-obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
new file mode 100644
index ..279368fcc074
--- /dev/null
+++ b/drivers/iommu/virtio/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio/virtio-iommu.c
similarity index 100%
rename from drivers/iommu/virtio-iommu.c
rename to drivers/iommu/virtio/virtio-iommu.c
diff --git a/MAINTAINERS b/MAINTAINERS
index deaafb617361..3602b223c9b2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18451,7 +18451,7 @@ VIRTIO IOMMU DRIVER
 M:     Jean-Philippe Brucker 
 L: virtualizat...@lists.linux-foundation.org
 S: Maintained
-F: drivers/iommu/virtio-iommu.c
+F: drivers/iommu/virtio/
 F: include/uapi/linux/virtio_iommu.h
 
 VIRTIO MEM DRIVER
-- 
2.28.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 5/6] iommu/virtio: Support topology description in config space

2020-08-21 Thread Jean-Philippe Brucker
Platforms without device-tree nor ACPI can provide a topology
description embedded into the virtio config space. Parse it.

Use PCI FIXUP to probe the config space early, because we need to
discover the topology before any DMA configuration takes place, and the
virtio driver may be loaded much later. Since we discover the topology
description when probing the PCI hierarchy, the virtual IOMMU cannot
manage other platform devices discovered earlier.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig   |  12 ++
 drivers/iommu/virtio/Makefile   |   1 +
 drivers/iommu/virtio/topology.c | 259 
 3 files changed, 272 insertions(+)
 create mode 100644 drivers/iommu/virtio/topology.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e29ae50f7100..98d28fdbc19a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -394,4 +394,16 @@ config VIRTIO_IOMMU
 config VIRTIO_IOMMU_TOPOLOGY_HELPERS
bool
 
+config VIRTIO_IOMMU_TOPOLOGY
+   bool "Handle topology properties from the virtio-iommu"
+   depends on VIRTIO_IOMMU
+   depends on PCI
+   default y
+   select VIRTIO_IOMMU_TOPOLOGY_HELPERS
+   help
+ Enable early probing of virtio-iommu devices to detect the built-in
+ topology description.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
index b42ad47eac7e..1eda8ca1cbbf 100644
--- a/drivers/iommu/virtio/Makefile
+++ b/drivers/iommu/virtio/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += topology.o
 obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o
diff --git a/drivers/iommu/virtio/topology.c b/drivers/iommu/virtio/topology.c
new file mode 100644
index ..4923eec618b9
--- /dev/null
+++ b/drivers/iommu/virtio/topology.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "topology-helpers.h"
+
+struct viommu_cap_config {
+   u8 bar;
+   u32 length; /* structure size */
+   u32 offset; /* structure offset within the bar */
+};
+
+struct viommu_topo_header {
+   u8 type;
+   u8 reserved;
+   u16 length;
+};
+
+static struct virt_topo_endpoint *
+viommu_parse_node(void __iomem *buf, size_t len)
+{
+   int ret = -EINVAL;
+   union {
+   struct viommu_topo_header hdr;
+   struct virtio_iommu_topo_pci_range pci;
+   struct virtio_iommu_topo_mmio mmio;
+   } __iomem *cfg = buf;
+   struct virt_topo_endpoint *spec;
+
+   spec = kzalloc(sizeof(*spec), GFP_KERNEL);
+   if (!spec)
+   return ERR_PTR(-ENOMEM);
+
+   switch (ioread8(>hdr.type)) {
+   case VIRTIO_IOMMU_TOPO_PCI_RANGE:
+   if (len < sizeof(cfg->pci))
+   goto err_free;
+
+   spec->dev_id.type = VIRT_TOPO_DEV_TYPE_PCI;
+   spec->dev_id.segment = ioread16(>pci.segment);
+   spec->dev_id.bdf_start = ioread16(>pci.bdf_start);
+   spec->dev_id.bdf_end = ioread16(>pci.bdf_end);
+   spec->endpoint_id = ioread32(>pci.endpoint_start);
+   break;
+   case VIRTIO_IOMMU_TOPO_MMIO:
+   if (len < sizeof(cfg->mmio))
+   goto err_free;
+
+   spec->dev_id.type = VIRT_TOPO_DEV_TYPE_MMIO;
+   spec->dev_id.base = ioread64(>mmio.address);
+   spec->endpoint_id = ioread32(>mmio.endpoint);
+   break;
+   default:
+   pr_warn("unhandled format 0x%x\n", ioread8(>hdr.type));
+   ret = 0;
+   goto err_free;
+   }
+   return spec;
+
+err_free:
+   kfree(spec);
+   return ERR_PTR(ret);
+}
+
+static int viommu_parse_topology(struct device *dev,
+struct virtio_iommu_config __iomem *cfg,
+size_t max_len)
+{
+   int ret;
+   u16 len;
+   size_t i;
+   LIST_HEAD(endpoints);
+   size_t offset, count;
+   struct virt_topo_iommu *viommu;
+   struct virt_topo_endpoint *ep, *next;
+   struct viommu_topo_header __iomem *cur;
+
+   offset = ioread16(>topo_config.offset);
+   count = ioread16(>topo_config.count);
+   if (!offset || !count)
+   return 0;
+
+   viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+   if (!viommu)
+   return -ENOMEM;
+
+   viommu->dev = dev;
+
+   for (i = 0; i < count; i++, offset += len) {
+   if (offset + sizeof(*cur) > max_len) {
+   ret

[virtio-dev] [PATCH v3 0/6] Add virtio-iommu built-in topology

2020-08-21 Thread Jean-Philippe Brucker
Add a topology description to the virtio-iommu driver and enable x86
platforms.

Since [v2] we have made some progress on adding ACPI support for
virtio-iommu, which is the preferred boot method on x86. It will be a
new vendor-agnostic table describing para-virtual topologies in a
minimal format. However some platforms don't use either ACPI or DT for
booting (for example microvm), and will need the alternative topology
description method proposed here. In addition, since the process to get
a new ACPI table will take a long time, this provides a boot method even
to ACPI-based platforms, if only temporarily for testing and
development.

v3:
* Add patch 1 that moves virtio-iommu to a subfolder.
* Split the rest:
  * Patch 2 adds topology-helper.c, which will be shared with the ACPI
support.
  * Patch 4 adds definitions.
  * Patch 5 adds parser in topology.c.
* Address other comments.

Linux and QEMU patches available at:
https://jpbrucker.net/git/linux virtio-iommu/devel
https://jpbrucker.net/git/qemu virtio-iommu/devel

[spec] https://lists.oasis-open.org/archives/virtio-dev/202008/msg00067.html
[v2] 
https://lore.kernel.org/linux-iommu/20200228172537.377327-1-jean-phili...@linaro.org/
[v1] 
https://lore.kernel.org/linux-iommu/20200214160413.1475396-1-jean-phili...@linaro.org/
[rfc] 
https://lore.kernel.org/linux-iommu/20191122105000.800410-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  iommu/virtio: Move to drivers/iommu/virtio/
  iommu/virtio: Add topology helpers
  PCI: Add DMA configuration for virtual platforms
  iommu/virtio: Add topology definitions
  iommu/virtio: Support topology description in config space
  iommu/virtio: Enable x86 support

 drivers/iommu/Kconfig |  18 +-
 drivers/iommu/Makefile|   3 +-
 drivers/iommu/virtio/Makefile |   4 +
 drivers/iommu/virtio/topology-helpers.h   |  50 +
 include/linux/virt_iommu.h|  15 ++
 include/uapi/linux/virtio_iommu.h |  44 
 drivers/iommu/virtio/topology-helpers.c   | 196 
 drivers/iommu/virtio/topology.c   | 259 ++
 drivers/iommu/{ => virtio}/virtio-iommu.c |   4 +
 drivers/pci/pci-driver.c  |   5 +
 MAINTAINERS   |   3 +-
 11 files changed, 597 insertions(+), 4 deletions(-)
 create mode 100644 drivers/iommu/virtio/Makefile
 create mode 100644 drivers/iommu/virtio/topology-helpers.h
 create mode 100644 include/linux/virt_iommu.h
 create mode 100644 drivers/iommu/virtio/topology-helpers.c
 create mode 100644 drivers/iommu/virtio/topology.c
 rename drivers/iommu/{ => virtio}/virtio-iommu.c (99%)

-- 
2.28.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 3/6] PCI: Add DMA configuration for virtual platforms

2020-08-21 Thread Jean-Philippe Brucker
Hardware platforms usually describe the IOMMU topology using either
device-tree pointers or vendor-specific ACPI tables.  For virtual
platforms that don't provide a device-tree, the virtio-iommu device
contains a description of the endpoints it manages.  That information
allows us to probe endpoints after the IOMMU is probed (possibly as late
as userspace modprobe), provided it is discovered early enough.

Add a hook to pci_dma_configure(), which returns -EPROBE_DEFER if the
endpoint is managed by a vIOMMU that will be loaded later, or 0 in any
other case to avoid disturbing the normal DMA configuration methods.
When CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS isn't selected, the call to
virt_dma_configure() is compiled out.

As long as the information is consistent, platforms can provide both a
device-tree and a built-in topology, and the IOMMU infrastructure is
able to deal with multiple DMA configuration methods.

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/pci-driver.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 449466f71040..dbe9d33606b0 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "pci.h"
 #include "pcie/portdrv.h"
 
@@ -1605,6 +1606,10 @@ static int pci_dma_configure(struct device *dev)
struct device *bridge;
int ret = 0;
 
+   ret = virt_dma_configure(dev);
+   if (ret)
+   return ret;
+
bridge = pci_get_host_bridge_device(to_pci_dev(dev));
 
if (IS_ENABLED(CONFIG_OF) && bridge->parent &&
-- 
2.28.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 2/6] iommu/virtio: Add topology helpers

2020-08-21 Thread Jean-Philippe Brucker
To support topology description from ACPI and from the builtin
description, add helpers to keep track of I/O topology descriptors.

To ease re-use of the helpers by other drivers and future ACPI
extensions, use the "virt_" prefix rather than "virtio_" when naming
structs and functions.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig   |   3 +
 drivers/iommu/virtio/Makefile   |   1 +
 drivers/iommu/virtio/topology-helpers.h |  50 ++
 include/linux/virt_iommu.h  |  15 ++
 drivers/iommu/virtio/topology-helpers.c | 196 
 drivers/iommu/virtio/virtio-iommu.c |   4 +
 MAINTAINERS |   1 +
 7 files changed, 270 insertions(+)
 create mode 100644 drivers/iommu/virtio/topology-helpers.h
 create mode 100644 include/linux/virt_iommu.h
 create mode 100644 drivers/iommu/virtio/topology-helpers.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index bef5d75e306b..e29ae50f7100 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -391,4 +391,7 @@ config VIRTIO_IOMMU
 
  Say Y here if you intend to run this kernel as a guest.
 
+config VIRTIO_IOMMU_TOPOLOGY_HELPERS
+   bool
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
index 279368fcc074..b42ad47eac7e 100644
--- a/drivers/iommu/virtio/Makefile
+++ b/drivers/iommu/virtio/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o
diff --git a/drivers/iommu/virtio/topology-helpers.h 
b/drivers/iommu/virtio/topology-helpers.h
new file mode 100644
index ..436ca6a900c5
--- /dev/null
+++ b/drivers/iommu/virtio/topology-helpers.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef TOPOLOGY_HELPERS_H_
+#define TOPOLOGY_HELPERS_H_
+
+#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS
+
+/* Identify a device node in the topology */
+struct virt_topo_dev_id {
+   unsigned inttype;
+#define VIRT_TOPO_DEV_TYPE_PCI 1
+#define VIRT_TOPO_DEV_TYPE_MMIO2
+   union {
+   /* PCI endpoint or range */
+   struct {
+   u16 segment;
+   u16 bdf_start;
+   u16 bdf_end;
+   };
+   /* MMIO region */
+   u64 base;
+   };
+};
+
+/* Specification of an IOMMU */
+struct virt_topo_iommu {
+   struct virt_topo_dev_id dev_id;
+   struct device   *dev; /* transport device */
+   struct fwnode_handle*fwnode;
+   struct iommu_ops*ops;
+   struct list_headlist;
+};
+
+/* Specification of an endpoint */
+struct virt_topo_endpoint {
+   struct virt_topo_dev_id dev_id;
+   u32 endpoint_id;
+   struct virt_topo_iommu  *viommu;
+   struct list_headlist;
+};
+
+void virt_topo_add_endpoint(struct virt_topo_endpoint *ep);
+void virt_topo_add_iommu(struct virt_topo_iommu *viommu);
+
+void virt_topo_set_iommu_ops(struct device *dev, struct iommu_ops *ops);
+
+#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+static inline void virt_topo_set_iommu_ops(struct device *dev, struct 
iommu_ops *ops)
+{ }
+#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+#endif /* TOPOLOGY_HELPERS_H_ */
diff --git a/include/linux/virt_iommu.h b/include/linux/virt_iommu.h
new file mode 100644
index ..17d2bd4732e0
--- /dev/null
+++ b/include/linux/virt_iommu.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef VIRT_IOMMU_H_
+#define VIRT_IOMMU_H_
+
+#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS
+int virt_dma_configure(struct device *dev);
+
+#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+static inline int virt_dma_configure(struct device *dev)
+{
+   /* Don't disturb the normal DMA configuration methods */
+   return 0;
+}
+#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+#endif /* VIRT_IOMMU_H_ */
diff --git a/drivers/iommu/virtio/topology-helpers.c 
b/drivers/iommu/virtio/topology-helpers.c
new file mode 100644
index ..8815e3a5d431
--- /dev/null
+++ b/drivers/iommu/virtio/topology-helpers.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "topology-helpers.h"
+
+static LIST_HEAD(viommus);
+static LIST_HEAD(pci_endpoints);
+static LIST_HEAD(mmio_endpoints);
+static DEFINE_MUTEX(viommus_lock);
+
+static bool virt_topo_device_match(struct device *dev,
+  struct virt_topo_dev_id *id)
+{
+   if (id->type == VIRT_TOPO_DEV_TYPE_PCI && dev_is_pci(dev)) {
+ 

[virtio-dev] [PATCH v3 6/6] iommu/virtio: Enable x86 support

2020-08-21 Thread Jean-Philippe Brucker
With the built-in topology description in place, x86 platforms can now
use the virtio-iommu.

Architectures that use the generic iommu_dma_ops should normally select
CONFIG_IOMMU_DMA themselves (from arch/*/Kconfig). Since not all x86
drivers have been converted yet, it's currently up to the IOMMU Kconfig
to select it.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 98d28fdbc19a..d7cf158745eb 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -383,8 +383,9 @@ config HYPERV_IOMMU
 config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
depends on VIRTIO
-   depends on ARM64
+   depends on (ARM64 || X86)
select IOMMU_API
+   select IOMMU_DMA if X86
select INTERVAL_TREE
help
  Para-virtualised IOMMU driver with virtio.
-- 
2.28.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH virtio v2] virtio-iommu: Add built-in topology description

2020-08-20 Thread Jean-Philippe Brucker
Add a simple method to describe the IOMMU topology in the config space,
guarded by a new feature bit. A list of capabilities in the config space
describes the devices managed by the IOMMU and their endpoint IDs.

As outlined in paragraph #1 below, the method used to describe I/O
topology depends on the platform. Device-Tree based platforms use the
"iommus" and "iommu-map" DT properties, while ACPI based platforms will
use the VIOT table (whose format is compatible with this built-in
description, but is under discussion). This adds a method for
lightweight platforms that don't use either DT or ACPI.

Signed-off-by: Jean-Philippe Brucker 
---
v2:
* Precise what \field{length} should contain (Eric)
* Rename num_items -> count
v1: https://lists.oasis-open.org/archives/virtio-dev/202007/msg00023.html
---
 virtio-iommu.tex | 102 +++
 1 file changed, 102 insertions(+)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..82254b1 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_TOPOLOGY (6)]
+  Topology description is available at \field{topo_config}.
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -97,6 +100,10 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 le32 end;
   } domain_range;
   le32 probe_size;
+  struct virtio_iommu_topo_config {
+le16 count;
+le16 offset;
+  } topo_config;
 };
 \end{lstlisting}
 
@@ -139,6 +146,101 @@ \subsection{Device initialization}\label{sec:Device Types 
/ IOMMU Device / Devic
 If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
 device SHOULD NOT let endpoints access the guest-physical address space.
 
+\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU 
Device / Device initialization / topology}
+
+The device manages memory accesses from endpoints, identified by endpoint
+IDs. The driver can discover which endpoint ID corresponds to an endpoint
+using a method specific to the platform. Platforms described with device
+tree use the \texttt{iommus} and \texttt{iommu-map} properties embedded
+into device nodes for this purpose. Platforms described with ACPI use a
+table such as the Virtual I/O Translation Table (VIOT). Platforms that do
+not support either device tree or ACPI may embed a minimalistic
+description in the device configuration space, declared by the
+VIRTIO_IOMMU_F_TOPOLOGY feature.
+
+A disadvantage of describing the topology from within the device is the
+lack of initialization ordering information. Out-of-band descriptions such
+as device tree and ACPI let the operating system know about device
+dependencies so that it can initialize supplier devices (IOMMUs) before
+their consumers (endpoints). Platforms using the VIRTIO_IOMMU_F_TOPOLOGY
+feature have to communicate the device dependency in another way.
+
+If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config}
+describes the topology description array. \field{topo_config.offset} is
+the offset between the beginning of the device-specific configuration
+space (struct virtio_iommu_config) and the array. The array is composed of
+\field{topo_config.count} topology structures. A topology structure
+defines the endpoint ID of one or more endpoints managed by the
+virtio-iommu device. Each structure has a \field{length} field, which
+defines the offset from the beginning of the structure to the beginning of
+the next one. Each structure starts with a \field{type} byte:
+
+\begin{description}
+  \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (1)] struct virtio_iommu_topo_pci_range
+  \item[VIRTIO_IOMMU_TOPO_MMIO (2)] struct virtio_iommu_topo_mmio
+\end{description}
+
+To stay binary compatible with ACPI VIOT, types 3 and 4 are currently
+reserved.
+
+\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device 
initialization / topology / PCI range}
+
+\begin{lstlisting}
+struct virtio_iommu_topo_pci_range {
+  u8 type;
+  u8 reserved;
+  le16 length;
+  le32 endpoint_start;
+  le16 segment;
+  le16 bdf_start;
+  le16 bdf_end;
+  le16 padding;
+};
+\end{lstlisting}
+
+The PCI range structure describes the endpoint IDs of a series of PCI
+devices identified by their BDF (Bus-Device-Function) number.
+
+\begin{description}
+  \item[\field{type}] VIRTIO_IOMMU_TOPO_PCI_RANGE.
+  \item[\field{length}] Length of the structure in bytes.
+  \item[\field{endpoint_start}] First endpoint ID.
+  \item[\field{segment}] Identifier of the PCI hierarchy. Sometimes called
+domain number.
+  \item[\field{bdf_start}] First BDF in the range.
+  \item[\field{bdf_end}] Last BDF in the range.
+\end{description}
+
+The correspondence between a PCI BDF in the range
+[ bdf_s

[virtio-dev] Re: [PATCH virtio] virtio-iommu: Add built-in topology description

2020-07-17 Thread Jean-Philippe Brucker
On Wed, Jul 15, 2020 at 10:52:30AM +0200, Auger Eric wrote:
> > +If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config}
> > +describes the topology description array. \field{topo_config.offset} is
> > +the offset between the beginning of the device-specific configuration
> > +space (struct virtio_iommu_config) and the array. The array is composed of
> > +\field{topo_config.num_items} topology structures. A topology structure
> > +defines the endpoint ID of one or more endpoints managed by the
> > +virtio-iommu device. Each structure has a \field{length} field, which
> > +defines the offset to the next structure.
> 
> If this is an offset, then you may precise what the offset is relative
> to (I guess the beginning of the current topology structure).

Ok, changing it to "which defines the offset from the beginning of the
structure to the beginning of the next one".

> Otherwise
> the structure only has fixed size elements so is its length expected to
> vary?

Length is only for backward compatibility: if a future version of the spec
appends new elements to the structure, then the current driver can ignore
those elements and jump to the next structure (like argsz in VFIO, and the
PROBE property length).

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH virtio] virtio-iommu: Add built-in topology description

2020-07-09 Thread Jean-Philippe Brucker
Add a simple method to describe the IOMMU topology in the config space,
guarded by a new feature bit. A list of capabilities in the config space
describes the devices managed by the IOMMU and their endpoint IDs.

As outlined in paragraph #1 below, the method used to describe I/O
topology depends on the platform. Device-Tree based platforms use the
"iommus" and "iommu-map" DT properties, while ACPI based platforms will
use the VIOT table (whose format is compatible with this built-in
description, but is under discussion). This adds a method for
lightweight platforms that don't use either DT or ACPI.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 102 +++
 1 file changed, 102 insertions(+)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..a1de2a8 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_TOPOLOGY (6)]
+  Topology description is available at \field{topo_config}.
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -97,6 +100,10 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 le32 end;
   } domain_range;
   le32 probe_size;
+  struct virtio_iommu_topo_config {
+le16 num_items;
+le16 offset;
+  } topo_config;
 };
 \end{lstlisting}
 
@@ -139,6 +146,101 @@ \subsection{Device initialization}\label{sec:Device Types 
/ IOMMU Device / Devic
 If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
 device SHOULD NOT let endpoints access the guest-physical address space.
 
+\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU 
Device / Device initialization / topology}
+
+The device manages memory accesses from endpoints, identified by endpoint
+IDs. The driver can discover which endpoint ID corresponds to an endpoint
+using a method specific to the platform. Platforms described with device
+tree use the \texttt{iommus} and \texttt{iommu-map} properties embedded
+into device nodes for this purpose. Platforms described with ACPI use a
+table such as the Virtual I/O Translation Table (VIOT). Platforms that do
+not support either device tree or ACPI may embed a minimalistic
+description in the device configuration space, declared by the
+VIRTIO_IOMMU_F_TOPOLOGY feature.
+
+A disadvantage of describing the topology from within the device is the
+lack of initialization ordering information. Out-of-band descriptions such
+as device tree and ACPI let the operating system know about device
+dependencies so that it can initialize supplier devices (IOMMUs) before
+their consumers (endpoints). Platforms using the VIRTIO_IOMMU_F_TOPOLOGY
+feature have to communicate the device dependency in another way.
+
+If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_config}
+describes the topology description array. \field{topo_config.offset} is
+the offset between the beginning of the device-specific configuration
+space (struct virtio_iommu_config) and the array. The array is composed of
+\field{topo_config.num_items} topology structures. A topology structure
+defines the endpoint ID of one or more endpoints managed by the
+virtio-iommu device. Each structure has a \field{length} field, which
+defines the offset to the next structure. Each structure starts with a
+\field{type} byte:
+
+\begin{description}
+  \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (1)] struct virtio_iommu_topo_pci_range
+  \item[VIRTIO_IOMMU_TOPO_MMIO (2)] struct virtio_iommu_topo_mmio
+\end{description}
+
+To stay binary compatible with ACPI VIOT, types 3 and 4 are currently
+reserved.
+
+\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device 
initialization / topology / PCI range}
+
+\begin{lstlisting}
+struct virtio_iommu_topo_pci_range {
+  u8 type;
+  u8 reserved;
+  le16 length;
+  le32 endpoint_start;
+  le16 segment;
+  le16 bdf_start;
+  le16 bdf_end;
+  le16 padding;
+};
+\end{lstlisting}
+
+The PCI range structure describes the endpoint IDs of a series of PCI
+devices identified by their BDF (Bus-Device-Function) number.
+
+\begin{description}
+  \item[\field{type}] VIRTIO_IOMMU_TOPO_PCI_RANGE.
+  \item[\field{length}] Length of the structure in bytes.
+  \item[\field{endpoint_start}] First endpoint ID.
+  \item[\field{segment}] Identifier of the PCI hierarchy. Sometimes called
+domain number.
+  \item[\field{bdf_start}] First BDF in the range.
+  \item[\field{bdf_end}] Last BDF in the range.
+\end{description}
+
+The correspondence between a PCI BDF in the range
+[ bdf_start; bdf_end ] and its endpoint IDs is a linear transformation:
+endpoint_id = bdf - bdf_start + endpoint_start.
+
+\paragraph{Single endpoint}\label{sec:Device Types / IOMMU Device / Device 
initialization

[virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources

2020-06-19 Thread Jean-Philippe Brucker
On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
[...]
> Option 2 - Additional Platform Data
> ===
> 
> This would be extending using something like device tree or ACPI tables
> which could define regions of memory that would inform the low level
> memory allocation routines where they could allocate from. There is
> already of the concept of "dma-ranges" in device tree which can be a
> per-device property which defines the region of space that is DMA
> coherent for a device.

They are regions that are accessible to a device for DMA, coherency is
described through other methods.

Thinking more about this, dma-ranges (and ACPI _DMA) don't exactly
describe what you need. They describe addressing limitation from a
bridge's perspective, for example from the PCI root complex. So there are
at least two issues:

1. They apply to the whole downstream bus, so you can't define per-device
   DMA windows. Although with PCIe I suppose you could put one on each
   downstream port.

2. More importantly, they only describe addressing limitations locally.
   When the device directly accesses memory, it emits guest-physical
   addresses (GPA) so you can use DMA ranges to describe which memory it
   can access. However, if there is an IOMMU in between, the device emits
   I/O virtual addresses (IOVA), which are translated by the IOMMU into
   GPA. In this case the DMA ranges apply to the IOVA, and there doesn't
   exist a way to describe limitations on the GPA.

There are other mechanisms describing addressing limitations such as
Intel's RMRR, but those also apply to IOVAs as far as I know.

Thanks,
Jean

> 
> There is the question of how you tie regions declared here with the
> eventual instantiating of the VirtIO devices?
> 
> For a fully distributed set of backends (one backend per device per
> worker VM) you would need several different regions. Would each region
> be tied to each device or just a set of areas the guest would allocate
> from in sequence?

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v6] iommu/virtio: Use page size bitmap supported by endpoint

2020-05-14 Thread Jean-Philippe Brucker
On Thu, May 14, 2020 at 05:31:00AM -0400, Michael S. Tsirkin wrote:
> On Thu, May 14, 2020 at 01:22:37PM +0530, Bharat Bhushan wrote:
> > Different endpoint can support different page size, probe
> > endpoint if it supports specific page size otherwise use
> > global page sizes.
> > 
> > Device attached to domain should support a minimum of
> > domain supported page sizes. If device supports more
> > than domain supported page sizes then device is limited
> > to use domain supported page sizes only.
> 
> OK so I am just trying to figure it out.
> Before the patch, we always use the domain supported page sizes
> right?
> 
> With the patch, we still do, but we also probe and
> validate that device supports all domain page sizes,
> if it does not then we fail to attach the device.

Generally there is one endpoint per domain. Linux creates the domains and
decides which endpoint goes in which domain. It puts multiple endpoints in
a domain in two cases:

* If endpoints cannot be isolated from each others by the IOMMU, for
  example if ACS isolation isn't enabled in PCIe. In that case endpoints
  are in the same IOMMU group, and therefore contained in the same domain.
  This is more of a quirk for broken hardware, and isn't much of a concern
  for virtualization because it's easy for the hypervisor to present
  endpoints isolated from each others.

* If userspace wants to put endpoints in the same VFIO container, then
  VFIO first attempts to put them in the same IOMMU domain, and falls back
  to multiple domains if that fails. That case is just a luxury and we
  shouldn't over-complicate the driver to cater for this.

So the attach checks don't need to be that complicated. Checking that the
page masks are exactly the same should be enough.

> This seems like a lot of effort for little benefit, can't
> hypervisor simply make sure endpoints support the
> iommu page sizes for us?

I tend to agree, it's not very likely that we'll have a configuration with
different page sizes between physical and virtual endpoints.

If there is a way for QEMU to simply reject VFIO devices that don't use
the same page mask as what's configured globally, let's do that instead of
introducing the page_size_mask property.

> > @@ -615,7 +636,7 @@ static int viommu_domain_finalise(struct 
> > viommu_endpoint *vdev,
> > struct viommu_dev *viommu = vdev->viommu;
> > struct viommu_domain *vdomain = to_viommu_domain(domain);
> >  
> > -   viommu_page_size = 1UL << __ffs(viommu->pgsize_bitmap);
> > +   viommu_page_size = 1UL << __ffs(vdev->pgsize_bitmap);
> > if (viommu_page_size > PAGE_SIZE) {
> > dev_err(vdev->dev,
> > "granule 0x%lx larger than system page size 0x%lx\n",
> 
> 
> Looks like this is messed up on 32 bit: e.g. 0x1 will try to do
> 1UL << -1, which is undefined behaviour. Which is btw already messed up
> wrt viommu->pgsize_bitmap, but that's not a reason to propagate
> the error.

Realistically we're not going to have a page granule larger than 4G, it's
going to be 4k or 64k. But we can add a check that truncates the
page_size_mask to 32-bit and makes sure that it's non-null.

> > +struct virtio_iommu_probe_pgsize_mask {
> > +   struct virtio_iommu_probe_property  head;
> > +   __u8reserved[4];
> > +   /* Same format as virtio_iommu_config::page_size_mask */
> 
> It's actually slightly different in that
> this must be a superset of domain page size mask, right?

No it overrides the global mask

> > +   __le64  pgsize_bitmap;

Bharat, please rename this to page_size_mask for consistency

Thanks,
Jean

> > +};
> > +
> >  #define VIRTIO_IOMMU_RESV_MEM_T_RESERVED   0
> >  #define VIRTIO_IOMMU_RESV_MEM_T_MSI1
> >  
> > -- 
> > 2.17.1
> 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v2] virtio-iommu: Add PAGE_SIZE_MASK property

2020-04-23 Thread Jean-Philippe Brucker
Add a PROBE property to declare the mapping granularity per endpoint.
The virtio-iommu device already has to declare a granule in its config
space, but when endpoints are behind different physical IOMMUs (for
example some pass-through and some PV endpoints), they may have
different mapping granules. This new property allows to override the
global page_size_mask for each endpoint. When the property isn't
reported for an endpoint, the global page size mask applies. The device
checks page alignment during MAP requests.

Signed-off-by: Jean-Philippe Brucker 
---
v1: https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg06035.html
v2 clarifies how overriding the global property works.
---
 virtio-iommu.tex | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..05f4fd6 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -216,6 +216,9 @@ \subsection{Device operations}\label{sec:Device Types / 
IOMMU Device / Device op
   Creating mappings aligned on large page sizes can improve performance
   since they require fewer page table and TLB entries.
 
+  The mask can be overriden for an endpoint by the PAGE_SIZE_MASK PROBE
+  property.
+
 \item If the VIRTIO_IOMMU_F_DOMAIN_RANGE feature is offered,
   \field{domain_range} describes the values supported in a \field{domain}
   field. If the feature is not offered, any \field{domain} value is valid.
@@ -664,7 +667,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / 
IOMMU Device / Device ope
 \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties}
 
 \begin{lstlisting}
-#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
+#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1
+#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK   2
 \end{lstlisting}
 
 \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device 
operations / PROBE properties / RESVMEM}
@@ -727,6 +731,38 @@ \subsubsection{PROBE properties}\label{sec:Device Types / 
IOMMU Device / Device
 The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions
 to affect any other component than the endpoint and the driver.
 
+\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties / PAGESIZEMASK}
+
+The PAGE_SIZE_MASK property overrides the global \field{page_size_mask}
+configuration for an endpoint.
+
+\begin{lstlisting}
+struct virtio_iommu_probe_page_size_mask {
+  struct virtio_iommu_probe_property head;
+  u8reserved[4];
+  le64  page_size_mask;
+};
+\end{lstlisting}
+
+The \field{page_size_mask} field behaves in the same way as the global
+\field{page_size_mask} field. The least significant bit describes the
+mapping granularity, and additional bits are hints (see \ref{sec:Device
+Types / IOMMU Device / Device operations}). When no PAGE_SIZE_MASK
+property is available for an endpoint, the global \field{page_size_mask}
+applies.
+
+\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGESIZEMASK}
+
+The driver MUST ignore \field{reserved}.
+
+\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGESIZEMASK}
+
+The device SHOULD set \field{reserved} to zero.
+
+The device MUST set at least one bit in \field{page_size_mask}, describing
+the page granularity. The device MAY set more than one bit in
+\field{page_size_mask}.
+
 \subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device 
operations / Fault reporting}
 
 The device can report translation faults and other significant
-- 
2.26.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property

2020-03-27 Thread Jean-Philippe Brucker
On Fri, Mar 27, 2020 at 12:05:42PM +0100, Auger Eric wrote:
> Hi Jean, Bharat,
> 
> On 3/27/20 10:35 AM, Jean-Philippe Brucker wrote:
> > On Fri, Mar 27, 2020 at 09:17:43AM +, Bharat Bhushan wrote:
> >>
> >> Sent again, somehow the email-address got corrupted.
> >>
> >>> -Original Message-
> >>> From: Bharat Bhushan
> >>> Sent: Friday, March 27, 2020 2:46 PM
> >>> To: @mx0a-0016f401.pphosted.com
> >>> Cc: Auger Eric ; virtio-dev@lists.oasis-open.org
> >>> Subject: RE: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> >>> PAGE_SIZE_MASK property
> >>>
> >>> Hi Jean,
> >>>
> >>>> -Original Message-
> >>>> From: Jean-Philippe Brucker 
> >>>> Sent: Friday, March 27, 2020 2:31 PM
> >>>> To: Bharat Bhushan 
> >>>> Cc: Auger Eric ;
> >>>> virtio-dev@lists.oasis-open.org
> >>>> Subject: Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> >>>> PAGE_SIZE_MASK property
> >>>>
> >>>> On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote:
> >>>>> Hi Jean,
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Auger Eric 
> >>>>>> Sent: Thursday, March 26, 2020 4:50 PM
> >>>>>> To: Jean-Philippe Brucker 
> >>>>>> Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan
> >>>>>> 
> >>>>>> Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> >>>>>> PAGE_SIZE_MASK property
> >>>>>>
> >>>>>> External Email
> >>>>>>
> >>>>>> --
> >>>>>> --
> >>>>>> --
> >>>>>> Hi Jean,
> >>>>>>
> >>>>>> On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote:
> >>>>>>> On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote:
> >>>>>>>> Hi Jean,
> >>>>>>>>
> >>>>>>>> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote:
> >>>>>>>>> Add a PROBE property to declare the mapping granularity per 
> >>>>>>>>> endpoint.
> >>>>>>>>> The virtio-iommu device already declares a granule in its
> >>>>>>>>> config space, but when endpoints are behind different physical
> >>>>>>>>> IOMMUs, they may have different mapping granules. This new
> >>>>>>>>> property allows to override the global page_size_mask for each 
> >>>>>>>>> endpoint.
> >>>>>>>>>
> >>>>>>>>> In the future it may be useful to describe more than one
> >>>>>>>>> page_size_mask for each endpoint, and allow them to negotiate
> >>>>>>>>> it during ATTACH. For example two masks could allow the driver
> >>>>>>>>> to choose between 4k and 64k granule, along with their
> >>>>>>>>> respective block mapping sizes. This could be added by
> >>>>>>>>> replacing \field{reserved} with an array
> >>>>>> length, for example.
> >>>>>>>> Sorry I don't get the use case where several page size bitmaps
> >>>>>>>> should be exposed.
> >>>>>>>
> >>>>>>> For a 4k granule you get block mappings of 2M and 1G. For a 64k
> >>>>>>> granule you get 512M and 4T block mappings. If you want to
> >>>>>>> communicate both options to the guest, you need two separate
> >>>>>>> masks, 0x40201000 and 0x4002001. Then the guest could choose
> >>>>>>> one of the granules during attach, if we add a flag to the
> >>>>>>> attach request. I'm not suggesting we do that now, just trying
> >>>>>>> to make sure it can be extended if anyone actually wants it.
> >>>>>>> Personally I don't think it's worth adding, especially given the
> >>>>>>> additional work required in
> >>>> the host.
> >>>>>> OK I get it now.
> >>>>>
> >>>

[virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property

2020-03-27 Thread Jean-Philippe Brucker
On Fri, Mar 27, 2020 at 09:17:43AM +, Bharat Bhushan wrote:
> 
> Sent again, somehow the email-address got corrupted.
> 
> > -Original Message-
> > From: Bharat Bhushan
> > Sent: Friday, March 27, 2020 2:46 PM
> > To: @mx0a-0016f401.pphosted.com
> > Cc: Auger Eric ; virtio-dev@lists.oasis-open.org
> > Subject: RE: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> > PAGE_SIZE_MASK property
> > 
> > Hi Jean,
> > 
> > > -Original Message-
> > > From: Jean-Philippe Brucker 
> > > Sent: Friday, March 27, 2020 2:31 PM
> > > To: Bharat Bhushan 
> > > Cc: Auger Eric ;
> > > virtio-dev@lists.oasis-open.org
> > > Subject: Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> > > PAGE_SIZE_MASK property
> > >
> > > On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote:
> > > > Hi Jean,
> > > >
> > > > > -Original Message-
> > > > > From: Auger Eric 
> > > > > Sent: Thursday, March 26, 2020 4:50 PM
> > > > > To: Jean-Philippe Brucker 
> > > > > Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan
> > > > > 
> > > > > Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add
> > > > > PAGE_SIZE_MASK property
> > > > >
> > > > > External Email
> > > > >
> > > > > --
> > > > > --
> > > > > --
> > > > > Hi Jean,
> > > > >
> > > > > On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote:
> > > > > > On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote:
> > > > > >> Hi Jean,
> > > > > >>
> > > > > >> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote:
> > > > > >>> Add a PROBE property to declare the mapping granularity per 
> > > > > >>> endpoint.
> > > > > >>> The virtio-iommu device already declares a granule in its
> > > > > >>> config space, but when endpoints are behind different physical
> > > > > >>> IOMMUs, they may have different mapping granules. This new
> > > > > >>> property allows to override the global page_size_mask for each 
> > > > > >>> endpoint.
> > > > > >>>
> > > > > >>> In the future it may be useful to describe more than one
> > > > > >>> page_size_mask for each endpoint, and allow them to negotiate
> > > > > >>> it during ATTACH. For example two masks could allow the driver
> > > > > >>> to choose between 4k and 64k granule, along with their
> > > > > >>> respective block mapping sizes. This could be added by
> > > > > >>> replacing \field{reserved} with an array
> > > > > length, for example.
> > > > > >> Sorry I don't get the use case where several page size bitmaps
> > > > > >> should be exposed.
> > > > > >
> > > > > > For a 4k granule you get block mappings of 2M and 1G. For a 64k
> > > > > > granule you get 512M and 4T block mappings. If you want to
> > > > > > communicate both options to the guest, you need two separate
> > > > > > masks, 0x40201000 and 0x4002001. Then the guest could choose
> > > > > > one of the granules during attach, if we add a flag to the
> > > > > > attach request. I'm not suggesting we do that now, just trying
> > > > > > to make sure it can be extended if anyone actually wants it.
> > > > > > Personally I don't think it's worth adding, especially given the
> > > > > > additional work required in
> > > the host.
> > > > > OK I get it now.
> > > >
> > > > What some clarification about two page-size-mask configurations 
> > > > available.
> > > >  - Global configuration for page-size-mask
> > > >  - per endpoint page-size-mask configuration
> > > >
> > > > PAGE_SIZE_MASK probe for and endpoint can return zero or non-zero value.
> > > > If it returns non-zero value than it will override the global 
> > > > configuration.
> > > > If PAGE_SIZE_MASK probe for and endpoint return zero value than
> > > > global pa

[virtio-dev] Re: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property

2020-03-27 Thread Jean-Philippe Brucker
On Fri, Mar 27, 2020 at 05:20:56AM +, Bharat Bhushan wrote:
> Hi Jean,
> 
> > -Original Message-
> > From: Auger Eric 
> > Sent: Thursday, March 26, 2020 4:50 PM
> > To: Jean-Philippe Brucker 
> > Cc: virtio-dev@lists.oasis-open.org; Bharat Bhushan 
> > Subject: [EXT] Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add 
> > PAGE_SIZE_MASK
> > property
> > 
> > External Email
> > 
> > ------
> > Hi Jean,
> > 
> > On 3/26/20 11:49 AM, Jean-Philippe Brucker wrote:
> > > On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote:
> > >> Hi Jean,
> > >>
> > >> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote:
> > >>> Add a PROBE property to declare the mapping granularity per endpoint.
> > >>> The virtio-iommu device already declares a granule in its config
> > >>> space, but when endpoints are behind different physical IOMMUs, they
> > >>> may have different mapping granules. This new property allows to
> > >>> override the global page_size_mask for each endpoint.
> > >>>
> > >>> In the future it may be useful to describe more than one
> > >>> page_size_mask for each endpoint, and allow them to negotiate it
> > >>> during ATTACH. For example two masks could allow the driver to
> > >>> choose between 4k and 64k granule, along with their respective block
> > >>> mapping sizes. This could be added by replacing \field{reserved} with 
> > >>> an array
> > length, for example.
> > >> Sorry I don't get the use case where several page size bitmaps should
> > >> be exposed.
> > >
> > > For a 4k granule you get block mappings of 2M and 1G. For a 64k
> > > granule you get 512M and 4T block mappings. If you want to communicate
> > > both options to the guest, you need two separate masks, 0x40201000 and
> > > 0x4002001. Then the guest could choose one of the granules during
> > > attach, if we add a flag to the attach request. I'm not suggesting we
> > > do that now, just trying to make sure it can be extended if anyone
> > > actually wants it. Personally I don't think it's worth adding,
> > > especially given the additional work required in the host.
> > OK I get it now.
> 
> What some clarification about two page-size-mask configurations available.
>  - Global configuration for page-size-mask
>  - per endpoint page-size-mask configuration
> 
> PAGE_SIZE_MASK probe for and endpoint can return zero or non-zero value.
> If it returns non-zero value than it will override the global configuration.
> If PAGE_SIZE_MASK probe for and endpoint return zero value than global 
> page-size-mask configuration will be used.
> 
> Is that correct?

Yes. If a PAGE_SIZE_MASK property is available for an endpoint, the driver
should use that mask. Otherwise it should use the global mask, which is
always provided.

I wonder, should we introduce some form of negotiation now?  If the driver
doesn't know about the new probe property, it will use the global mask. At
some point it will send a MAP request not aligned on the page granule, and
the device will abort the request. If instead we add a flag and page mask
field to the attach request, the device would know that the driver didn't
understand the per-endpoint page mask and abort the attach.

Thanks,
Jean


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property

2020-03-26 Thread Jean-Philippe Brucker
On Mon, Mar 23, 2020 at 03:22:40PM +0100, Auger Eric wrote:
> Hi Jean,
> 
> On 3/23/20 2:38 PM, Jean-Philippe Brucker wrote:
> > Add a PROBE property to declare the mapping granularity per endpoint.
> > The virtio-iommu device already declares a granule in its config space,
> > but when endpoints are behind different physical IOMMUs, they may have
> > different mapping granules. This new property allows to override the
> > global page_size_mask for each endpoint.
> > 
> > In the future it may be useful to describe more than one page_size_mask
> > for each endpoint, and allow them to negotiate it during ATTACH. For
> > example two masks could allow the driver to choose between 4k and 64k
> > granule, along with their respective block mapping sizes. This could be
> > added by replacing \field{reserved} with an array length, for example.
> Sorry I don't get the use case where several page size bitmaps should be
> exposed.

For a 4k granule you get block mappings of 2M and 1G. For a 64k granule
you get 512M and 4T block mappings. If you want to communicate both
options to the guest, you need two separate masks, 0x40201000 and
0x4002001. Then the guest could choose one of the granules during
attach, if we add a flag to the attach request. I'm not suggesting we do
that now, just trying to make sure it can be extended if anyone actually
wants it. Personally I don't think it's worth adding, especially given the
additional work required in the host.

> > 
> > Signed-off-by: Jean-Philippe Brucker 
> > ---
> > I had this change planned for a page-table sharing extension, but it
> > seems to be useful for the VFIO support in QEMU as well [1].
> > 
> > [1] 
> > https://lore.kernel.org/qemu-devel/3da60c1b-6897-7ab1-3a67-bec44fa00...@redhat.com/
> > ---
> > ---
> >  virtio-iommu.tex | 33 -
> >  1 file changed, 32 insertions(+), 1 deletion(-)
> > 
> > diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> > index 08b358a..d7be13d 100644
> > --- a/virtio-iommu.tex
> > +++ b/virtio-iommu.tex
> > @@ -664,7 +664,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / 
> > IOMMU Device / Device ope
> >  \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / 
> > Device operations / PROBE properties}
> >  
> >  \begin{lstlisting}
> > -#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
> > +#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1
> > +#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK   2
> >  \end{lstlisting}
> >  
> >  \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / 
> > Device operations / PROBE properties / RESVMEM}
> > @@ -727,6 +728,36 @@ \subsubsection{PROBE properties}\label{sec:Device 
> > Types / IOMMU Device / Device
> >  The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions
> >  to affect any other component than the endpoint and the driver.
> >  
> > +\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device 
> > / Device operations / PROBE properties / PAGESIZEMASK}
> "Property PAGESIZEMASK" to be homegeneous with "Property RESV_MEM"?

I don't understand, isn't the underscore version more homogeneous? I only
used PAGESIZEMASK and RESVMEM in the label to avoid troubles with latex,
which complains about underscores in labels (maybe because of the
underscore package). But labels are only used for resolving references and
don't appear in the final text.

> > +
> > +The PAGE_SIZE_MASK property overrides the global \field{page_size_mask}
> > +configuration for an endpoint.
> > +
> > +\begin{lstlisting}
> > +struct virtio_iommu_probe_page_size_mask {
> > +  struct virtio_iommu_probe_property head;
> > +  u8reserved[4];
> > +  le64  page_size_mask;
> > +};
> > +\end{lstlisting}
> 
> Maybe it would be useful to put in the spec the context of use of the
> new property (basically what you wrote in the commit msg). Also
> re-inforce that if no ep page size mask is returned the default one
> applies. Chapters describing page_size_mask global field may be sligtly
> augmented to mention the new prop...

I'll add that if no page size mask property is present, the global page
size mask applies. And in Device operations I'll add that these limits are
described in the device configuration as well as probe properties.

Thanks,
Jean

> > +
> > +The \field{page_size_mask} field behaves in the same way as the global
> > +\field{page_size_mask} field. The least significant bit describes the
> > +mapping granularity, and additional bits are hints (see \ref{sec:Device
> > +Types / IOMMU Device / D

[virtio-dev] [RFC PATCH] virtio-iommu: Add PAGE_SIZE_MASK property

2020-03-23 Thread Jean-Philippe Brucker
Add a PROBE property to declare the mapping granularity per endpoint.
The virtio-iommu device already declares a granule in its config space,
but when endpoints are behind different physical IOMMUs, they may have
different mapping granules. This new property allows to override the
global page_size_mask for each endpoint.

In the future it may be useful to describe more than one page_size_mask
for each endpoint, and allow them to negotiate it during ATTACH. For
example two masks could allow the driver to choose between 4k and 64k
granule, along with their respective block mapping sizes. This could be
added by replacing \field{reserved} with an array length, for example.

Signed-off-by: Jean-Philippe Brucker 
---
I had this change planned for a page-table sharing extension, but it
seems to be useful for the VFIO support in QEMU as well [1].

[1] 
https://lore.kernel.org/qemu-devel/3da60c1b-6897-7ab1-3a67-bec44fa00...@redhat.com/
---
---
 virtio-iommu.tex | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 08b358a..d7be13d 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -664,7 +664,8 @@ \subsubsection{PROBE request}\label{sec:Device Types / 
IOMMU Device / Device ope
 \subsubsection{PROBE properties}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties}
 
 \begin{lstlisting}
-#define VIRTIO_IOMMU_PROBE_T_RESV_MEM   1
+#define VIRTIO_IOMMU_PROBE_T_RESV_MEM 1
+#define VIRTIO_IOMMU_PROBE_T_PAGE_SIZE_MASK   2
 \end{lstlisting}
 
 \paragraph{Property RESV_MEM}\label{sec:Device Types / IOMMU Device / Device 
operations / PROBE properties / RESVMEM}
@@ -727,6 +728,36 @@ \subsubsection{PROBE properties}\label{sec:Device Types / 
IOMMU Device / Device
 The device SHOULD NOT allow accesses from the endpoint to RESV_MEM regions
 to affect any other component than the endpoint and the driver.
 
+\paragraph{Property PAGE_SIZE_MASK}\label{sec:Device Types / IOMMU Device / 
Device operations / PROBE properties / PAGESIZEMASK}
+
+The PAGE_SIZE_MASK property overrides the global \field{page_size_mask}
+configuration for an endpoint.
+
+\begin{lstlisting}
+struct virtio_iommu_probe_page_size_mask {
+  struct virtio_iommu_probe_property head;
+  u8reserved[4];
+  le64  page_size_mask;
+};
+\end{lstlisting}
+
+The \field{page_size_mask} field behaves in the same way as the global
+\field{page_size_mask} field. The least significant bit describes the
+mapping granularity, and additional bits are hints (see \ref{sec:Device
+Types / IOMMU Device / Device operations})
+
+\drivernormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGESIZEMASK}
+
+The driver MUST ignore \field{reserved}.
+
+\devicenormative{\subparagraph}{Property PAGE_SIZE_MASK}{Device Types / IOMMU 
Device / Device operations / PROBE properties / PAGESIZEMASK}
+
+The device SHOULD set \field{reserved} to zero.
+
+The device MUST set at least one bit in \field{page_size_mask}, describing
+the page granularity. The device MAY set more than one bit in
+\field{page_size_mask}.
+
 \subsubsection{Fault reporting}\label{sev:Device Types / IOMMU Device / Device 
operations / Fault reporting}
 
 The device can report translation faults and other significant
-- 
2.25.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH] virtio-iommu: Remove invalid requirement about padding

2020-03-23 Thread Jean-Philippe Brucker
This reference to 'padding' is a leftover from a previous draft of the
virtio-iommu device. The field doesn't exist anymore, remove the
requirement.

Signed-off-by: Jean-Philippe Brucker 
---
I just noticed this was still here, sorry for the delay. Previous
discussion:
https://lists.oasis-open.org/archives/virtio-dev/201911/msg00084.html
---
 virtio-iommu.tex | 2 --
 1 file changed, 2 deletions(-)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 28c562b..08b358a 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -106,8 +106,6 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 
 \devicenormative{\subsubsection}{Device configuration layout}{Device Types / 
IOMMU Device / Device configuration layout}
 
-The device SHOULD set \field{padding} to zero.
-
 The device MUST set at least one bit in \field{page_size_mask}, describing
 the page granularity. The device MAY set more than one bit in
 \field{page_size_mask}.
-- 
2.25.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-12-18 Thread Jean-Philippe Brucker
On Tue, Dec 03, 2019 at 07:01:36PM -0800, Jacob Pan (Jun) wrote:
> Hi Jean,
> 
> Sorry for the delay, I was out last week. Comments inline below.
> 
> On Mon, 25 Nov 2019 19:02:47 +0100
> Jean-Philippe Brucker  wrote:
> 
> > On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:
> > > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD
> > > > and IORT for Arm). From my point of view IORT is easier to
> > > > extend, since we just need to introduce a new node type. There
> > > > are no dependencies to Arm in the Linux IORT driver, so it works
> > > > well with CONFIG_X86. 
> > > From my limited understanding, IORT and VIOT is to solve device
> > > topology enumeration only? I am not sure how it can be expanded to
> > > cover information beyond device topology. e.g. DMAR has NUMA
> > > information and root port ATS, I guess they are not used today in
> > > the guest but might be additions in the future.  
> > 
> > The PCI root-complex node of IORT has an ATS attribute, which we can
> > already use. However its scope is the root complex, not individual
> > root ports like with DMAR.
> > 
> > I'm not very familiar with NUMA, but it looks like we just need to
> > specify a proximity domain in relation to the SRAT table, for each
> > viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain"
> > field for this. We can add the same to the VIOT virtio-iommu nodes
> > later, since the structures are extensible.
> > 
> I think there the proximity domain is more for each assigned device
> than vIOMMU. vIOMMU in the guest can have assigned devices belong to
> different pIOMMU and proximity domains. If the guest owns the first
> level page tables (gIOVA or SVA), we want to make sure page tables are
> allocated from the close proximity domain.
> 
> My understanding is virtio IOMMU supports both virtio devices and
> assigned devices. we could care less about the former in terms of NUMA.
> 
> In ACPI, we have _PXM method to retrieve device proximity domain. I
> don't know if there is something equivalent or a generic way to get
> _PXM information. I think VMM also need to make sure when an assigned
> device is used with vIOMMU, there are some memory is allocated from the
> device's proximity domain.
> 
> > But it might be better to keep the bare minimum information in the FW
> > descriptor, and put the rest in the virtio-iommu. So yes topology
> > enumeration is something the device cannot do itself (not fully that
> > is, see (2)) but for the rest, virtio-iommu's PROBE request can
> > provide details about each endpoint in relation to their physical
> > IOMMU.
> > 
> > We could for example add a bit in a PROBE property saying that the
> > whole path between the IOMMU and the endpoint supports ATS. For NUMA
> > it might also be more interesting to have a finer granularity, since
> > one viommu could be managing endpoints that are behind different
> > physical IOMMUs. If in the future we want to allocate page tables
> > close to the physical IOMMU for example, we might need to describe
> > multiple NUMA nodes per viommu, using the PROBE request.
> > 
> Should we reinvent something for NUMA or use ACPI's SRAT, _PXM? 

Regardless whether we put it in the VIOT or in the virtio-iommu PROBE
request, we necessarily need to reuse the node IDs defined by SRAT (or
numa-node-id on devicetree, also a 32-bit value). A virtio-pci based
virtio-iommu already has the _PXM of its closest bridge and wouldn't need
anything more in the VIOT, while a virtio-mmio based virtio-iommu would
need a proximity domain field in the VIOT. That could be added later since
the table is extensible, but as you pointed out, that information might
not be very useful.

> I am not sure how it is handled today in QEMU in terms of guest-host
> NUMA proximity domain mapping.

It looks like the user can specify this guest-host mapping on the
command-line:

  -object memory-backend-ram,id=mem0,size=4G,host-nodes=3,policy=bind
  -object memory-backend-ram,id=mem1,size=4G,host-nodes=4,policy=bind
  -numa node,memdev=mem0,nodeid=numa0
  -numa node,memdev=mem1,nodeid=numa1
  -numa cpu,node-id=numa0,socket-id=0
  -numa cpu,node-id=numa1,socket-id=1

numa0 and numa1 would get proximity domains 0 and 1, corresponding to host
domains 3 and 4. It is also possible to specify the NUMA node of a PCI bus
(via the PCI expander bridge), and therefore to assign a VFIO PCI device
in the same proximity domain as its physical location.

  -device pxb,id=bridge1,bus=pci.0,numa_node=1 (simplified)
  -device vfio-pci,host=03:01.0,bus=bridge1

Linux can use this information to allocate DMA close to the endp

[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-11-25 Thread Jean-Philippe Brucker
On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:
> > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD and
> > IORT for Arm). From my point of view IORT is easier to extend, since
> > we just need to introduce a new node type. There are no dependencies
> > to Arm in the Linux IORT driver, so it works well with CONFIG_X86.
> > 
> From my limited understanding, IORT and VIOT is to solve device topology
> enumeration only? I am not sure how it can be expanded to cover
> information beyond device topology. e.g. DMAR has NUMA information and
> root port ATS, I guess they are not used today in the guest but might
> be additions in the future.

The PCI root-complex node of IORT has an ATS attribute, which we can
already use. However its scope is the root complex, not individual root
ports like with DMAR.

I'm not very familiar with NUMA, but it looks like we just need to specify
a proximity domain in relation to the SRAT table, for each viommu? The
SMMUv3 node in IORT has a 4-bytes "proximity domain" field for this. We
can add the same to the VIOT virtio-iommu nodes later, since the
structures are extensible.

But it might be better to keep the bare minimum information in the FW
descriptor, and put the rest in the virtio-iommu. So yes topology
enumeration is something the device cannot do itself (not fully that is,
see (2)) but for the rest, virtio-iommu's PROBE request can provide
details about each endpoint in relation to their physical IOMMU.

We could for example add a bit in a PROBE property saying that the whole
path between the IOMMU and the endpoint supports ATS. For NUMA it might
also be more interesting to have a finer granularity, since one viommu
could be managing endpoints that are behind different physical IOMMUs. If
in the future we want to allocate page tables close to the physical IOMMU
for example, we might need to describe multiple NUMA nodes per viommu,
using the PROBE request.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-11-25 Thread Jean-Philippe Brucker
On Fri, Nov 22, 2019 at 08:00:46AM -0500, Michael S. Tsirkin wrote:
> > (2) In addition, there are some concerns about having virtio depend on
> > ACPI or DT. Some hypervisors (Firecracker, QEMU microvm, kvmtool x86
> > [1])
> 
> power?

In kvmtool it boot with device tree. It also doesn't need virtio-iommu I
think, since it has its own paravirtualized interface.

> > don't currently implement those methods.
> > 
> > It was suggested to embed the topology description into the device.
> > It can work, as demonstrated at the end of this RFC, with the
> > following limitations:
> > 
> > - The topology description must be read before any endpoint managed
> >   by the IOMMU is probed, and even before the virtio module is
> >   loaded. This RFC uses a PCI quirk to manually parse the virtio
> >   configuration. It assumes that all endpoints managed by the IOMMU
> >   are under this same PCI host.
> > 
> > - I don't have a solution for the virtio-mmio transport at the
> >   moment, because I haven't had time to modify a host to test it. I
> >   think it could either use a notifier on the platform bus, or
> >   better, a new 'iommu' command-line argument to the virtio-mmio
> >   driver.
> 
>   A notifier seems easier for users. What are the disadvantages of
>   that?

For each device we have to check if it's virtio-mmio, then map the MMIO
resource and check the device type. Having a dedicated command-line
argument would be more efficient.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification

2019-11-25 Thread Jean-Philippe Brucker
Hi Jan,

On Mon, Nov 25, 2019 at 08:30:29AM +0100, Jan Kiszka wrote:
> What's the impact of a fault on the device(s) under the IOMMU regime? Can
> they recover?

Are you asking about what happens to the endpoints when the virtio-iommu
encounters an internal error? Or what happens to the endpoints if their
DMA transactions fails translation? I think they are both equivalent to
"what happens when the endpoint's memory transaction aborts?". The answer
to that depends on the bus and endpoint, and is out of scope. The
virtio-iommu spec could state that in those cases, we abort the memory
transaction, but it's too vague since we don't know the specifics of the
bus, and it isn't necessarily true (see VT-d and SMMU below).

> Or will they get DEVICE_NEEDS_RESET as well?

If the endpoint is virtio, then the behavior upon DMA fault should be
specified by the virtio transport, because it could happen without an
IOMMU (e.g. trying to access a physical address that isn't mapped to RAM
or MMIO), or with a VT-d emulation for example.

But it's not necessarily virtio. It can be a hardware passed-through
endpoint, in which case the abort behavior depends on the physical IOMMU,
which virtio-iommu doesn't know anything about, in addition to the
physical bus and endpoint.

I also wouldn't state that the whole device (or function, though we're not
necessarily PCI) needs reset. It might be possible for some devices to
only stop the faulting queue and leave the others running, to avoid
disturbing the rest of the system.

> With PCI device
> behind real IOMMUs, it's normal that they need a reset after having caused a
> fault. I'm not sure if this is described in the related specs for them, but
> it should be clarify for the virtual IOMMU. But this can be done on top,
> IMHO.

The device behaviour is generally not specified. However their spec can
say something about the bus:

* For Intel VT-d see 7.2 and 7.2.1 (Non-Recoverable Address Translation
  Faults), where the spec provides various implementation examples.

  "Requests that encounter non-recoverable address translation faults are
  aborted by the remapping hardware, and typically require a reset of the
  device (such as through a function-level-reset) to recover and
  re-initialize the device to put it back into service."

  So could be aborted, but as stated later in 7.2.1, can also be
  redirected to a catch-all memory location.

* For Arm SMMU, the host driver can specify for each context whether
  the SMMU should return an abort (Slave Error on the AMBA bus) or not
  (read-zero, write-ignore).

  The spec also says "The behavior of the client device after termination
  is specific to the device." (3.12.1 Terminate model)

* For AMD IOMMU, "when the IOMMU detects an I/O page fault, it target
  aborts the faulting request." and "the IOMMU sets the legacy PCI
  Signaled Target Abort bit, if applicable" (2.1.3.2 I/O Page Faults).
  I believe the equivalent for the PCIe bus is a Completer Abort response.

They can specify the behaviour with some precision, because they also
specify how the IOMMU is integrated with the system. We don't have this
luxury, because if the virtio-iommu is just a proxy for a physical IOMMU,
we don't know how aborts are configured, and the bus may be a variant of
PCI, AMBA or something else.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 07/13] ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode

2019-11-22 Thread Jean-Philippe Brucker
When the IOMMU is PCI-based, IORT doesn't know the fwnode until the
driver has had a chance to register it. In addition to deferring the
probe until the IOMMU ops are set, also defer the probe until the fwspec
is available.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c | 54 ++---
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index b517aa4e83ba..f08f72d8af78 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -61,6 +61,22 @@ static bool iort_type_matches(u8 type, enum 
iort_node_category category)
}
 }
 
+static inline bool iort_iommu_driver_enabled(u8 type)
+{
+   switch (type) {
+   case ACPI_IORT_NODE_SMMU_V3:
+   return IS_BUILTIN(CONFIG_ARM_SMMU_V3);
+   case ACPI_IORT_NODE_SMMU:
+   return IS_BUILTIN(CONFIG_ARM_SMMU);
+   case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU:
+   case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU:
+   return IS_ENABLED(CONFIG_VIRTIO_IOMMU);
+   default:
+   pr_warn("IORT node type %u does not describe an IOMMU\n", type);
+   return false;
+   }
+}
+
 /**
  * iort_set_fwnode() - Create iort_fwnode and use it to register
  *iommu data in the iort_fwnode_list
@@ -102,9 +118,9 @@ static inline int iort_set_fwnode(struct acpi_iort_node 
*iort_node,
  *
  * Returns: fwnode_handle pointer on success, NULL on failure
  */
-static inline struct fwnode_handle *iort_get_fwnode(
-   struct acpi_iort_node *node)
+static inline struct fwnode_handle *iort_get_fwnode(struct acpi_iort_node 
*node)
 {
+   int err = -ENODEV;
struct iort_fwnode *curr;
struct fwnode_handle *fwnode = NULL;
 
@@ -112,12 +128,20 @@ static inline struct fwnode_handle *iort_get_fwnode(
list_for_each_entry(curr, _fwnode_list, list) {
if (curr->iort_node == node) {
fwnode = curr->fwnode;
+   if (!fwnode && curr->pci_devid) {
+   /*
+* Postpone probe until virtio-iommu has
+* registered its fwnode.
+*/
+   err = iort_iommu_driver_enabled(node->type) ?
+   -EPROBE_DEFER : -ENODEV;
+   }
break;
}
}
spin_unlock(_fwnode_lock);
 
-   return fwnode;
+   return fwnode ?: ERR_PTR(err);
 }
 
 /**
@@ -874,22 +898,6 @@ int iort_iommu_msi_get_resv_regions(struct device *dev, 
struct list_head *head)
return (resv == its->its_count) ? resv : -ENODEV;
 }
 
-static inline bool iort_iommu_driver_enabled(u8 type)
-{
-   switch (type) {
-   case ACPI_IORT_NODE_SMMU_V3:
-   return IS_BUILTIN(CONFIG_ARM_SMMU_V3);
-   case ACPI_IORT_NODE_SMMU:
-   return IS_BUILTIN(CONFIG_ARM_SMMU);
-   case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU:
-   case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU:
-   return IS_ENABLED(CONFIG_VIRTIO_IOMMU);
-   default:
-   pr_warn("IORT node type %u does not describe an IOMMU\n", type);
-   return false;
-   }
-}
-
 static int arm_smmu_iort_xlate(struct device *dev, u32 streamid,
   struct fwnode_handle *fwnode,
   const struct iommu_ops *ops)
@@ -920,8 +928,8 @@ static int iort_iommu_xlate(struct device *dev, struct 
acpi_iort_node *node,
return -ENODEV;
 
iort_fwnode = iort_get_fwnode(node);
-   if (!iort_fwnode)
-   return -ENODEV;
+   if (IS_ERR(iort_fwnode))
+   return PTR_ERR(iort_fwnode);
 
/*
 * If the ops look-up fails, this means that either
@@ -1618,8 +1626,8 @@ static int __init iort_add_platform_device(struct 
acpi_iort_node *node,
 
fwnode = iort_get_fwnode(node);
 
-   if (!fwnode) {
-   ret = -ENODEV;
+   if (IS_ERR(fwnode)) {
+   ret = PTR_ERR(fwnode);
goto dev_put;
}
 
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 06/13] ACPI/IORT: Support VIOT virtio-pci node

2019-11-22 Thread Jean-Philippe Brucker
When virtio-iommu uses the PCI transport, IORT doesn't instantiate the
device and doesn't create a fwnode. They will be created later by the
PCI subsystem. Store the information needed to identify the IOMMU in
iort_fwnode_list.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c | 117 +++-
 1 file changed, 93 insertions(+), 24 deletions(-)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index adc5953fffa5..b517aa4e83ba 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -30,10 +30,17 @@ struct iort_its_msi_chip {
u32 translation_id;
 };
 
+struct iort_pci_devid {
+   u16 segment;
+   u8 bus;
+   u8 devfn;
+};
+
 struct iort_fwnode {
struct list_head list;
struct acpi_iort_node *iort_node;
struct fwnode_handle *fwnode;
+   struct iort_pci_devid *pci_devid;
 };
 static LIST_HEAD(iort_fwnode_list);
 static DEFINE_SPINLOCK(iort_fwnode_lock);
@@ -44,7 +51,8 @@ static bool iort_type_matches(u8 type, enum 
iort_node_category category)
case IORT_IOMMU_TYPE:
return type == ACPI_IORT_NODE_SMMU ||
   type == ACPI_IORT_NODE_SMMU_V3 ||
-  type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU;
+  type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU ||
+  type == ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU;
case IORT_MSI_TYPE:
return type == ACPI_IORT_NODE_ITS_GROUP;
default:
@@ -59,12 +67,14 @@ static bool iort_type_matches(u8 type, enum 
iort_node_category category)
  *
  * @node: IORT table node associated with the IOMMU
  * @fwnode: fwnode associated with the IORT node
+ * @pci_devid: pci device ID associated with the IORT node, may be NULL
  *
  * Returns: 0 on success
  *  <0 on failure
  */
 static inline int iort_set_fwnode(struct acpi_iort_node *iort_node,
- struct fwnode_handle *fwnode)
+ struct fwnode_handle *fwnode,
+ struct iort_pci_devid *pci_devid)
 {
struct iort_fwnode *np;
 
@@ -76,6 +86,7 @@ static inline int iort_set_fwnode(struct acpi_iort_node 
*iort_node,
INIT_LIST_HEAD(>list);
np->iort_node = iort_node;
np->fwnode = fwnode;
+   np->pci_devid = pci_devid;
 
spin_lock(_fwnode_lock);
list_add_tail(>list, _fwnode_list);
@@ -121,6 +132,7 @@ static inline void iort_delete_fwnode(struct acpi_iort_node 
*node)
spin_lock(_fwnode_lock);
list_for_each_entry_safe(curr, tmp, _fwnode_list, list) {
if (curr->iort_node == node) {
+   kfree(curr->pci_devid);
list_del(>list);
kfree(curr);
break;
@@ -870,6 +882,7 @@ static inline bool iort_iommu_driver_enabled(u8 type)
case ACPI_IORT_NODE_SMMU:
return IS_BUILTIN(CONFIG_ARM_SMMU);
case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU:
+   case ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU:
return IS_ENABLED(CONFIG_VIRTIO_IOMMU);
default:
pr_warn("IORT node type %u does not describe an IOMMU\n", type);
@@ -1451,6 +1464,28 @@ static void __init viommu_mmio_dma_configure(struct 
device *dev,
acpi_dma_configure(dev, attr);
 }
 
+static __init struct iort_pci_devid *
+viommu_pci_get_devid(struct acpi_iort_node *node)
+{
+   unsigned int val;
+   struct iort_pci_devid *devid;
+   struct acpi_viot_iort_virtio_pci_iommu *viommu;
+
+   viommu = (struct acpi_viot_iort_virtio_pci_iommu *)node->node_data;
+
+   val = le32_to_cpu(viommu->devid);
+
+   devid = kzalloc(sizeof(*devid), GFP_KERNEL);
+   if (!devid)
+   return ERR_PTR(-ENOMEM);
+
+   devid->segment = val >> 16;
+   devid->bus = PCI_BUS_NUM(val);
+   devid->devfn = val & 0xff;
+
+   return devid;
+}
+
 struct iort_dev_config {
const char *name;
int (*dev_init)(struct acpi_iort_node *node);
@@ -1462,6 +1497,7 @@ struct iort_dev_config {
int (*dev_set_proximity)(struct device *dev,
struct acpi_iort_node *node);
int (*dev_add_platdata)(struct platform_device *pdev);
+   struct iort_pci_devid *(*dev_get_pci_devid)(struct acpi_iort_node 
*node);
 };
 
 static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
@@ -1494,6 +1530,10 @@ static const struct iort_dev_config iort_viommu_mmio_cfg 
__initconst = {
.dev_init_resources = viommu_mmio_init_resources,
 };
 
+static const struct iort_dev_config iort_viommu_pci_cfg __initconst = {
+   .dev_get_pci_devid = viommu_pci_get_devid,
+};
+
 static __init const struct iort_dev_config *iort_get_dev_cfg(
struct acpi_iort_node *node)
 {
@@ -1510,6 +1550,8 @@

[virtio-dev] [RFC 13/13] iommu/virtio: Add topology description to

2019-11-22 Thread Jean-Philippe Brucker
Some hypervisors don't implement either device-tree or ACPI, but still
need a method to describe the IOMMU topology. Read the virtio-iommu
config early and parse the topology description. Hook into the
dma_setup() callbacks to initialize the IOMMU before probing endpoints.

If the virtio-iommu uses the virtio-pci transport, this will only work
if the PCI root complex is the first device probed. We don't currently
support virtio-mmio.

Initially I tried to generate a fake IORT table and feed it to the IORT
driver, in order to avoid rewriting the whole DMA code, but it wouldn't
work with platform endpoints, which are references to items in the ACPI
table on IORT.

Signed-off-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 

---
Note that we only call virt_dma_configure() if the host didn't provide
either DT or ACPI method. If you want to test this with QEMU, you'll
need to manually disable the acpi_dma_configure() part in pci-driver.c
---
 drivers/base/platform.c   |   3 +
 drivers/iommu/Kconfig |   9 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/virtio-iommu-topology.c | 410 ++
 drivers/iommu/virtio-iommu.c  |   3 +
 drivers/pci/pci-driver.c  |   3 +
 include/linux/virtio_iommu.h  |  18 ++
 include/uapi/linux/virtio_iommu.h |  26 ++
 8 files changed, 473 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu-topology.c
 create mode 100644 include/linux/virtio_iommu.h

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index b230beb6ccb4..70b12c8ef2fb 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -1257,6 +1258,8 @@ int platform_dma_configure(struct device *dev)
} else if (has_acpi_companion(dev)) {
attr = acpi_get_dma_attr(to_acpi_device_node(dev->fwnode));
ret = acpi_dma_configure(dev, attr);
+   } else if (IS_ENABLED(CONFIG_VIRTIO_IOMMU_TOPOLOGY)) {
+   ret = virt_dma_configure(dev);
}
 
return ret;
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e6eb4f238d1a..d02c0d36019d 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -486,4 +486,13 @@ config VIRTIO_IOMMU
 
  Say Y here if you intend to run this kernel as a guest.
 
+config VIRTIO_IOMMU_TOPOLOGY
+   bool "Topology properties for the virtio-iommu"
+   depends on VIRTIO_IOMMU
+   help
+ Enable early probing of the virtio-iommu device, to detect the
+ topology description.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 4f405f926e73..6b51c4186ebc 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += virtio-iommu-topology.o
diff --git a/drivers/iommu/virtio-iommu-topology.c 
b/drivers/iommu/virtio-iommu-topology.c
new file mode 100644
index ..ec22510ace3d
--- /dev/null
+++ b/drivers/iommu/virtio-iommu-topology.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct viommu_cap_config {
+   u8 pos; /* PCI capability position */
+   u8 bar;
+   u32 length; /* structure size */
+   u32 offset; /* structure offset within the bar */
+};
+
+struct viommu_spec {
+   struct device   *dev; /* transport device */
+   struct fwnode_handle*fwnode;
+   struct iommu_ops*ops;
+   struct list_headtopology;
+   struct list_headlist;
+};
+
+struct viommu_topology {
+   union {
+   struct virtio_iommu_topo_head head;
+   struct virtio_iommu_topo_pci_range pci;
+   struct virtio_iommu_topo_endpoint ep;
+   };
+   /* Index into viommu_spec->topology */
+   struct list_head list;
+};
+
+static LIST_HEAD(viommus);
+static DEFINE_MUTEX(viommus_lock);
+
+#define VPCI_FIELD(field) offsetof(struct virtio_pci_cap, field)
+
+static inline int viommu_find_capability(struct pci_dev *dev, u8 cfg_type,
+struct viommu_cap_config *cap)
+{
+   int pos;
+   u8 bar;
+
+   for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+pos > 0;
+pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 type;
+
+   pci_read_config_byte(dev, pos + VPCI_FIELD(cfg_type), );
+   if (type != cfg_type)
+   continue;
+
+ 

[virtio-dev] [RFC 08/13] ACPI/IORT: Add callback to update a device's fwnode

2019-11-22 Thread Jean-Philippe Brucker
For a PCI-based IOMMU, IORT isn't in charge of allocating a fwnode. Let
the IOMMU driver update the fwnode associated to an IORT node when
available.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c   | 38 ++
 include/linux/acpi_iort.h |  4 
 2 files changed, 42 insertions(+)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index f08f72d8af78..8263ab275b2b 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -1038,11 +1038,49 @@ const struct iommu_ops *iort_iommu_configure(struct 
device *dev)
 
return ops;
 }
+
+/**
+ * iort_iommu_update_fwnode - update fwnode of a PCI IOMMU
+ * @dev: the IOMMU device
+ * @fwnode: the fwnode, or NULL to remove an existing fwnode
+ *
+ * A PCI device isn't instantiated by the IORT driver. The IOMMU driver sets or
+ * removes its fwnode using this function.
+ */
+void iort_iommu_update_fwnode(struct device *dev, struct fwnode_handle *fwnode)
+{
+   struct pci_dev *pdev;
+   struct iort_fwnode *curr;
+   struct iort_pci_devid *devid;
+
+   if (!dev_is_pci(dev))
+   return;
+
+   pdev = to_pci_dev(dev);
+
+   spin_lock(_fwnode_lock);
+   list_for_each_entry(curr, _fwnode_list, list) {
+   devid = curr->pci_devid;
+   if (devid &&
+   pci_domain_nr(pdev->bus) == devid->segment &&
+   pdev->bus->number == devid->bus &&
+   pdev->devfn == devid->devfn) {
+   WARN_ON(fwnode && curr->fwnode);
+   curr->fwnode = fwnode;
+   break;
+   }
+   }
+   spin_unlock(_fwnode_lock);
+}
+EXPORT_SYMBOL_GPL(iort_iommu_update_fwnode);
 #else
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
 const struct iommu_ops *iort_iommu_configure(struct device *dev)
 { return NULL; }
+static void iort_iommu_update_fwnode(struct device *dev,
+struct fwnode_handle *fwnode)
+{ }
 #endif
 
 static int nc_dma_get_range(struct device *dev, u64 *size)
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index f4db5fff07cf..840635e40d9d 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -43,6 +43,7 @@ int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
 const struct iommu_ops *iort_iommu_configure(struct device *dev);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
+void iort_iommu_update_fwnode(struct device *dev, struct fwnode_handle 
*fwnode);
 #else
 static void acpi_iort_register_table(struct acpi_table_header *table,
 enum iort_table_source source)
@@ -63,6 +64,9 @@ static inline const struct iommu_ops *iort_iommu_configure(
 static inline
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
 { return 0; }
+static void iort_iommu_update_fwnode(struct device *dev,
+struct fwnode_handle *fwnode)
+{ }
 #endif
 
 #endif /* __ACPI_IORT_H__ */
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC virtio 12/13] virtio-iommu: Add built-in topology description

2019-11-22 Thread Jean-Philippe Brucker
Add a lightweight method to describe the IOMMU topology in the config
space, guarded by a new feature bit. A list of capabilities in the
config space describes the devices managed by the IOMMU and their
endpoint IDs.

Signed-off-by: Jean-Philippe Brucker 
---
 virtio-iommu.tex | 88 
 1 file changed, 88 insertions(+)

diff --git a/virtio-iommu.tex b/virtio-iommu.tex
index 28c562b..2b29873 100644
--- a/virtio-iommu.tex
+++ b/virtio-iommu.tex
@@ -67,6 +67,9 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU 
Device / Feature bits}
 
 \item[VIRTIO_IOMMU_F_MMIO (5)]
   The VIRTIO_IOMMU_MAP_F_MMIO flag is available.
+
+\item[VIRTIO_IOMMU_F_TOPOLOGY (6)]
+  Topology description is available at \field{topo_offset}.
 \end{description}
 
 \drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device / 
Feature bits}
@@ -97,6 +100,7 @@ \subsection{Device configuration layout}\label{sec:Device 
Types / IOMMU Device /
 le32 end;
   } domain_range;
   le32 probe_size;
+  le16 topo_offset;
 };
 \end{lstlisting}
 
@@ -141,6 +145,90 @@ \subsection{Device initialization}\label{sec:Device Types 
/ IOMMU Device / Devic
 If the driver does not accept the VIRTIO_IOMMU_F_BYPASS feature, the
 device SHOULD NOT let endpoints access the guest-physical address space.
 
+\subsubsection{Built-in topology description}\label{sec:Device Types / IOMMU 
Device / Device initialization / topology}
+
+The device manages memory accesses from endpoints, identified by endpoint
+IDs. The driver can discover which endpoint ID corresponds to an endpoint
+using several methods, depending on the platform. Platforms described
+with device tree use the \texttt{iommus} and \texttt{iommu-map} properties
+embedded into device nodes for this purpose. Platforms described with
+ACPI use a table such as the Virtual I/O Table. Platforms that do not
+support either device tree or ACPI may embed a minimalistic description
+in the device configuration space.
+
+An important disadvantage of describing the topology from within the
+device is the lack of initialization ordering information. Out-of-band
+descriptions such as device tree and ACPI let the operating system know
+about device dependencies so that it can initialize supplier devices
+(IOMMUs) before their consumers (endpoints). Platforms using the
+VIRTIO_IOMMU_F_TOPOLOGY feature have to communicate the device dependency
+in another way.
+
+If the VIRTIO_IOMMU_F_TOPOLOGY feature is negotiated, \field{topo_offset}
+is the offset between the beginning of the device-specific configuration
+space (virtio_iommu_config) and the first topology structure header. A
+topology structures defines the endpoint ID of one or more endpoints
+managed by the virtio-iommu device.
+
+\begin{lstlisting}
+struct virtio_iommu_topo_head {
+  le16 type;
+  le16 next;
+};
+\end{lstlisting}
+
+\field{next} is the offset between the beginning of the device-specific
+configuration space and the next topology structure header. When
+\field{next} is zero, this is the last structure.
+
+\field{type} describes the type of structure:
+\begin{description}
+  \item[VIRTIO_IOMMU_TOPO_PCI_RANGE (0)] struct virtio_iommu_topo_pci_range
+  \item[VIRTIO_IOMMU_TOPO_ENDPOINT (1)] struct virtio_iommu_topo_endpoint
+\end{description}
+
+\paragraph{PCI range}\label{sec:Device Types / IOMMU Device / Device 
initialization / topology / PCI range}
+
+\begin{lstlisting}
+struct virtio_iommu_topo_pci_range {
+  struct virtio_iommu_topo_head head;
+  le32 endpoint_start;
+  le16 hierarchy;
+  le16 requester_start;
+  le16 requester_end;
+  le16 reserved;
+};
+\end{lstlisting}
+
+The PCI range structure describes the endpoint IDs of a series of PCI
+devices.
+
+\begin{description}
+  \item[\field{hierarchy}] Identifier of the PCI hierarchy. Sometimes
+called PCI segment or domain number.
+  \item[\field{requester_start}] First requester ID in the range.
+  \item[\field{requester_end}] Last requester ID in the range.
+  \item[\field{endpoint_start}] First endpoint ID.
+\end{description}
+
+The correspondence between a PCI requester ID in the range
+[ requester_start; requester_end ] and its endpoint IDs is a linear
+transformation: endpoint_id = requester_id - requester_start +
+endpoint_start.
+
+\paragraph{Single endpoint}\label{sec:Device Types / IOMMU Device / Device 
initialization / topology / Single endpoint}
+
+\begin{lstlisting}
+struct virtio_iommu_topo_endpoint {
+  struct virtio_iommu_topo_head head;
+  le32 endpoint;
+  le64 address;
+};
+\end{lstlisting}
+
+\field{endpoint} is the ID of a single endpoint, identified by its first
+MMIO address in the physical address space.
+
 \subsection{Device operations}\label{sec:Device Types / IOMMU Device / Device 
operations}
 
 Driver send requests on the request virtqueue, notifies the device and
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr

[virtio-dev] [RFC 09/13] iommu/virtio: Create fwnode if necessary

2019-11-22 Thread Jean-Philippe Brucker
The presence of a fwnode on a PCI device depends on the platform. QEMU
q35, for example, creates an ACPI description for each PCI slot, but
QEMU virt (aarch64) doesn't. Since the IOMMU subsystem relies heavily on
fwnode to discover the DMA topology, create a fwnode for the
virtio-iommu if necessary, using the software_node framework.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 56 
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 3ea9d7682999..8efa368134c0 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -966,6 +966,48 @@ static struct iommu_ops viommu_ops = {
.of_xlate   = viommu_of_xlate,
 };
 
+static int viommu_set_fwnode(struct viommu_dev *viommu)
+{
+   /*
+* viommu->dev is the virtio device, its parent is the associated
+* transport device.
+*/
+   struct device *dev = viommu->dev->parent;
+
+   /*
+* With device tree a fwnode is always present. With ACPI, on some
+* platforms a PCI device has a DSDT node describing the slot. On other
+* platforms, no fwnode is created and we have to do it ourselves.
+*/
+   if (!dev->fwnode) {
+   struct fwnode_handle *fwnode;
+
+   fwnode = fwnode_create_software_node(NULL, NULL);
+   if (IS_ERR(fwnode))
+   return PTR_ERR(fwnode);
+
+   set_primary_fwnode(dev, fwnode);
+   }
+
+   iommu_device_set_fwnode(>iommu, dev->fwnode);
+   return 0;
+}
+
+static void viommu_clear_fwnode(struct viommu_dev *viommu)
+{
+   struct device *dev = viommu->dev->parent;
+
+   if (!dev->fwnode)
+   return;
+
+   if (is_software_node(dev->fwnode)) {
+   struct fwnode_handle *fwnode = dev->fwnode;
+
+   set_primary_fwnode(dev, NULL);
+   fwnode_remove_software_node(fwnode);
+   }
+}
+
 static int viommu_init_vqs(struct viommu_dev *viommu)
 {
struct virtio_device *vdev = dev_to_virtio(viommu->dev);
@@ -1004,7 +1046,6 @@ static int viommu_fill_evtq(struct viommu_dev *viommu)
 
 static int viommu_probe(struct virtio_device *vdev)
 {
-   struct device *parent_dev = vdev->dev.parent;
struct viommu_dev *viommu = NULL;
struct device *dev = >dev;
u64 input_start = 0;
@@ -1084,9 +1125,11 @@ static int viommu_probe(struct virtio_device *vdev)
if (ret)
goto err_free_vqs;
 
-   iommu_device_set_ops(>iommu, _ops);
-   iommu_device_set_fwnode(>iommu, parent_dev->fwnode);
+   ret = viommu_set_fwnode(viommu);
+   if (ret)
+   goto err_sysfs_remove;
 
+   iommu_device_set_ops(>iommu, _ops);
iommu_device_register(>iommu);
 
 #ifdef CONFIG_PCI
@@ -1119,8 +1162,10 @@ static int viommu_probe(struct virtio_device *vdev)
return 0;
 
 err_unregister:
-   iommu_device_sysfs_remove(>iommu);
iommu_device_unregister(>iommu);
+   viommu_clear_fwnode(viommu);
+err_sysfs_remove:
+   iommu_device_sysfs_remove(>iommu);
 err_free_vqs:
vdev->config->del_vqs(vdev);
 
@@ -1131,8 +1176,9 @@ static void viommu_remove(struct virtio_device *vdev)
 {
struct viommu_dev *viommu = vdev->priv;
 
-   iommu_device_sysfs_remove(>iommu);
iommu_device_unregister(>iommu);
+   viommu_clear_fwnode(viommu);
+   iommu_device_sysfs_remove(>iommu);
 
/* Stop all virtqueues */
vdev->config->reset(vdev);
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 11/13] ACPI: Add VIOT table

2019-11-22 Thread Jean-Philippe Brucker
Add support for a new ACPI table that embeds other tables describing a
platform's IOMMU topology. Currently the only supported base table is
IORT. The VIOT contains an IORT with additional node types, that
describe a virtio-iommu.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/Kconfig  |  4 
 drivers/acpi/Makefile |  1 +
 drivers/acpi/bus.c|  2 ++
 drivers/acpi/tables.c |  2 +-
 drivers/acpi/viot.c   | 44 +++
 drivers/iommu/Kconfig |  1 +
 include/linux/acpi_viot.h | 20 ++
 7 files changed, 73 insertions(+), 1 deletion(-)
 create mode 100644 drivers/acpi/viot.c
 create mode 100644 include/linux/acpi_viot.h

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 548976c8b2b0..513a5e4d3526 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -579,6 +579,10 @@ config TPS68470_PMIC_OPREGION
 config ACPI_IORT
bool
 
+config ACPI_VIOT
+   bool
+   select ACPI_IORT
+
 endif  # ACPI
 
 config X86_PM_TIMER
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 9d1792165713..6abdc6cc32c7 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -124,3 +124,4 @@ obj-y   += dptf/
 
 obj-$(CONFIG_ARM64)+= arm64/
 obj-$(CONFIG_ACPI_IORT)+= iort.o
+obj-$(CONFIG_ACPI_VIOT)+= viot.o
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 48bc96d45bab..6f364e0c9240 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -25,6 +25,7 @@
 #include 
 #endif
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1246,6 +1247,7 @@ static int __init acpi_init(void)
 
pci_mmcfg_late_init();
acpi_iort_init();
+   acpi_viot_init();
acpi_scan_init();
acpi_ec_init();
acpi_debugfs_init();
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 180ac4329763..9662ea5e1064 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -501,7 +501,7 @@ static const char * const table_sigs[] = {
ACPI_SIG_WDDT, ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT,
ACPI_SIG_PSDT, ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT,
ACPI_SIG_IORT, ACPI_SIG_NFIT, ACPI_SIG_HMAT, ACPI_SIG_PPTT,
-   NULL };
+   ACPI_SIG_VIOT, NULL };
 
 #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
 
diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c
new file mode 100644
index ..ab9a6e43ad9b
--- /dev/null
+++ b/drivers/acpi/viot.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2019 Linaro
+ *
+ * Virtual IOMMU table
+ */
+#define pr_fmt(fmt)"ACPI: VIOT: " fmt
+
+#include 
+#include 
+#include 
+
+int __init acpi_viot_init(void)
+{
+   struct acpi_table_viot *viot;
+   struct acpi_table_header *acpi_header;
+   acpi_status status;
+
+   status = acpi_get_table(ACPI_SIG_VIOT, 0, _header);
+   if (ACPI_FAILURE(status)) {
+   if (status != AE_NOT_FOUND) {
+   const char *msg = acpi_format_exception(status);
+
+   pr_err("Failed to get table, %s\n", msg);
+   return -EINVAL;
+   }
+
+   return 0;
+   }
+
+   if (acpi_header->length < sizeof(*viot)) {
+   pr_err("VIOT table overflow, bad table!\n");
+   return -EINVAL;
+   }
+
+   viot = (struct acpi_table_viot *)acpi_header;
+   if (ACPI_COMPARE_NAMESEG(viot->base_table.signature, ACPI_SIG_IORT)) {
+   acpi_iort_register_table(>base_table, IORT_SOURCE_VIOT);
+   return 0;
+   }
+
+   pr_err("Unknown base table header\n");
+   return -EINVAL;
+}
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e3842eabcfdd..e6eb4f238d1a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -480,6 +480,7 @@ config VIRTIO_IOMMU
depends on ARM64
select IOMMU_API
select INTERVAL_TREE
+   select ACPI_VIOT if ACPI
help
  Para-virtualised IOMMU driver with virtio.
 
diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h
new file mode 100644
index ..6c282d5eb793
--- /dev/null
+++ b/include/linux/acpi_viot.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2019 Linaro
+ */
+
+#ifndef __ACPI_VIOT_H__
+#define __ACPI_VIOT_H__
+
+#ifdef CONFIG_ACPI_VIOT
+
+int acpi_viot_init(void);
+
+#else /* !CONFIG_ACPI_VIOT */
+
+static inline int acpi_viot_init(void)
+{}
+
+#endif /* !CONFIG_ACPI_VIOT */
+
+#endif /* __ACPI_VIOT_H__ */
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 03/13] ACPI/IORT: Allow registration of external tables

2019-11-22 Thread Jean-Philippe Brucker
Add a function to register an IORT table from an external source.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c   | 22 --
 include/linux/acpi_iort.h | 10 ++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index d62a9ea26fae..9c6c91e06f8f 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -144,6 +144,7 @@ typedef acpi_status (*iort_find_node_callback)
 
 /* Root pointer to the mapped IORT table */
 static struct acpi_table_header *iort_table;
+static enum iort_table_source iort_table_source;
 
 static LIST_HEAD(iort_msi_chip_list);
 static DEFINE_SPINLOCK(iort_msi_chip_lock);
@@ -1617,11 +1618,28 @@ static void __init iort_init_platform_devices(void)
}
 }
 
+void __init acpi_iort_register_table(struct acpi_table_header *table,
+enum iort_table_source source)
+{
+   /*
+* Firmware or hypervisor should know better than give us two IORT
+* tables.
+*/
+   if (WARN_ON(iort_table))
+   return;
+
+   iort_table = table;
+   iort_table_source = source;
+
+   iort_init_platform_devices();
+}
+
 void __init acpi_iort_init(void)
 {
acpi_status status;
+   static struct acpi_table_header *table;
 
-   status = acpi_get_table(ACPI_SIG_IORT, 0, _table);
+   status = acpi_get_table(ACPI_SIG_IORT, 0, );
if (ACPI_FAILURE(status)) {
if (status != AE_NOT_FOUND) {
const char *msg = acpi_format_exception(status);
@@ -1632,5 +1650,5 @@ void __init acpi_iort_init(void)
return;
}
 
-   iort_init_platform_devices();
+   acpi_iort_register_table(table, IORT_SOURCE_IORT);
 }
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 8e7e2ec37f1b..f4db5fff07cf 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -11,6 +11,11 @@
 #include 
 #include 
 
+enum iort_table_source {
+   IORT_SOURCE_IORT,   /* The Real Thing */
+   IORT_SOURCE_VIOT,   /* Paravirtual extensions */
+};
+
 #define IORT_IRQ_MASK(irq) (irq & 0xULL)
 #define IORT_IRQ_TRIGGER_MASK(irq) ((irq >> 32) & 0xULL)
 
@@ -27,6 +32,8 @@ int iort_register_domain_token(int trans_id, phys_addr_t base,
 void iort_deregister_domain_token(int trans_id);
 struct fwnode_handle *iort_find_domain_token(int trans_id);
 #ifdef CONFIG_ACPI_IORT
+void acpi_iort_register_table(struct acpi_table_header *table,
+ enum iort_table_source source);
 void acpi_iort_init(void);
 u32 iort_msi_map_rid(struct device *dev, u32 req_id);
 struct irq_domain *iort_get_device_domain(struct device *dev, u32 req_id);
@@ -37,6 +44,9 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 
*size);
 const struct iommu_ops *iort_iommu_configure(struct device *dev);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head 
*head);
 #else
+static void acpi_iort_register_table(struct acpi_table_header *table,
+enum iort_table_source source)
+{ }
 static inline void acpi_iort_init(void) { }
 static inline u32 iort_msi_map_rid(struct device *dev, u32 req_id)
 { return req_id; }
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 10/13] iommu/virtio: Update IORT fwnode

2019-11-22 Thread Jean-Philippe Brucker
When the virtio-iommu uses the PCI transport and the topology is
described with IORT, register the PCI fwnode with IORT.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 8efa368134c0..9847552faecc 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -7,6 +7,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include 
 #include 
 #include 
 #include 
@@ -989,6 +990,8 @@ static int viommu_set_fwnode(struct viommu_dev *viommu)
set_primary_fwnode(dev, fwnode);
}
 
+   /* Tell IORT about a PCI device's fwnode */
+   iort_iommu_update_fwnode(dev, dev->fwnode);
iommu_device_set_fwnode(>iommu, dev->fwnode);
return 0;
 }
@@ -1000,6 +1003,8 @@ static void viommu_clear_fwnode(struct viommu_dev *viommu)
if (!dev->fwnode)
return;
 
+   iort_iommu_update_fwnode(dev, NULL);
+
if (is_software_node(dev->fwnode)) {
struct fwnode_handle *fwnode = dev->fwnode;
 
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 01/13] ACPI/IORT: Move IORT to the ACPI folder

2019-11-22 Thread Jean-Philippe Brucker
IORT can be used (by QEMU) to describe a virtual topology containing an
architecture-agnostic paravirtualized device.

In order to build IORT for x86 systems, the driver has to be moved outside
of arm64/. Since there is nothing specific to arm64 in the driver, it
simply requires moving Makefile and Kconfig entries.

Signed-off-by: Jean-Philippe Brucker 
---
 MAINTAINERS | 9 +
 drivers/acpi/Kconfig| 3 +++
 drivers/acpi/Makefile   | 1 +
 drivers/acpi/arm64/Kconfig  | 3 ---
 drivers/acpi/arm64/Makefile | 1 -
 drivers/acpi/{arm64 => }/iort.c | 0
 6 files changed, 13 insertions(+), 4 deletions(-)
 rename drivers/acpi/{arm64 => }/iort.c (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index eb19fad370d7..9153d278f67e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -377,6 +377,15 @@ L: platform-driver-...@vger.kernel.org
 S: Maintained
 F: drivers/platform/x86/i2c-multi-instantiate.c
 
+ACPI IORT DRIVER
+M: Lorenzo Pieralisi 
+M: Hanjun Guo 
+M: Sudeep Holla 
+L: linux-a...@vger.kernel.org
+L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
+S: Maintained
+F: drivers/acpi/iort.c
+
 ACPI PMIC DRIVERS
 M: "Rafael J. Wysocki" 
 M: Len Brown 
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index ebe1e9e5fd81..548976c8b2b0 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -576,6 +576,9 @@ config TPS68470_PMIC_OPREGION
  region, which must be available before any of the devices
  using this, are probed.
 
+config ACPI_IORT
+   bool
+
 endif  # ACPI
 
 config X86_PM_TIMER
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 5d361e4e3405..9d1792165713 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -123,3 +123,4 @@ video-objs  += acpi_video.o video_detect.o
 obj-y  += dptf/
 
 obj-$(CONFIG_ARM64)+= arm64/
+obj-$(CONFIG_ACPI_IORT)+= iort.o
diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
index 6dba187f4f2e..d0902c85d46e 100644
--- a/drivers/acpi/arm64/Kconfig
+++ b/drivers/acpi/arm64/Kconfig
@@ -3,8 +3,5 @@
 # ACPI Configuration for ARM64
 #
 
-config ACPI_IORT
-   bool
-
 config ACPI_GTDT
bool
diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 6ff50f4ed947..38771a816caf 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -1,3 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_ACPI_IORT)+= iort.o
 obj-$(CONFIG_ACPI_GTDT)+= gtdt.o
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/iort.c
similarity index 100%
rename from drivers/acpi/arm64/iort.c
rename to drivers/acpi/iort.c
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 05/13] ACPI/IORT: Support VIOT virtio-mmio node

2019-11-22 Thread Jean-Philippe Brucker
Add a new type of node to the IORT driver, that describes a virtio-iommu
device based on the virtio-mmio transport. The node is only available
when the IORT is a sub-table of the VIOT.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c | 66 ++---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index 1d43fbc0001f..adc5953fffa5 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -43,7 +43,8 @@ static bool iort_type_matches(u8 type, enum 
iort_node_category category)
switch (category) {
case IORT_IOMMU_TYPE:
return type == ACPI_IORT_NODE_SMMU ||
-  type == ACPI_IORT_NODE_SMMU_V3;
+  type == ACPI_IORT_NODE_SMMU_V3 ||
+  type == ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU;
case IORT_MSI_TYPE:
return type == ACPI_IORT_NODE_ITS_GROUP;
default:
@@ -868,8 +869,10 @@ static inline bool iort_iommu_driver_enabled(u8 type)
return IS_BUILTIN(CONFIG_ARM_SMMU_V3);
case ACPI_IORT_NODE_SMMU:
return IS_BUILTIN(CONFIG_ARM_SMMU);
+   case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU:
+   return IS_ENABLED(CONFIG_VIRTIO_IOMMU);
default:
-   pr_warn("IORT node type %u does not describe an SMMU\n", type);
+   pr_warn("IORT node type %u does not describe an IOMMU\n", type);
return false;
}
 }
@@ -1408,6 +1411,46 @@ static int __init arm_smmu_v3_pmcg_add_platdata(struct 
platform_device *pdev)
return platform_device_add_data(pdev, , sizeof(model));
 }
 
+static int __init viommu_mmio_count_resources(struct acpi_iort_node *node)
+{
+   /* Mem + IRQ */
+   return 2;
+}
+
+static void __init viommu_mmio_init_resources(struct resource *res,
+  struct acpi_iort_node *node)
+{
+   int hw_irq, trigger;
+   struct acpi_viot_iort_virtio_mmio_iommu *viommu;
+
+   viommu = (struct acpi_viot_iort_virtio_mmio_iommu *)node->node_data;
+
+   res[0].start = viommu->base_address;
+   res[0].end = viommu->base_address + viommu->span - 1;
+   res[0].flags = IORESOURCE_MEM;
+
+   hw_irq = IORT_IRQ_MASK(viommu->interrupt);
+   trigger = IORT_IRQ_TRIGGER_MASK(viommu->interrupt);
+   acpi_iort_register_irq(hw_irq, "viommu", trigger, res + 1);
+}
+
+static void __init viommu_mmio_dma_configure(struct device *dev,
+ struct acpi_iort_node *node)
+{
+   enum dev_dma_attr attr;
+   struct acpi_viot_iort_virtio_mmio_iommu *viommu;
+
+   viommu = (struct acpi_viot_iort_virtio_mmio_iommu *)node->node_data;
+
+   attr = (viommu->flags & 
ACPI_VIOT_IORT_VIRTIO_MMIO_IOMMU_CACHE_COHERENT) ?
+   DEV_DMA_COHERENT : DEV_DMA_NON_COHERENT;
+
+   dev->dma_mask = >coherent_dma_mask;
+
+   /* Configure DMA for the page table walker */
+   acpi_dma_configure(dev, attr);
+}
+
 struct iort_dev_config {
const char *name;
int (*dev_init)(struct acpi_iort_node *node);
@@ -1443,6 +1486,14 @@ static const struct iort_dev_config 
iort_arm_smmu_v3_pmcg_cfg __initconst = {
.dev_add_platdata = arm_smmu_v3_pmcg_add_platdata,
 };
 
+static const struct iort_dev_config iort_viommu_mmio_cfg __initconst = {
+   /* Probe with the generic virtio-mmio driver */
+   .name = "virtio-mmio",
+   .dev_dma_configure = viommu_mmio_dma_configure,
+   .dev_count_resources = viommu_mmio_count_resources,
+   .dev_init_resources = viommu_mmio_init_resources,
+};
+
 static __init const struct iort_dev_config *iort_get_dev_cfg(
struct acpi_iort_node *node)
 {
@@ -1453,9 +1504,16 @@ static __init const struct iort_dev_config 
*iort_get_dev_cfg(
return _arm_smmu_cfg;
case ACPI_IORT_NODE_PMCG:
return _arm_smmu_v3_pmcg_cfg;
-   default:
-   return NULL;
}
+
+   if (iort_table_source == IORT_SOURCE_VIOT) {
+   switch (node->type) {
+   case ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU:
+   return _viommu_mmio_cfg;
+   }
+   }
+
+   return NULL;
 }
 
 /**
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 04/13] ACPI/IORT: Add node categories

2019-11-22 Thread Jean-Philippe Brucker
The current node filtering won't work when introducing node types
greater than 63 (such as the virtio-iommu nodes). Add
node_type_matches() to filter nodes by category.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/acpi/iort.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/iort.c b/drivers/acpi/iort.c
index 9c6c91e06f8f..1d43fbc0001f 100644
--- a/drivers/acpi/iort.c
+++ b/drivers/acpi/iort.c
@@ -18,10 +18,10 @@
 #include 
 #include 
 
-#define IORT_TYPE_MASK(type)   (1 << (type))
-#define IORT_MSI_TYPE  (1 << ACPI_IORT_NODE_ITS_GROUP)
-#define IORT_IOMMU_TYPE((1 << ACPI_IORT_NODE_SMMU) |   \
-   (1 << ACPI_IORT_NODE_SMMU_V3))
+enum iort_node_category {
+   IORT_MSI_TYPE,
+   IORT_IOMMU_TYPE,
+};
 
 struct iort_its_msi_chip {
struct list_headlist;
@@ -38,6 +38,20 @@ struct iort_fwnode {
 static LIST_HEAD(iort_fwnode_list);
 static DEFINE_SPINLOCK(iort_fwnode_lock);
 
+static bool iort_type_matches(u8 type, enum iort_node_category category)
+{
+   switch (category) {
+   case IORT_IOMMU_TYPE:
+   return type == ACPI_IORT_NODE_SMMU ||
+  type == ACPI_IORT_NODE_SMMU_V3;
+   case IORT_MSI_TYPE:
+   return type == ACPI_IORT_NODE_ITS_GROUP;
+   default:
+   WARN_ON(1);
+   return false;
+   }
+}
+
 /**
  * iort_set_fwnode() - Create iort_fwnode and use it to register
  *iommu data in the iort_fwnode_list
@@ -397,7 +411,7 @@ static int iort_get_id_mapping_index(struct acpi_iort_node 
*node)
 
 static struct acpi_iort_node *iort_node_map_id(struct acpi_iort_node *node,
   u32 id_in, u32 *id_out,
-  u8 type_mask)
+  enum iort_node_category category)
 {
u32 id = id_in;
 
@@ -406,7 +420,7 @@ static struct acpi_iort_node *iort_node_map_id(struct 
acpi_iort_node *node,
struct acpi_iort_id_mapping *map;
int i, index;
 
-   if (IORT_TYPE_MASK(node->type) & type_mask) {
+   if (iort_type_matches(node->type, category)) {
if (id_out)
*id_out = id;
return node;
@@ -458,8 +472,8 @@ static struct acpi_iort_node *iort_node_map_id(struct 
acpi_iort_node *node,
 }
 
 static struct acpi_iort_node *iort_node_map_platform_id(
-   struct acpi_iort_node *node, u32 *id_out, u8 type_mask,
-   int index)
+   struct acpi_iort_node *node, u32 *id_out,
+   enum iort_node_category category, int index)
 {
struct acpi_iort_node *parent;
u32 id;
@@ -475,8 +489,8 @@ static struct acpi_iort_node *iort_node_map_platform_id(
 * as NC (named component) -> SMMU -> ITS. If the type is matched,
 * return the initial dev id and its parent pointer directly.
 */
-   if (!(IORT_TYPE_MASK(parent->type) & type_mask))
-   parent = iort_node_map_id(parent, id, id_out, type_mask);
+   if (!iort_type_matches(parent->type, category))
+   parent = iort_node_map_id(parent, id, id_out, category);
else
if (id_out)
*id_out = id;
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 02/13] ACPI: Add VIOT definitions

2019-11-22 Thread Jean-Philippe Brucker
This is temporary, until the VIOT table is published and these
definitions added to ACPICA.

Signed-off-by: Jean-Philippe Brucker 
---
 include/acpi/actbl2.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index e45ced27f4c3..99c1d747e9d8 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -25,6 +25,7 @@
  * the wrong signature.
  */
 #define ACPI_SIG_IORT   "IORT" /* IO Remapping Table */
+#define ACPI_SIG_VIOT   "VIOT" /* Virtual I/O Table */
 #define ACPI_SIG_IVRS   "IVRS" /* I/O Virtualization Reporting 
Structure */
 #define ACPI_SIG_LPIT   "LPIT" /* Low Power Idle Table */
 #define ACPI_SIG_MADT   "APIC" /* Multiple APIC Description Table */
@@ -412,6 +413,36 @@ struct acpi_ivrs_memory {
u64 memory_length;
 };
 
+/***
+ *
+ * VIOT - Virtual I/O Table
+ *Version 1
+ *
+ 
**/
+
+struct acpi_table_viot {
+   struct acpi_table_header header;
+   u8 reserved[12];
+   struct acpi_table_header base_table;
+};
+
+#define ACPI_VIOT_IORT_NODE_VIRTIO_PCI_IOMMU0x80
+#define ACPI_VIOT_IORT_NODE_VIRTIO_MMIO_IOMMU   0x81
+
+struct acpi_viot_iort_virtio_pci_iommu {
+   u32 devid;
+};
+
+struct acpi_viot_iort_virtio_mmio_iommu {
+   u64 base_address;
+   u64 span;
+   u64 flags;
+   u64 interrupt;
+};
+
+/* FIXME: rename this monstrosity. */
+#define ACPI_VIOT_IORT_VIRTIO_MMIO_IOMMU_CACHE_COHERENT (1<<0)
+
 
/***
  *
  * LPIT - Low Power Idle Table
-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [RFC 00/13] virtio-iommu on non-devicetree platforms

2019-11-22 Thread Jean-Philippe Brucker
I'm seeking feedback on multi-platform support for virtio-iommu. At the
moment only devicetree (DT) is supported and we don't have a pleasant
solution for other platforms. Once we figure out the topology
description, x86 support is trivial.

Since the IOMMU manages memory accesses from other devices, the guest
kernel needs to initialize the IOMMU before endpoints start issuing DMA.
It's a solved problem: firmware or hypervisor describes through DT or
ACPI tables the device dependencies, and probe of endpoints is deferred
until the IOMMU is probed. But:

(1) ACPI has one table per vendor (DMAR for Intel, IVRS for AMD and IORT
for Arm). From my point of view IORT is easier to extend, since we
just need to introduce a new node type. There are no dependencies to
Arm in the Linux IORT driver, so it works well with CONFIG_X86.

However, there are concerns about other OS vendors feeling obligated
to implement this new node, so Arm proposed introducing another ACPI
table, that can wrap any of DMAR, IVRS and IORT to extend it with
new virtual nodes. A draft of this VIOT table specification is
available at http://jpbrucker.net/virtio-iommu/viot/viot-v5.pdf

I'm afraid this could increase fragmentation as guests would need to
implement or modify their support for all of DMAR, IVRS and IORT. If
we end up doing VIOT, I suggest limiting it to IORT.

(2) In addition, there are some concerns about having virtio depend on
ACPI or DT. Some hypervisors (Firecracker, QEMU microvm, kvmtool x86
[1]) don't currently implement those methods.

It was suggested to embed the topology description into the device.
It can work, as demonstrated at the end of this RFC, with the
following limitations:

- The topology description must be read before any endpoint managed
  by the IOMMU is probed, and even before the virtio module is
  loaded. This RFC uses a PCI quirk to manually parse the virtio
  configuration. It assumes that all endpoints managed by the IOMMU
  are under this same PCI host.

- I don't have a solution for the virtio-mmio transport at the
  moment, because I haven't had time to modify a host to test it. I
  think it could either use a notifier on the platform bus, or
  better, a new 'iommu' command-line argument to the virtio-mmio
  driver. So the current prototype doesn't work for firecracker and
  microvm, which rely on virtio-mmio.

- For Arm, if the platform has an ITS, the hypervisor needs IORT or
  DT to describe it anyway. More generally, not using either ACPI or
  DT might prevent from supporting other features as well. I suspect
  the above users will have to implement a standard method sooner or
  later.

- Even when reusing as much existing code as possible, guest support
  is still going to be around a few hundred lines since we can't
  rely on the normal virtio infrastructure to be loaded at that
  point. As you can see below, the diffstat for the incomplete
  topology implementation is already bigger than the exhaustive IORT
  support, even when jumping through the VIOT hoop.

So it's a lightweight solution for very specific use-cases, and we
should still support ACPI for the general case. Multi-platform
guests such as Linux will then need to support three topology
descriptions instead of two.

In this RFC I present both solutions, but I'd rather not keep all of it.
Please see the individual patches for details:

(1) Patches 1, 3-10 add support for virtio-iommu to the Linux IORT
driver and patches 2, 11 add the VIOT glue.

(2) Patch 12 adds the built-in topology description to the virtio-iommu
specification. Patch 13 is a partial implementation for the Linux
virtio-iommu driver. It only supports PCI, not platform devices.

You can find Linux and QEMU code on my virtio-iommu/devel branches at
http://jpbrucker.net/git/linux and http://jpbrucker.net/git/qemu


I split the diffstat since there are two independent features. The first
one is for patches 1-11, and the second one for patch 13.

Jean-Philippe Brucker (11):
  ACPI/IORT: Move IORT to the ACPI folder
  ACPI: Add VIOT definitions
  ACPI/IORT: Allow registration of external tables
  ACPI/IORT: Add node categories
  ACPI/IORT: Support VIOT virtio-mmio node
  ACPI/IORT: Support VIOT virtio-pci node
  ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode
  ACPI/IORT: Add callback to update a device's fwnode
  iommu/virtio: Create fwnode if necessary
  iommu/virtio: Update IORT fwnode
  ACPI: Add VIOT table

 MAINTAINERS |   9 +
 drivers/acpi/Kconfig|   7 +
 drivers/acpi/Makefile   |   2 +
 drivers/acpi/arm64/Kconfig  |   3 -
 drivers/acpi/arm64/Makefile |   1 -
 drivers/acpi/bus.c  |   2 +
 drivers/acpi/{arm64 => }/iort.c | 317 ++--
 drivers/acpi/tables.c   | 

Re: [virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification

2019-11-20 Thread Jean-Philippe Brucker
On Wed, Nov 20, 2019 at 06:27:34PM +0100, Cornelia Huck wrote:
> > +\begin{lstlisting}
> > +struct virtio_iommu_config {
> > +  le64 page_size_mask;
> > +  struct virtio_iommu_range_64 {
> > +le64 start;
> > +le64 end;
> > +  } input_range;
> > +  struct virtio_iommu_range_32 {
> > +le32 start;
> > +le32 end;
> > +  } domain_range;
> > +  le32 probe_size;
> > +};
> > +\end{lstlisting}
> > +
> > +\drivernormative{\subsubsection}{Device configuration layout}{Device Types 
> > / IOMMU Device / Device configuration layout}
> > +
> > +The driver MUST NOT write to device configuration fields.
> > +
> > +\devicenormative{\subsubsection}{Device configuration layout}{Device Types 
> > / IOMMU Device / Device configuration layout}
> > +
> > +The device SHOULD set \field{padding} to zero.
> 
> I don't see any field named 'padding' -- is that a leftover from an
> earlier version?

Oh right, it is. The previous version had an 1-byte field before
probe_size, so the structure needed 3 bytes of padding. Since we replaced
that field by domain_range in v4, padding is gone.

> If it is, we can probably remove it as a trivial change on top after
> this change went in.

I can also send a new version (after waiting a few days for other
comments), please let me know what you prefer.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification

2019-11-20 Thread Jean-Philippe Brucker
The IOMMU device allows a guest to manage DMA mappings for physical,
emulated and paravirtualized endpoints. Add device description for the
virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
UNMAP requests, as well as translation error reporting.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
Signed-off-by: Jean-Philippe Brucker 
---
 conformance.tex  |  40 ++-
 content.tex  |   1 +
 virtio-iommu.tex | 816 +++
 3 files changed, 855 insertions(+), 2 deletions(-)
 create mode 100644 virtio-iommu.tex

diff --git a/conformance.tex b/conformance.tex
index 0ac58aa..86ccf61 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / 
Conformance Targets}
   \begin{itemize}
 \item Clause \ref{sec:Conformance / Driver Conformance}.
 \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver 
Conformance}.
-\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver 
Conformance}.
+\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver 
Conformance}.
 \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device 
and Transitional Driver Conformance}.
   \end{itemize}
 \item[Device] A device MUST conform to four conformance clauses:
   \begin{itemize}
 \item Clause \ref{sec:Conformance / Device Conformance}.
 \item One of clauses \ref{sec:Conformance / Device Conformance / PCI 
Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device 
Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device 
Conformance}.
-\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory 
Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI 
Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device 
Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device 
Conformance}.
+\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory 
Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI 
Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Socket Device 
Conformance} or \ref{sec:Conformance / Device Conformance / IOMMU Device 
Conformance}.
 \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device 
and Transitional Driver Conformance}.
   \end{itemize}
 \end{description}
@@ -183,6 +183,24 @@ \section{Conformance Targets}\label{sec:Conformance / 
Conformance Targets}
 \item \ref{drivernormative:Device Types / Socket Device / Device Operation / 
Device Events}
 \end{itemize

[virtio-dev] [PATCH RESEND v4 0/1] Add virtio-iommu specification

2019-11-20 Thread Jean-Philippe Brucker
Hi,

This is the base virtio-iommu specification. There wasn't any comment on
v4, but I have been away for a while so I'm resending the patch rebased
onto the current master.

The driver implementation has been merged in Linux, and the QEMU device
should follow soon. The next step is proper multiplatform support, for
which I'll send a proposal shortly. Then I'll work more on PASID and PRI
support, and some much needed optimizations.

I would like to request a vote for
https://github.com/oasis-tcs/virtio-spec/issues/37

Thanks,
Jean

Jean-Philippe Brucker (1):
  Add virtio-iommu device specification

 conformance.tex  |  40 ++-
 content.tex  |   1 +
 virtio-iommu.tex | 816 +++
 3 files changed, 855 insertions(+), 2 deletions(-)
 create mode 100644 virtio-iommu.tex

-- 
2.24.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH] snd: Add virtio sound device specification

2019-11-12 Thread Jean-Philippe Brucker
On Tue, Nov 12, 2019 at 02:20:44PM +, Liam Girdwood wrote:
> > Other virtio devices rely on the buffer length already provided by
> > virtio
> > for this. For example see the virtio-net control virtqueue
> > (virtio_net_ctrl*), which is extended with new commands from time to
> > time.
> > Most recently, the "receive-side scaling" feature introduces a large
> > structure as payload to virtio_net_ctrl.
> > 
> 
> Sorry, not following, what does this have to do with combining type and
> size into request_code ? Or are you saying the size is encoded
> elsewhere in the virtio transport data ?

Yes, the buffer size is encoded in the transport data.

[...]
> This would be a good improvement, it's less copying and would likely
> improve user experience, however the buffer ptr still suffers latency
> as it's queued (same for stream positions going the other way).

Right if the queuing overhead is still too large, then I don't think the
current virtqueues can be used.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH v2] snd: Add virtio sound device specification

2019-11-12 Thread Jean-Philippe Brucker
Hi Mikhail,

On Tue, Nov 05, 2019 at 04:06:06PM +0100, Mikhail Golubev wrote:
> From: Anton Yakovlev 
> 
> Hi everyone,
> 
> the patch proposes an updated version of virtio specification for a new
> virtio sound device. The initial PATCH can be found here: [1]. The
> device may be useful in case when having audio is required but a device
> passthrough or emulation is not an option.
> 
> The virtio sound card supports output and input PCM streams. It reports
> stream capabilities through device configuration space and utilizes two
> virtqueues: one for control requests and one for I/O requests.

Did you send out the implementation prototypes as well?  It might help
understand how the device works.

[...]
> +\subsection{Virtqueues}\label{sec:Device Types / Sound Device / Virtqueues}
> +
> +\begin{description}
> +\item[0] controlq
> +\item[1] pcmq
> +\end{description}
> +
> +The controlq virtqueue always exists, the pcmq virtqueue only exists if
> +the VIRTIO_SND_F_PCM_OUTPUT and/or VIRTIO_SND_F_PCM_INPUT feature is 
> negotiated.

Why not use two queues, one for input and one for output?  I think most
virtio devices do it that way. It could simplify the implementations,
shave 32 bits off the buffers, and allow to quiesce notifications for
input and output streams independently. And in the future you can extend
the device with more than two streams.

> +
> +\subsection{Feature bits}\label{sec:Device Types / Sound Device / Feature 
> bits}
> +
> +\begin{description}
> +\item[VIRTIO_SND_F_PCM_OUTPUT (0)] Output PCM stream support.
> +\item[VIRTIO_SND_F_PCM_INPUT (1)] Input PCM stream support.
> +\end{description}
> +
> +\subsection{Device configuration layout}\label{sec:Device Types / Sound 
> Device / Device configuration layout}
> +
> +\begin{lstlisting}
> +/* supported PCM sample formats */
> +enum {
> +VIRTIO_SND_PCM_FMT_S8 = 0,
> +VIRTIO_SND_PCM_FMT_U8,
> +VIRTIO_SND_PCM_FMT_S16,
> +VIRTIO_SND_PCM_FMT_U16,
> +VIRTIO_SND_PCM_FMT_S32,
> +VIRTIO_SND_PCM_FMT_U32,
> +VIRTIO_SND_PCM_FMT_FLOAT,
> +VIRTIO_SND_PCM_FMT_FLOAT64
> +};

Is it worth adding a word or two about the encoding used, in particular
that this is linear quantization?  Or maybe it's standard enough, I don't
know much about audio.

> +
> +/* supported PCM frame rates */
> +enum {
> +VIRTIO_SND_PCM_RATE_5512 = 0,
> +VIRTIO_SND_PCM_RATE_8000,
> +VIRTIO_SND_PCM_RATE_11025,
> +VIRTIO_SND_PCM_RATE_16000,
> +VIRTIO_SND_PCM_RATE_22050,
> +VIRTIO_SND_PCM_RATE_32000,
> +VIRTIO_SND_PCM_RATE_44100,
> +VIRTIO_SND_PCM_RATE_48000,
> +VIRTIO_SND_PCM_RATE_64000,
> +VIRTIO_SND_PCM_RATE_88200,
> +VIRTIO_SND_PCM_RATE_96000,
> +VIRTIO_SND_PCM_RATE_176400,
> +VIRTIO_SND_PCM_RATE_192000,
> +VIRTIO_SND_PCM_RATE_384000
> +};
> +
> +/* a PCM stream configuration */
> +struct virtio_pcm_stream_config {
> +u8 channels_min;
> +u8 channels_max;
> +le16 formats; /* 1 << VIRTIO_SND_PCM_FMT_XXX */
> +le16 rates; /* 1 << VIRTIO_SND_PCM_RATE_XXX */
> +u16 padding;

You could add a lot more padding, because the current layout doesn't allow
to extend virtio_pcm_stream_config much (an old driver will expect the
pcm.input field to start at offset 0x8 of the config space).

> +};
> +
> +/* a device configuration space */
> +struct virtio_snd_config {
> +struct virtio_pcm_config {
> +struct virtio_pcm_stream_config output;
> +struct virtio_pcm_stream_config input;
> +} pcm;
> +};
> +\end{lstlisting}
> +
> +\subsubsection{Device configuration fields}
> +
> +The \field{pcm.output} and \field{pcm.input} fields contain PCM stream
> +configuration:
> +
> +\begin{description}
> +\item[\field{channels_min}] (driver-read-only) is a minimum number of 
> supported
> +channels.
> +\item[\field{channels_max}] (driver-read-only) is a maximum number of 
> supported
> +channels.
> +\item[\field{formats}] (driver-read-only) is a supported sample format bit 
> map.
> +\item[\field{rates}] (driver-read-only) is a supported frame rate bit map.
> +\end{description}
> +
> +Only interleaved samples are supported.

Similar noob remark: I guess this is standard in audio hardware and
doesn't need more explanation?  If the term is alsa-specific then we'd
need a few words about what the format looks like.

> +
> +\subsection{Device Initialization}
> +
> +\begin{enumerate}
> +\item If the VIRTIO_SND_F_PCM_OUTPUT feature is negotiated, 
> \field{pcm.output}
> +contains the valid output PCM stream configuration.
> +\item If the VIRTIO_SND_F_PCM_INPUT feature is negotiated, \field{pcm.input}
> +contains the valid input PCM stream configuration.
> +\end{enumerate}
> +
> +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Sound 
> Device / Device Initialization}
> +
> +\begin{enumerate}
> +\item The device MUST NOT set the not defined format and rate bits.

s/not defined/undefined/ (here and elsewhere)

> +\item The device MUST initialize padding bytes 

Re: [virtio-dev] [PATCH] snd: Add virtio sound device specification

2019-11-12 Thread Jean-Philippe Brucker
Hi,

On Mon, Nov 11, 2019 at 03:20:55PM +, Liam Girdwood wrote:
> > > > +struct virtio_snd_ctl_msg {
> > > > +/* device-read-only data */
> > > > +le32 request_code;
> > > > +u8 request_payload[];
> > > 
> > > How do I know how much data to read here ? Is size embedded in
> > > request_code ?
> > 
> > Yes, you are right. Actual request/response length can be easily
> > derived from 
> > a request_code.
> > 
> 
> Nak, please use a separate size and type code otherwise we will have
> real problems in the future when we come to add new features (with new
> codes or bigger sizes).

Other virtio devices rely on the buffer length already provided by virtio
for this. For example see the virtio-net control virtqueue
(virtio_net_ctrl*), which is extended with new commands from time to time.
Most recently, the "receive-side scaling" feature introduces a large
structure as payload to virtio_net_ctrl.

[...]
> > > > +\begin{lstlisting}
> > > > +struct virtio_snd_pcm_xfer {
> > > > +le32 stream;
> > > > +u8 data[];
> > > > +le32 status;
> > > > +le32 actual_length;
> > > 
> > > Not following this, is actual_length the size of data[]. If so, it
> > > must
> > > preceed data[].
> > 
> > No, the actual_length field is supposed to be used by device side to
> > report
> > actual amount of bytes read from/written to a buffer.

By the way the written size is already provided by virtio (field len in
the used descriptor), but not the read size.

> > In real world
> > scenario, 
> > if an I/O request contains N bytes, a device can play/capture *up to*
> > N bytes. 
> > Thus, it's required to report this length back to a driver.
> > 
> 
> This is not how PCM audio buffering works. I don't copy extra padding
> if I dont need to, I only copy when my buffer is full. See the need for
> period size in earlier comments.
[...]
> > > 2) Zero copy of audio data and stream positions. Essential for any
> > > low
> > > latency audio use cases (if supported by HW, hypervisor, VM
> > > framework).
> > 
> > What do you mean by zero-copy? Shared memory between device and
> > driver sides?
> 
> Yes, but I see this is now a separate thread as it used by others too.

The virtio transport already permits zero-copy, because one buffer can be
described with a chain of descriptors, each pointing to a different area
in memory. The driver can split the virtio_snd_pcm_xfer into header and
data:

virtqueue
desc ring
|   |
+---+
|   |-> header (copied)
+- - - -+
|   |-> PCM data (mapped)
+---+
|   |

So it might be a good idea to put all the metadata (stream, status,
actual_length) at the beginning of the virtio_snd_pcm_xfer struct,
otherwise the driver will need three descriptors instead of two to achieve
this.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node

2019-05-31 Thread Jean-Philippe Brucker
On 30/05/2019 18:45, Michael S. Tsirkin wrote:
> On Thu, May 30, 2019 at 06:09:24PM +0100, Jean-Philippe Brucker wrote:
>> Some systems implement virtio-iommu as a PCI endpoint. The operating
>> system needs to discover the relationship between IOMMU and masters long
>> before the PCI endpoint gets probed. Add a PCI child node to describe the
>> virtio-iommu device.
>>
>> The virtio-pci-iommu is conceptually split between a PCI programming
>> interface and a translation component on the parent bus. The latter
>> doesn't have a node in the device tree. The virtio-pci-iommu node
>> describes both, by linking the PCI endpoint to "iommus" property of DMA
>> master nodes and to "iommu-map" properties of bus nodes.
>>
>> Reviewed-by: Rob Herring 
>> Reviewed-by: Eric Auger 
>> Signed-off-by: Jean-Philippe Brucker 
> 
> So this is just an example right?
> We are not defining any new properties or anything like that.

Yes it's just an example. The properties already exist but it's good to
describe how to put them together for this particular case, because
there isn't a precedent describing the topology for an IOMMU that
appears on the PCI bus.

> I think down the road for non dt platforms we want to put this
> info in the config space of the device. I do not think ACPI
> is the best option for this since not all systems have it.
> But that can wait.

There is the probe order problem - PCI needs this info before starting
to probe devices on the bus. Maybe we could store the info in a separate
memory region, that is referenced on the command-line and that the guest
can read early.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v8 4/7] PCI: OF: Initialize dev->fwnode appropriately

2019-05-30 Thread Jean-Philippe Brucker
For PCI devices that have an OF node, set the fwnode as well. This way
drivers that rely on fwnode don't need the special case described by
commit f94277af03ea ("of/platform: Initialise dev->fwnode appropriately").

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/of.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 73d5adec0a28..c4f1b5507b40 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -22,12 +22,15 @@ void pci_set_of_node(struct pci_dev *dev)
return;
dev->dev.of_node = of_pci_find_child_device(dev->bus->dev.of_node,
dev->devfn);
+   if (dev->dev.of_node)
+   dev->dev.fwnode = >dev.of_node->fwnode;
 }
 
 void pci_release_of_node(struct pci_dev *dev)
 {
of_node_put(dev->dev.of_node);
dev->dev.of_node = NULL;
+   dev->dev.fwnode = NULL;
 }
 
 void pci_set_bus_of_node(struct pci_bus *bus)
@@ -42,12 +45,15 @@ void pci_set_bus_of_node(struct pci_bus *bus)
bus->self->untrusted = true;
}
bus->dev.of_node = node;
+   if (node)
+   bus->dev.fwnode = >fwnode;
 }
 
 void pci_release_bus_of_node(struct pci_bus *bus)
 {
of_node_put(bus->dev.of_node);
bus->dev.of_node = NULL;
+   bus->dev.fwnode = NULL;
 }
 
 struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus)
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v8 6/7] iommu/virtio: Add probe request

2019-05-30 Thread Jean-Philippe Brucker
When the device offers the probe feature, send a probe request for each
device managed by the IOMMU. Extract RESV_MEM information. When we
encounter a MSI doorbell region, set it up as a IOMMU_RESV_MSI region.
This will tell other subsystems that there is no need to map the MSI
doorbell in the virtio-iommu, because MSIs bypass it.

Acked-by: Joerg Roedel 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 157 --
 include/uapi/linux/virtio_iommu.h |  36 +++
 2 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index b2719a87c3c5..5d4947c47420 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -49,6 +49,7 @@ struct viommu_dev {
u32 last_domain;
/* Supported MAP flags */
u32 map_flags;
+   u32 probe_size;
 };
 
 struct viommu_mapping {
@@ -71,8 +72,10 @@ struct viommu_domain {
 };
 
 struct viommu_endpoint {
+   struct device   *dev;
struct viommu_dev   *viommu;
struct viommu_domain*vdomain;
+   struct list_headresv_regions;
 };
 
 struct viommu_request {
@@ -125,6 +128,9 @@ static off_t viommu_get_write_desc_offset(struct viommu_dev 
*viommu,
 {
size_t tail_size = sizeof(struct virtio_iommu_req_tail);
 
+   if (req->type == VIRTIO_IOMMU_T_PROBE)
+   return len - viommu->probe_size - tail_size;
+
return len - tail_size;
 }
 
@@ -399,6 +405,110 @@ static int viommu_replay_mappings(struct viommu_domain 
*vdomain)
return ret;
 }
 
+static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
+  struct virtio_iommu_probe_resv_mem *mem,
+  size_t len)
+{
+   size_t size;
+   u64 start64, end64;
+   phys_addr_t start, end;
+   struct iommu_resv_region *region = NULL;
+   unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+   start = start64 = le64_to_cpu(mem->start);
+   end = end64 = le64_to_cpu(mem->end);
+   size = end64 - start64 + 1;
+
+   /* Catch any overflow, including the unlikely end64 - start64 + 1 = 0 */
+   if (start != start64 || end != end64 || size < end64 - start64)
+   return -EOVERFLOW;
+
+   if (len < sizeof(*mem))
+   return -EINVAL;
+
+   switch (mem->subtype) {
+   default:
+   dev_warn(vdev->dev, "unknown resv mem subtype 0x%x\n",
+mem->subtype);
+   /* Fall-through */
+   case VIRTIO_IOMMU_RESV_MEM_T_RESERVED:
+   region = iommu_alloc_resv_region(start, size, 0,
+IOMMU_RESV_RESERVED);
+   break;
+   case VIRTIO_IOMMU_RESV_MEM_T_MSI:
+   region = iommu_alloc_resv_region(start, size, prot,
+IOMMU_RESV_MSI);
+   break;
+   }
+   if (!region)
+   return -ENOMEM;
+
+   list_add(>resv_regions, >list);
+   return 0;
+}
+
+static int viommu_probe_endpoint(struct viommu_dev *viommu, struct device *dev)
+{
+   int ret;
+   u16 type, len;
+   size_t cur = 0;
+   size_t probe_len;
+   struct virtio_iommu_req_probe *probe;
+   struct virtio_iommu_probe_property *prop;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct viommu_endpoint *vdev = fwspec->iommu_priv;
+
+   if (!fwspec->num_ids)
+   return -EINVAL;
+
+   probe_len = sizeof(*probe) + viommu->probe_size +
+   sizeof(struct virtio_iommu_req_tail);
+   probe = kzalloc(probe_len, GFP_KERNEL);
+   if (!probe)
+   return -ENOMEM;
+
+   probe->head.type = VIRTIO_IOMMU_T_PROBE;
+   /*
+* For now, assume that properties of an endpoint that outputs multiple
+* IDs are consistent. Only probe the first one.
+*/
+   probe->endpoint = cpu_to_le32(fwspec->ids[0]);
+
+   ret = viommu_send_req_sync(viommu, probe, probe_len);
+   if (ret)
+   goto out_free;
+
+   prop = (void *)probe->properties;
+   type = le16_to_cpu(prop->type) & VIRTIO_IOMMU_PROBE_T_MASK;
+
+   while (type != VIRTIO_IOMMU_PROBE_T_NONE &&
+  cur < viommu->probe_size) {
+   len = le16_to_cpu(prop->length) + sizeof(*prop);
+
+   switch (type) {
+   case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+   ret = viommu_add_resv_mem(vdev, (void *)prop, len);
+   break;
+   default:
+   dev_err(dev, "unknown viommu prop 0x%x\n", type);
+  

[virtio-dev] [PATCH v8 7/7] iommu/virtio: Add event queue

2019-05-30 Thread Jean-Philippe Brucker
The event queue offers a way for the device to report access faults from
endpoints. It is implemented on virtqueue #1. Whenever the host needs to
signal a fault, it fills one of the buffers offered by the guest and
interrupts it.

Acked-by: Joerg Roedel 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c  | 115 +++---
 include/uapi/linux/virtio_iommu.h |  19 +
 2 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 5d4947c47420..2688cdcac6e5 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -29,7 +29,8 @@
 #define MSI_IOVA_LENGTH0x10
 
 #define VIOMMU_REQUEST_VQ  0
-#define VIOMMU_NR_VQS  1
+#define VIOMMU_EVENT_VQ1
+#define VIOMMU_NR_VQS  2
 
 struct viommu_dev {
struct iommu_device iommu;
@@ -41,6 +42,7 @@ struct viommu_dev {
struct virtqueue*vqs[VIOMMU_NR_VQS];
spinlock_t  request_lock;
struct list_headrequests;
+   void*evts;
 
/* Device configuration */
struct iommu_domain_geometrygeometry;
@@ -86,6 +88,15 @@ struct viommu_request {
charbuf[];
 };
 
+#define VIOMMU_FAULT_RESV_MASK 0xff00
+
+struct viommu_event {
+   union {
+   u32 head;
+   struct virtio_iommu_fault fault;
+   };
+};
+
 #define to_viommu_domain(domain)   \
container_of(domain, struct viommu_domain, domain)
 
@@ -509,6 +520,68 @@ static int viommu_probe_endpoint(struct viommu_dev 
*viommu, struct device *dev)
return ret;
 }
 
+static int viommu_fault_handler(struct viommu_dev *viommu,
+   struct virtio_iommu_fault *fault)
+{
+   char *reason_str;
+
+   u8 reason   = fault->reason;
+   u32 flags   = le32_to_cpu(fault->flags);
+   u32 endpoint= le32_to_cpu(fault->endpoint);
+   u64 address = le64_to_cpu(fault->address);
+
+   switch (reason) {
+   case VIRTIO_IOMMU_FAULT_R_DOMAIN:
+   reason_str = "domain";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_MAPPING:
+   reason_str = "page";
+   break;
+   case VIRTIO_IOMMU_FAULT_R_UNKNOWN:
+   default:
+   reason_str = "unknown";
+   break;
+   }
+
+   /* TODO: find EP by ID and report_iommu_fault */
+   if (flags & VIRTIO_IOMMU_FAULT_F_ADDRESS)
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u at %#llx 
[%s%s%s]\n",
+   reason_str, endpoint, address,
+   flags & VIRTIO_IOMMU_FAULT_F_READ ? "R" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_WRITE ? "W" : 
"",
+   flags & VIRTIO_IOMMU_FAULT_F_EXEC ? "X" : 
"");
+   else
+   dev_err_ratelimited(viommu->dev, "%s fault from EP %u\n",
+   reason_str, endpoint);
+   return 0;
+}
+
+static void viommu_event_handler(struct virtqueue *vq)
+{
+   int ret;
+   unsigned int len;
+   struct scatterlist sg[1];
+   struct viommu_event *evt;
+   struct viommu_dev *viommu = vq->vdev->priv;
+
+   while ((evt = virtqueue_get_buf(vq, )) != NULL) {
+   if (len > sizeof(*evt)) {
+   dev_err(viommu->dev,
+   "invalid event buffer (len %u != %zu)\n",
+   len, sizeof(*evt));
+   } else if (!(evt->head & VIOMMU_FAULT_RESV_MASK)) {
+   viommu_fault_handler(viommu, >fault);
+   }
+
+   sg_init_one(sg, evt, sizeof(*evt));
+   ret = virtqueue_add_inbuf(vq, sg, 1, evt, GFP_ATOMIC);
+   if (ret)
+   dev_err(viommu->dev, "could not add event buffer\n");
+   }
+
+   virtqueue_kick(vq);
+}
+
 /* IOMMU API */
 
 static struct iommu_domain *viommu_domain_alloc(unsigned type)
@@ -895,16 +968,35 @@ static struct iommu_ops viommu_ops = {
 static int viommu_init_vqs(struct viommu_dev *viommu)
 {
struct virtio_device *vdev = dev_to_virtio(viommu->dev);
-   const char *name = "request";
-   void *ret;
+   const char *names[] = { "request", "event" };
+   vq_callback_t *callbacks[] = {
+   NULL, /* No async requests */
+   viommu_event_handler,
+   };
 
-   ret = virtio_find_single_vq(vde

[virtio-dev] [PATCH v8 2/7] dt-bindings: virtio: Add virtio-pci-iommu node

2019-05-30 Thread Jean-Philippe Brucker
Some systems implement virtio-iommu as a PCI endpoint. The operating
system needs to discover the relationship between IOMMU and masters long
before the PCI endpoint gets probed. Add a PCI child node to describe the
virtio-iommu device.

The virtio-pci-iommu is conceptually split between a PCI programming
interface and a translation component on the parent bus. The latter
doesn't have a node in the device tree. The virtio-pci-iommu node
describes both, by linking the PCI endpoint to "iommus" property of DMA
master nodes and to "iommu-map" properties of bus nodes.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/iommu.txt  | 66 +++
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt

diff --git a/Documentation/devicetree/bindings/virtio/iommu.txt 
b/Documentation/devicetree/bindings/virtio/iommu.txt
new file mode 100644
index ..2407fea0651c
--- /dev/null
+++ b/Documentation/devicetree/bindings/virtio/iommu.txt
@@ -0,0 +1,66 @@
+* virtio IOMMU PCI device
+
+When virtio-iommu uses the PCI transport, its programming interface is
+discovered dynamically by the PCI probing infrastructure. However the
+device tree statically describes the relation between IOMMU and DMA
+masters. Therefore, the PCI root complex that hosts the virtio-iommu
+contains a child node representing the IOMMU device explicitly.
+
+Required properties:
+
+- compatible:  Should be "virtio,pci-iommu"
+- reg: PCI address of the IOMMU. As defined in the PCI Bus
+   Binding reference [1], the reg property is a five-cell
+   address encoded as (phys.hi phys.mid phys.lo size.hi
+   size.lo). phys.hi should contain the device's BDF as
+   0b  dfff . The other cells
+   should be zero.
+- #iommu-cells:Each platform DMA master managed by the IOMMU is 
assigned
+   an endpoint ID, described by the "iommus" property [2].
+   For virtio-iommu, #iommu-cells must be 1.
+
+Notes:
+
+- DMA from the IOMMU device isn't managed by another IOMMU. Therefore the
+  virtio-iommu node doesn't have an "iommus" property, and is omitted from
+  the iommu-map property of the root complex.
+
+Example:
+
+pcie@1000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+
+   /* The IOMMU programming interface uses slot 00:01.0 */
+   iommu0: iommu@0008 {
+   compatible = "virtio,pci-iommu";
+   reg = <0x0800 0 0 0 0>;
+   #iommu-cells = <1>;
+   };
+
+   /*
+* The IOMMU manages all functions in this PCI domain except
+* itself. Omit BDF 00:01.0.
+*/
+   iommu-map = <0x0  0x0 0x8>
+   <0x9  0x9 0xfff7>;
+};
+
+pcie@2000 {
+   compatible = "pci-host-ecam-generic";
+   ...
+   /*
+* The IOMMU also manages all functions from this domain,
+* with endpoint IDs 0x1 - 0x1
+*/
+   iommu-map = <0x0  0x1 0x1>;
+};
+
+ethernet@fe001000 {
+   ...
+   /* The IOMMU manages this platform device with endpoint ID 0x2 */
+   iommus = < 0x2>;
+};
+
+[1] Documentation/devicetree/bindings/pci/pci.txt
+[2] Documentation/devicetree/bindings/iommu/iommu.txt
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v8 5/7] iommu: Add virtio-iommu driver

2019-05-30 Thread Jean-Philippe Brucker
The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
requests such as map/unmap over virtio transport without emulating page
tables. This implementation handles ATTACH, DETACH, MAP and UNMAP
requests.

The bulk of the code transforms calls coming from the IOMMU API into
corresponding virtio requests. Mappings are kept in an interval tree
instead of page tables. A little more work is required for modular and x86
support, so for the moment the driver depends on CONFIG_VIRTIO=y and
CONFIG_ARM64.

Acked-by: Joerg Roedel 
Signed-off-by: Jean-Philippe Brucker 
---
 MAINTAINERS   |   7 +
 drivers/iommu/Kconfig |  11 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/virtio-iommu.c  | 934 ++
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 110 
 6 files changed, 1064 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 429c6c624861..62bd1834d95a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16807,6 +16807,13 @@ S: Maintained
 F: drivers/virtio/virtio_input.c
 F: include/uapi/linux/virtio_input.h
 
+VIRTIO IOMMU DRIVER
+M: Jean-Philippe Brucker 
+L: virtualizat...@lists.linux-foundation.org
+S: Maintained
+F: drivers/iommu/virtio-iommu.c
+F: include/uapi/linux/virtio_iommu.h
+
 VIRTUAL BOX GUEST DEVICE DRIVER
 M: Hans de Goede 
 M: Arnd Bergmann 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 83664db5221d..e15cdcd8cb3c 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -473,4 +473,15 @@ config HYPERV_IOMMU
  Stub IOMMU driver to handle IRQs as to allow Hyper-V Linux
  guests to run with x2APIC mode enabled.
 
+config VIRTIO_IOMMU
+   bool "Virtio IOMMU driver"
+   depends on VIRTIO=y
+   depends on ARM64
+   select IOMMU_API
+   select INTERVAL_TREE
+   help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 8c71a15e986b..f13f36ae1af6 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -33,3 +33,4 @@ obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index ..b2719a87c3c5
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,934 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * Copyright (C) 2019 Arm Limited
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define MSI_IOVA_BASE  0x800
+#define MSI_IOVA_LENGTH0x10
+
+#define VIOMMU_REQUEST_VQ  0
+#define VIOMMU_NR_VQS  1
+
+struct viommu_dev {
+   struct iommu_device iommu;
+   struct device   *dev;
+   struct virtio_device*vdev;
+
+   struct ida  domain_ids;
+
+   struct virtqueue*vqs[VIOMMU_NR_VQS];
+   spinlock_t  request_lock;
+   struct list_headrequests;
+
+   /* Device configuration */
+   struct iommu_domain_geometrygeometry;
+   u64 pgsize_bitmap;
+   u32 first_domain;
+   u32 last_domain;
+   /* Supported MAP flags */
+   u32 map_flags;
+};
+
+struct viommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node   iova;
+   u32 flags;
+};
+
+struct viommu_domain {
+   struct iommu_domain domain;
+   struct viommu_dev   *viommu;
+   struct mutexmutex; /* protects viommu pointer */
+   unsigned intid;
+   u32 map_flags;
+
+   spinlock_t  mappings_lock;
+   struct rb_root_cached   mappings;
+
+   unsigned long   nr_endpoints;
+};
+
+struct viommu_endpoint {
+   struct viommu_dev   *viommu;
+   struct viommu_domain*vdomain;
+};
+
+struct viommu_request {
+   struct list_headlist;
+   void*writeback;
+   unsigned int

[virtio-dev] [PATCH v8 1/7] dt-bindings: virtio-mmio: Add IOMMU description

2019-05-30 Thread Jean-Philippe Brucker
The nature of a virtio-mmio node is discovered by the virtio driver at
probe time. However the DMA relation between devices must be described
statically. When a virtio-mmio node is a virtio-iommu device, it needs an
"#iommu-cells" property as specified by bindings/iommu/iommu.txt.

Otherwise, the virtio-mmio device may perform DMA through an IOMMU, which
requires an "iommus" property. Describe these requirements in the
device-tree bindings documentation.

Reviewed-by: Rob Herring 
Reviewed-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
---
 .../devicetree/bindings/virtio/mmio.txt   | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt 
b/Documentation/devicetree/bindings/virtio/mmio.txt
index 5069c1b8e193..21af30fbb81f 100644
--- a/Documentation/devicetree/bindings/virtio/mmio.txt
+++ b/Documentation/devicetree/bindings/virtio/mmio.txt
@@ -8,10 +8,40 @@ Required properties:
 - reg: control registers base address and size including configuration 
space
 - interrupts:  interrupt generated by the device
 
+Required properties for virtio-iommu:
+
+- #iommu-cells:When the node corresponds to a virtio-iommu device, it 
is
+   linked to DMA masters using the "iommus" or "iommu-map"
+   properties [1][2]. #iommu-cells specifies the size of the
+   "iommus" property. For virtio-iommu #iommu-cells must be
+   1, each cell describing a single endpoint ID.
+
+Optional properties:
+
+- iommus:  If the device accesses memory through an IOMMU, it should
+   have an "iommus" property [1]. Since virtio-iommu itself
+   does not access memory through an IOMMU, the "virtio,mmio"
+   node cannot have both an "#iommu-cells" and an "iommus"
+   property.
+
 Example:
 
virtio_block@3000 {
compatible = "virtio,mmio";
reg = <0x3000 0x100>;
interrupts = <41>;
+
+   /* Device has endpoint ID 23 */
+   iommus = < 23>
}
+
+   viommu: iommu@3100 {
+   compatible = "virtio,mmio";
+   reg = <0x3100 0x100>;
+   interrupts = <42>;
+
+   #iommu-cells = <1>
+   }
+
+[1] Documentation/devicetree/bindings/iommu/iommu.txt
+[2] Documentation/devicetree/bindings/pci/pci-iommu.txt
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v8 0/7] Add virtio-iommu driver

2019-05-30 Thread Jean-Philippe Brucker
Implement the virtio-iommu driver, following specification v0.12 [1].
Since last version [2] we've worked on improving the specification,
which resulted in the following changes to the interface:
* Remove the EXEC flag.
* Add feature bit for the MMIO flag.
* Change domain_bits to domain_range.

Given that there were small changes to patch 5/7, I removed the review
and test tags. Please find the code at [3].

[1] Virtio-iommu specification v0.12, sources and pdf
git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.12
http://jpbrucker.net/virtio-iommu/spec/v0.12/virtio-iommu-v0.12.pdf

http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-dev-diff-v0.11-v0.12.pdf

[2] [PATCH v7 0/7] Add virtio-iommu driver

https://lore.kernel.org/linux-pci/0ba215f5-e856-bf31-8dd9-a85710714...@arm.com/T/

[3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.12
git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.12

Jean-Philippe Brucker (7):
  dt-bindings: virtio-mmio: Add IOMMU description
  dt-bindings: virtio: Add virtio-pci-iommu node
  of: Allow the iommu-map property to omit untranslated devices
  PCI: OF: Initialize dev->fwnode appropriately
  iommu: Add virtio-iommu driver
  iommu/virtio: Add probe request
  iommu/virtio: Add event queue

 .../devicetree/bindings/virtio/iommu.txt  |   66 +
 .../devicetree/bindings/virtio/mmio.txt   |   30 +
 MAINTAINERS   |7 +
 drivers/iommu/Kconfig |   11 +
 drivers/iommu/Makefile|1 +
 drivers/iommu/virtio-iommu.c  | 1176 +
 drivers/of/base.c |   10 +-
 drivers/pci/of.c  |6 +
 include/uapi/linux/virtio_ids.h   |1 +
 include/uapi/linux/virtio_iommu.h |  165 +++
 10 files changed, 1470 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/virtio/iommu.txt
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v8 3/7] of: Allow the iommu-map property to omit untranslated devices

2019-05-30 Thread Jean-Philippe Brucker
In PCI root complex nodes, the iommu-map property describes the IOMMU that
translates each endpoint. On some platforms, the IOMMU itself is presented
as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). This isn't supported
by the current OF driver, which expects all endpoints to have an IOMMU.
Allow the iommu-map property to have gaps.

Relaxing of_map_rid() also allows the msi-map property to have gaps, which
is invalid since MSIs always reach an MSI controller. In that case
pci_msi_setup_msi_irqs() will return an error when attempting to find the
device's MSI domain.

Reviewed-by: Rob Herring 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/of/base.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 20e0e7ee4edf..55e7f5bb0549 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -2294,8 +2294,12 @@ int of_map_rid(struct device_node *np, u32 rid,
return 0;
}
 
-   pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
-   np, map_name, rid, target && *target ? *target : NULL);
-   return -EFAULT;
+   pr_info("%pOF: no %s translation for rid 0x%x on %pOF\n", np, map_name,
+   rid, target && *target ? *target : NULL);
+
+   /* Bypasses translation */
+   if (id_out)
+   *id_out = rid;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(of_map_rid);
-- 
2.21.0


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v4] Add virtio-iommu device specification

2019-05-29 Thread Jean-Philippe Brucker
The IOMMU device allows a guest to manage DMA mappings for physical,
emulated and paravirtualized endpoints. Add device description for the
virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
UNMAP requests, as well as translation error reporting.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
Signed-off-by: Jean-Philippe Brucker 
---
This version addresses comments on v3 [1]:
* remove VIRTIO_IOMMU_MAP_F_EXEC for now,
* add VIRTIO_IOMMU_F_MMIO feature bit,
* change domain_bits -> domain_range,
* and countless wording improvements.

To help with the review, there is a diff since last version at [2] and
the current version at [3]. I'll send an updated version of the driver
to the list this week.

[1] https://lists.oasis-open.org/archives/virtio-comment/201904/msg8.html
[2] 
http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-dev-diff-v0.11-v0.12.pdf
[3] http://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.12-draft.pdf
---
 conformance.tex  |  40 ++-
 content.tex  |   1 +
 virtio-iommu.tex | 816 +++
 3 files changed, 855 insertions(+), 2 deletions(-)
 create mode 100644 virtio-iommu.tex

diff --git a/conformance.tex b/conformance.tex
index 42f702a..79a3e7d 100644
--- a/conformance.tex
+++ b/conformance.tex
@@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / 
Conformance Targets}
   \begin{itemize}
 \item Clause \ref{sec:Conformance / Driver Conformance}.
 \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O Driver 
Conformance}.
-\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / Socket Driver 
Conformance}.
+\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto Driver 
Conformance}, \ref{sec:Conformance / Driver Conformance / Socket Driver 
Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU Driver 
Conformance}.
 \item Clause \ref{sec:Conformance / Legacy Interface: Transitional Device 
and Transitional Driver Conformance}.
   \end{itemize}
 \item[Device] A device MUST conform to four conformance clauses:
   \begin{itemize}
 \item Clause \ref{sec:Conformance / Device Conformance}.
 \item One of clauses \ref{sec:Conformance / Device Conformance / PCI 
Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device 
Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O Device 
Conformance}.
-\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory 
Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI 
Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto Device 
Conformance} or \ref{sec:Conformance / Device Conformance / Socket Device 
Conformance}.
+\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory 
Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI 
Host Device Conformance}, \ref{sec:Conformance / Dev

[virtio-dev] Re: [PATCH v7 0/7] Add virtio-iommu driver

2019-05-28 Thread Jean-Philippe Brucker
On 27/05/2019 16:15, Michael S. Tsirkin wrote:
> On Mon, May 27, 2019 at 11:26:04AM +0200, Joerg Roedel wrote:
>> On Sun, May 12, 2019 at 12:31:59PM -0400, Michael S. Tsirkin wrote:
>>> OK this has been in next for a while.
>>>
>>> Last time IOMMU maintainers objected. Are objections
>>> still in force?
>>>
>>> If not could we get acks please?
>>
>> No objections against the code, I only hesitated because the Spec was
>> not yet official.
>>
>> So for the code:
>>
>>  Acked-by: Joerg Roedel 
> 
> Last spec patch had a bunch of comments not yet addressed.
> But I do not remember whether comments are just about wording
> or about the host/guest interface as well.
> Jean-Philippe could you remind me please?

It's mostly wording, but there is a small change in the config space
layout and two new feature bits. I'll send a new version of the driver
when possible.

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification

2019-05-16 Thread Jean-Philippe Brucker
On 15/05/2019 06:30, Tian, Kevin wrote:
>> From: Tian, Kevin
>> Sent: Wednesday, May 15, 2019 1:08 PM
>> To: Jean-Philippe Brucker ; Michael S.
>> Tsirkin 
>>>
>>>
>>>>>>> +\begin{description}
>>>>>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
>>>>>>> +Accesses to virtual addresses in this region have undefined
>> behavior.
>>>>>>
>>>>>> undefined?  how can platforms with such a property coexist with
>>> untrusted
>>>>>> devices?
>>>>>
>>>>> It is "undefined" because I can't enumerate all cases (don't know about
>>>>> all of them). This would contain for example the RMRR regions of VT-d
>>>>> (the BIOS reserving some DMA regions for itself), regions allocated for
>>>>> host-managed MSIs on arm platforms, and some PCI bridge windows
>>>>> (although they removed those from the Linux resv regions recently, not
>>>>> certain why). Ideally we wouldn't have any, but some special cases make
>>>>> it necessary.
> 
> +Alex.
> 
> Actually I don't think RMRR is right example here (sorry if it's my input in
> our previous discussion). There was concern on RMRR providing a shared
> memory region between device and platform controller, thus may cause
> isolation issue:
>   https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
> 
> So the basic policy for devices with RMRR is, excluding them from device
> assignment, except for Intel GPU and USB. The latter two has well-understood
> behavior on RMRR, thus they can be safely assigned with RMRR ignored
> (i.e. RMRR region can be reused as valid IOVA region).  

Ah I see, seems like a good policy.

> As I commented below, I think it's safe to describe reserved region behavior
> as either succeed with undefined behavior (e.g. in ARM MSI doorbell case, 
> where a spurious interrupt on allocated doorbell for this device is triggered 
> instead of desired DMA access), or cause vIOMMU unrecoverable fault due 
> to invalid mappings. An untrusted device can never use reserved region to 
> escape since physically IOMMU only contains verified mapping to granted
> resource (MSI) or nothing.

Yes, I will add that as a requirement in next version: reserved regions
are okay as long as they don't allow a device to escape isolation.
Finding the right wording without being too vague is a bit tough for
this one.

>>>>>
>>>>> In general having reserved regions is incompatible with untrusted
>>>>> devices, and in some cases incompatible with untrusted guests. But
>>>>> choosing the policy about which devices to present to guest is up to the
>>>>> platform. The host knows what the resv regions are used for, and if they
>>>>> are safe enough. The guest just need to knows what to avoid.
>>>>
>>>> Yes but ... if you trust driver and device then what's the
>>>> point of the iommu?
>>>
>>> Improving memory allocation. The guest can do scatter-gather for large
>>> buffers, instead of allocating big contiguous chunks of guest-physical
>>> addresses. And with device pass-through, any memory used for DMA has to
>>> be pinned (since devices generally don't support page faults). So when
>>> assigning an endpoint to a guest, without a vIOMMU all of the guest
>>> memory has to be pinned upfront. With an IOMMU only the memory that
>> will
>>> actually be used for DMA is pinned.
>>
>> I'd not vote using current vIOMMU just for avoiding pin instead of for
>> isolation purpose... the map/unmap overhead is high except you only
>> talking about static mapping e.g. in guest user space driver (e.g. DPDK)
>> (but then it's actually for isolation purpose between user and kernel).
>>  btw we are working on a lighter-weight approach (aim <5% perf drop
>> for most cases) while allowing host to fully introspect guest DMA activities.
>> Likely we'll introduce a new virtio-iommu capability for this purpose,
>> instead of using current MAP/UNMAP primitives. Stay tuned and we'll
>> send out RFC soon once getting good result. :-)

Sounds great, I'd like to hear more about this. I also have some drafts
for performance improvement and SVA, but haven't found time to progress
on that in a while. For SVA however, introspection seems nearly
impossible (on Arm the host may not even see invalidation commands from
the guest).

Thanks,
Jean

>>
>> And I don't think having reserved regions is incompatible with untru

Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification

2019-05-14 Thread Jean-Philippe Brucker
On 10/05/2019 17:52, Michael S. Tsirkin wrote:
 +/* Flags are: */
 +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
 +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
 +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
>>>
>>> what is exec? pls add some comments near all flags.
>>
>> Exec means instruction fetch. Some systems have different transaction
>> flags for read and exec (PCI only supports that in the PASID TLP prefix)
>> and some page tables have a "no exec" bit.
> 
> So let's say I don't set EXEC on such a system. Is EXEC allowed or not?
> 
> Maybe we can ask user to always set EXEC is exec is needed,
> but then say we can not promise instruction fetch will fail
> if exec is not set.

So the semantics could be:
* device sets EXEC feature bit
  - driver sets EXEC flag in MAP, instruction fetch succeeds
  - driver doesn't set EXEC flag in MAP, instructions fetch will fail
* device doesn't set EXEC feature bit, then driver cannot set EXEC flag
but instruction fetch may not fail.

But I'm tempted to just remove the EXEC flag for now, I don't think
anyone needs it at the moment (and VFIO doesn't support it). I'd add it
back as NO_EXEC which is closer to what IOMMUs offer.


 +\begin{description}
 +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
 +Accesses to virtual addresses in this region have undefined behavior.
>>>
>>> undefined?  how can platforms with such a property coexist with untrusted
>>> devices?
>>
>> It is "undefined" because I can't enumerate all cases (don't know about
>> all of them). This would contain for example the RMRR regions of VT-d
>> (the BIOS reserving some DMA regions for itself), regions allocated for
>> host-managed MSIs on arm platforms, and some PCI bridge windows
>> (although they removed those from the Linux resv regions recently, not
>> certain why). Ideally we wouldn't have any, but some special cases make
>> it necessary.
>>
>> In general having reserved regions is incompatible with untrusted
>> devices, and in some cases incompatible with untrusted guests. But
>> choosing the policy about which devices to present to guest is up to the
>> platform. The host knows what the resv regions are used for, and if they
>> are safe enough. The guest just need to knows what to avoid.
> 
> Yes but ... if you trust driver and device then what's the
> point of the iommu?

Improving memory allocation. The guest can do scatter-gather for large
buffers, instead of allocating big contiguous chunks of guest-physical
addresses. And with device pass-through, any memory used for DMA has to
be pinned (since devices generally don't support page faults). So when
assigning an endpoint to a guest, without a vIOMMU all of the guest
memory has to be pinned upfront. With an IOMMU only the memory that will
actually be used for DMA is pinned.

> RMRR access basically fails with translation right?
> Not undefined.

>From section 3.15 of VT-d rev3.0, it looks like they are translated

"Requests to these reserved regions may either occur as a result of
operations performed by the system software driver (for example in the
case of DMA from unified memory access (UMA) graphics controllers to
graphics reserved memory), or may be initiated by non system software
(for example in case of DMA performed by a USB controller under BIOS SMM
control for legacy keyboard emulation)."

> What happens with MSIs on ARM?

On Arm platforms it's the IRQ chip rather than the IOMMU that performs
MSI isolation - ensure that an endpoint doesn't trigger MSIs for another
endpoint. The SMMU doesn't differentiate an MSI from a normal write,
unlike x86 which has a special address range. MSI-X tables for a
pass-through device are managed by the host, and a virtual MSI-X table
is presented to the guest. So the host allocates a VA range and use it
to map the MSI doorbell in the SMMU. It then reports that VA range as
reserved to the guest. If the endpoint did write to the region, nothing
bad would happen, it would simply trigger an MSI as if it had triggered
via the MSI-X table. But the guest can't use that region for normal
memory mapping.

 or never even reach it.
>>>
>>> that's ok
>>
>> We could be losing isolation if some bridge is intercepting accesses to
>> a range of addresses.
> 
> hmm sorry not sure I understand.

After reading a bit more about this, I don't think this is relevant here.

This was about lack of ACS isolation allowing endpoints to do p2p by
writing to some specific MMIO region, but we already deal with this in
the ATTACH section - either endpoints are properly isolated with ACS, or
the platform describes them as being in the same IOMMU group and they
cannot be isolated from each other. We don't need to care about reserved
MMIO ranges on top of that.

RESV_MEM is used for MSI and RMRR-style regions for now, so I'll
reformulate this paragraph. I'll also require that accesses to RESV
regions don't affect anything else than the endpoint and the SW that
owns it.


 +The driver SHOULD treat 

Re: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification

2019-05-10 Thread Jean-Philippe Brucker
Thanks for taking the time to review it! For brevity I only replied to
questions and the bits I don't completely agree with, I will change the
rest.

On 03/05/2019 18:05, Michael S. Tsirkin wrote:
> On Tue, Apr 30, 2019 at 02:56:48PM +0100, Jean-Philippe Brucker wrote:
>> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
[...]
>> +\item[VIRTIO_IOMMU_F_DOMAIN_BITS (1)]
>> +  The number of domains supported is described in \field{domain_bits}
> 
> I would rename this to domain_range or something like this.
> E.g. there's an option of 0 which is special.

I don't see a use for it right now, but I can make this a range.

>> +\drivernormative{\subsubsection}{Feature bits}{Device Types / IOMMU Device 
>> / Feature bits}
>> +
>> +The driver SHOULD accept any of the VIRTIO_IOMMU_F_INPUT_RANGE,
>> +VIRTIO_IOMMU_F_DOMAIN_BITS, VIRTIO_IOMMU_F_MAP_UNMAP and
>> +VIRTIO_IOMMU_F_PROBE feature bits if offered by the device.
> 
> Why does it have to accept MAP_UNMAP if it does not want to use it?

I was thinking a future version that adds another mode would remove this
statement. I can remove it, it's clear enough that currently the driver
won't get anything if it doesn't accept the bit.

>> +If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all endpoints can
>> +access guest-physical addresses ("bypass mode").
> 
> I think this means actually "all unattached endpoints"?
> 
> 
> So how does this interact with e.g. encrypted memory?
> I think it's actually weaker. This bit IMHO just means that
> all accesses are allowed by the iommu and all addresses
> are translated 1:1.

Yes that's what it means, I'll try to clarify the description. I haven't
looked at memory encryption yet, I guess it's disabled until you attach
a domain, and then you need to pass keys in an extended MAP request.

> Also what about other endpoints? With bypass
> does an endpoint not in any domain get access to all other
> endpoints?

Note that it's guest-physical addresses, not physical memory. If other
devices are memory-mapped, then with bypass an endpoint can access them.

>> If the feature is not
>> +negotiated, then any memory access from endpoints will fault.
> 
> fault -> fail?
> 
> So all accesses are disallowed, right?

Yes (Minus reserved regions in some cases. More below)

I rewrote the paragraph as:
"If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, all accesses from
unattached endpoints are allowed and translated by the IOMMU using the
identity function. If the feature is not negotiated, any memory access
from an unattached endpoint will fail. Upon attaching an endpoint in
bypass mode to a new domain, any memory access from the endpoint fails,
since the domain does not contain any mapping."

>> +
>> +\begin{itemize}
>> +\item \field{page_size_mask} contains the bitmask of all page sizes that
>> +  can be mapped. The least significant bit set defines the page
>> +  granularity of IOMMU mappings.
> 
> so this refers to map/unmap requests only really?

Yes

>> Other bits in the mask are hints
>> +  describing page sizes that the IOMMU can merge into a single mapping
>> +  (page blocks).
> 
> what does this merging refer to? I don't find it anywhere in the spec.

It doesn't refer to anything else in the spec, the current wording isn't
great. When the host stores mappings in page tables, it's useful for the
guest to know which huge page sizes are available. For example if the
physical IOMMU can map 2M and 1G blocks in addition to 4k pages, then
the mask is 0x40201000. The hints allow a guest to optimize DMA
allocation (iommu_dma_alloc() in Linux), in order to minimize the number
of IOTLB entries used in hardware.

I now have the following:
"Other bits in \field{page_size_mask} are hints and describe larger page
sizes that the IOMMU device handles efficiently. For example, when the
device stores mappings using a page table tree, it may be able to
describe large mappings using a few leaf entries in intermediate tables,
rather than using lots of entries in the last level of the tree.
Creating mappings aligned on large page size can improve performance
since they require fewer page tables and TLB entries."

>> +  The smallest page granularity supported by the IOMMU is one byte. It is
>> +  legal for the driver to map one byte at a time if bit 0 of
>> +  \field{page_size_mask} is set.
> 
> With single byte mappings, can't guest quickly exhaust
> device memory? what guidance can we provide for handling
> this kind of issue? how do device and driver avoid it?

The advantage of a single-byte granularity is arbitrary mapping sizes,
it doesn't mean that the driver will use any single-byte mapping. If the
host uses page tables to store mappings then the granularity is
generally 4kB.

Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification

2019-05-05 Thread Jean-Philippe Brucker
On 03/05/2019 20:47, Michael S. Tsirkin wrote:
> On Tue, Apr 30, 2019 at 03:42:14PM +0100, Jean-Philippe Brucker wrote:
>> On 30/04/2019 14:56, Jean-Philippe Brucker wrote:
>> > The IOMMU device allows a guest to manage DMA mappings for physical,
>> > emulated and paravirtualized endpoints. Add device description for the
>> > virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
>> > UNMAP requests, as well as translation error reporting.
>> > 
>> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
>> > Signed-off-by: Jean-Philippe Brucker 
>> 
>> I'd like to request a vote for this version (the github issue is now up
>> to date and the patch applies cleanly)
>> 
>> Thanks,
>> Jean
> 
> I sent some comments that might be worth addressing before the vote.  If
> you don't agree I'm fine with starting voting now if you like and we'll
> let the TC decide, pls let me know.

Thanks for the review, I agree with most comments. I'll reply this week
and then send a new version, so you can hold off the vote

Thanks,
Jean

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] [PATCH v3] Add virtio-iommu device specification

2019-05-02 Thread Jean-Philippe Brucker
On 30/04/2019 14:56, Jean-Philippe Brucker wrote:
> The IOMMU device allows a guest to manage DMA mappings for physical,
> emulated and paravirtualized endpoints. Add device description for the
> virtio-iommu device and driver. Introduce PROBE, ATTACH, DETACH, MAP and
> UNMAP requests, as well as translation error reporting.
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/37
> Signed-off-by: Jean-Philippe Brucker 

I'd like to request a vote for this version (the github issue is now up
to date and the patch applies cleanly)

Thanks,
Jean

> ---
> Since v2 I rebased onto virtio v1.1-wd02, fixing a conflict in
> conformance.tex and using the new \conformance command.
> 
> A PDF version is available at
> https://jpbrucker.net/virtio-iommu/spec/virtio-v1.1-wd02-iommu-0.11-draft.pdf
> ---
>  conformance.tex  |  40 ++-
>  content.tex  |   1 +
>  virtio-iommu.tex | 850 +++
>  3 files changed, 889 insertions(+), 2 deletions(-)
>  create mode 100644 virtio-iommu.tex
> 
> diff --git a/conformance.tex b/conformance.tex
> index 42f702a..79a3e7d 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -15,14 +15,14 @@ \section{Conformance Targets}\label{sec:Conformance / 
> Conformance Targets}
>\begin{itemize}
>  \item Clause \ref{sec:Conformance / Driver Conformance}.
>  \item One of clauses \ref{sec:Conformance / Driver Conformance / PCI 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / MMIO Driver 
> Conformance} or \ref{sec:Conformance / Driver Conformance / Channel I/O 
> Driver Conformance}.
> -\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
> Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
> Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto 
> Driver Conformance} or \ref{sec:Conformance / Driver Conformance / Socket 
> Driver Conformance}.
> +\item One of clauses \ref{sec:Conformance / Driver Conformance / Network 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Block Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Console Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Entropy Driver 
> Conformance}, \ref{sec:Conformance / Driver Conformance / Traditional Memory 
> Balloon Driver Conformance}, \ref{sec:Conformance / Driver Conformance / SCSI 
> Host Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Input 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Crypto 
> Driver Conformance}, \ref{sec:Conformance / Driver Conformance / Socket 
> Driver Conformance} or \ref{sec:Conformance / Driver Conformance / IOMMU 
> Driver Conformance}.
>  \item Clause \ref{sec:Conformance / Legacy Interface: Transitional 
> Device and Transitional Driver Conformance}.
>\end{itemize}
>  \item[Device] A device MUST conform to four conformance clauses:
>\begin{itemize}
>  \item Clause \ref{sec:Conformance / Device Conformance}.
>  \item One of clauses \ref{sec:Conformance / Device Conformance / PCI 
> Device Conformance}, \ref{sec:Conformance / Device Conformance / MMIO Device 
> Conformance} or \ref{sec:Conformance / Device Conformance / Channel I/O 
> Device Conformance}.
> -\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
> Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
> Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
> Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
> Conformance}, \ref{sec:Conformance / Device Conformance / Traditional Memory 
> Balloon Device Conformance}, \ref{sec:Conformance / Device Conformance / SCSI 
> Host Device Conformance}, \ref{sec:Conformance / Device Conformance / Input 
> Device Conformance}, \ref{sec:Conformance / Device Conformance / Crypto 
> Device Conformance} or \ref{sec:Conformance / Device Conformance / Socket 
> Device Conformance}.
> +\item One of clauses \ref{sec:Conformance / Device Conformance / Network 
> Device Conformance}, \ref{sec:Conformance / Device Conformance / Block Device 
> Conformance}, \ref{sec:Conformance / Device Conformance / Console Device 
> Conformance}, \ref{sec:Conformance / Device Conformance / Entropy Device 
> Conformance}, \ref{sec:Conforma

  1   2   3   >