Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Paolo Bonzini

On 04/06/21 19:22, Jason Gunthorpe wrote:

  4) The KVM interface is the very simple enable/disable WBINVD.
 Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
 to enable WBINVD at KVM.


The KVM interface is the same kvm-vfio device that exists already.  The 
userspace API does not need to change at all: adding one VFIO file 
descriptor with WBINVD enabled to the kvm-vfio device lets the VM use 
WBINVD functionality (see kvm_vfio_update_coherency).


Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. 
But it seems useless complication compared to just using what we have 
now, at least while VMs only use IOASIDs via VFIO.


Either way, there should be no policy attached to the add/delete 
operations.  KVM users want to add the VFIO (or IOASID) file descriptors 
to the device independent of WBINVD.  If userspace wants/needs to apply 
its own policy on whether to enable WBINVD or not, they can do it on the 
VFIO/IOASID side:



 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
it communicates its no-snoop configuration:
 - 0 enable, allow WBINVD
 - 1 automatic disable, block WBINVD if the platform
   IOMMU can police it (what we do today)
 - 2 force disable, do not allow WBINVD ever


Though, like Alex, it's also not clear to me whether force-disable is 
useful.  Instead userspace can query the IOMMU or the device to ensure 
it's not enabled.


Paolo

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, June 4, 2021 8:34 PM
> 
> On Fri, Jun 04, 2021 at 06:08:28AM +, Tian, Kevin wrote:
> 
> > In Qemu case the problem is that it doesn't know the list of devices
> > that will be attached to an IOASID when it's created. This is a guest-
> > side knowledge which is conveyed one device at a time to Qemu
> > through vIOMMU.
> 
> At least for the guest side it is a lot simpler because the vIOMMU
> being emulated will define nearly everything.
> 
> qemu will just have to ask the kernel for whatever it is the guest is
> doing. If the kernel can't do it then qemu has to SW emulate.
> 
> The no-snoop block may be the only thing that is under qemu's control
> because it is transparent to the guest.
> 
> This will probably become clearer as people start to define what the
> get_info should return.
> 

Sure. Just to clarify, my comment was about "Perhaps creating an
IOASID should pass in a list of the device labels that the IOASID will
be used with". My point is that Qemu doesn't know this before the
guest completes binding the page table to all relevant devices, while
the IOASID must be created when the table is bound to the first device.
So Qemu just needs to create the IOASID with the format required by the
current device; incompatibility will be detected when attaching other
devices later.

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, June 4, 2021 8:09 PM
> 
> On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > > >
> > > > > 3) Device accepts any PASIDs from the guest. No
> > > > >vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > 4) Device accepts any PASID from the guest and has an
> > > > >internal vPASID/pPASID translation (enhanced vfio_pci)
> > > >
> > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > which doesn't support ENQCMD
> > >
> > > This is a vfio_pci that mediates some element of the device interface
> > > to communicate the vPASID/pPASID table to the device, using Max's
> > > series for vfio_pci drivers to inject itself into VFIO.
> > >
> > > For instance a device might send a message through the PF that the VF
> > > has a certain vPASID/pPASID translation table. This would be useful
> > > for devices that cannot use ENQCMD but still want to support migration
> > > and thus need vPASID.
> >
> > I still don't quite get it. If it's a PCI device, why is PASID translation
> > required?
> > Just delegate the per-RID PASID space to the user as type-3; then migrating
> > the vPASID space is straightforward.
> 
> This is only possible if we get rid of the global pPASID allocation
> (honestly is my preference as it makes the HW a lot simpler)
> 

In this proposal global vs. per-RID allocation is a per-device policy.
vfio-pci can always use per-RID allocation (regardless of whether the
device is partially mediated or not) with no vPASID/pPASID conversion.
Even for mdev without ENQCMD we can still use per-RID allocation.
Only for mdev with ENQCMD do we need global pPASID allocation.

I think this matches the motivation you explained earlier: it's not good
to have one global PASID allocator in the kernel; per-RID vs. global
should be selected per device.

Thanks
Kevin


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe  wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > > I don't want a security proof myself; I want to trust VFIO to make 
> > > > > the right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > 
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a 
> > > > > slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not 
> > > > > a way
> > > > > to gain extra privilege.  
> > > > 
> > > > Okay, fine, lets turn the question on its head then.
> > > > 
> > > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > > 
> > > > So, under what conditions do we want to allow VFIO to give a process
> > > > elevated access to the CPU:  
> > > 
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal (was 
> > > it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> > 
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> > 
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider.. 
> > 
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> > 
> > Alex, how about a more fleshed out suggestion:
> > 
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> 
> Communicates to whom?

To the /dev/iommu FD which will have to maintain a list of devices
attached to it internally.

> >  - 0 enable, allow WBINVD
> >  - 1 automatic disable, block WBINVD if the platform
> >IOMMU can police it (what we do today)
> >  - 2 force disable, do not allow WBINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  

Here I am outlining the choices and also imagining we might want an
admin knob to select among the three.

> If we're putting the policy decision in the hands of userspace they
> should have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").

There are two parts here, as Paolo was leading to: whether the process
has access to WBINVD, and then whether such an allowed process tells
KVM to turn on WBINVD in the guest.

If the process has a device and it has a way to create a non-coherent
IOASID, then that process has access to WBINVD.

For security it doesn't matter if the process actually creates the
non-coherent IOASID or not. An attacker will simply do the steps that
give access to WBINVD.

The important detail is that access to WBINVD does not compel the
process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
can still choose to create a secure guest by always using IOMMU_CACHE
in its page tables and not asking KVM to enable WBINVD.

This proposal shifts this policy decision from the kernel to userspace.
qemu is responsible for determining whether KVM should enable wbinvd,
based on whether it was able to create IOASIDs with IOMMU_CACHE.

> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.

Right, this is exactly the point. Since the user gets to pick whether
the IOASID is coherent or not, an attacker can always reach WBINVD
using only the device FD. Additional checks don't add to the security
of the process.

The additional checks you are describing add to the security of the
guest; however, qemu is capable of doing them without more help from
the kernel.

It is the strength of Paolo's model that KVM should be able to do
optionally less, but never more, than the process itself can do.

> > It is pretty simple from a /dev/ioasid perspective, covers today's
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matte

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Alex Williamson
On Fri, 4 Jun 2021 09:13:37 -0300
Jason Gunthorpe  wrote:

> On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:
> 
> > Could you clarify "vfio_driver"?
> 
> This is the thing providing the vfio_device_ops function pointers.
> 
> So vfio-pci can't know anything about this (although your no-snoop
> control probing idea makes sense to me)
> 
> But vfio_mlx5_pci can know
> 
> So can mdev_idxd
> 
> And kvmgt

A capability on VFIO_DEVICE_GET_INFO could provide a hint to userspace.
Stock vfio-pci could fill it out to the extent of advertising whether
the device is capable of non-coherent DMA based on the Enable No-snoop
probing; the device-specific vfio_drivers could set it based on
knowledge of the device behavior.  Another bit might indicate a
preference to not suppress non-coherent DMA at the IOMMU.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Alex Williamson
On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe  wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > I don't want a security proof myself; I want to trust VFIO to make the 
> > > > right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > 
> > > > Given how KVM is just a device driver inside Linux, VMs should be a 
> > > > slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a 
> > > > way
> > > > to gain extra privilege.  
> > > 
> > > Okay, fine, lets turn the question on its head then.
> > > 
> > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > > 
> > > So, under what conditions do we want to allow VFIO to give a process
> > > elevated access to the CPU:  
> > 
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.
> 
> At the end of the day we need an ioctl with two arguments:
>  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
>  - The KVM FD to control wbinvd support on
> 
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives, but we have these obnoxious cross module dependencies to
> consider.. 
> 
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
> 
> Alex, how about a more fleshed out suggestion:
> 
>  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> it communicates its no-snoop configuration:

Communicates to whom?

>  - 0 enable, allow WBINVD
>  - 1 automatic disable, block WBINVD if the platform
>IOMMU can police it (what we do today)
>  - 2 force disable, do not allow WBINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent").  If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed
non-coherent AND it's attached to an IOMMU (page table) that is not
blocking no-snoop (a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this).  All mappings in this IOASID would use IOMMU_CACHE,
and devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
these IOASIDs exist, access to wbinvd would not be provided.  (How does
a user provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities.  Only if an
assumed non-coherent device is attached would the wbinvd be allowed.

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.
 
> vfio_pci may want to take this from an admin configuration knob
> someplace. It allows the admin to customize if they want.
> 
> If we can figure out a way to autodetect 2 from vfio_pci, all the
> better
> 
>  2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
> to access wbinvd so it can make use of the no snoop optimization.
> 
> wbinvd is allowed when:
>   - A device is joined with mode #0
>   - A device is joined with mode #1 and the IOMMU cannot block
> no-snoop (today)
> 
>  3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
> is blocked and userspace doesn't request to block no-snoop in the
> IOASID then it is a userspace error.

In my model above, the IOASID is central to this.
 
>  4) The KVM interface is the very simple enable/disable WBINVD.
> Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> to enable WBINVD at KVM.

Right, and in the new world order, vfio is only a device driver; the
IOASID manages the device's DMA.  wbinvd is only necessary relative to
non-coherent DMA, so it seems QEMU needs to bump KVM with an
ioasidfd.
 
> It is pret

Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Nadav Amit


> On Jun 4, 2021, at 11:53 AM, Robin Murphy  wrote:
> 
> On 2021-06-04 18:10, Nadav Amit wrote:
>>> On Jun 4, 2021, at 8:38 AM, Joerg Roedel  wrote:
>>> 
>>> Hi Nadav,
>>> 
>>> [Adding Robin]
>>> 
>>> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
 Nadav Amit (4):
  iommu/amd: Fix wrong parentheses on page-specific invalidations
>>> 
>>> This patch is already upstream in v5.13-rc4. Please rebase to that
>>> version.
>> I guess it would be rc5 by the time I send it.
>>> 
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not sync on page size changes
  iommu/amd: Do not use flush-queue when NpCache is on
>>> 
>>> And I think there have been objections from Robin Murphy on Patch 3,
>>> have those been worked out?
>> I am still waiting for Robin’s feedback on my proposed changes. If he does 
>> not respond soon, I will drop this patch for now.
> 
> Apologies, it feels like I've spent most of this week fighting fires,
> and a great deal of email got skimmed and mentally filed under "nothing
> so wrong that I need to respond immediately"...
> 
> FWIW I would have written the simpler patch below, but beyond that I
> think it might start descending into bikeshedding - if you still prefer
> your more comprehensive refactoring, or something in between, then don't
> let my personal preference in style/complexity trade-offs stand in the
> way of getting a useful functional change into the AMD driver. Whichever
> way, though, I *am* now sold on the idea of having some kerneldoc to
> clarify these things.

Thanks, I appreciate your feedback.

I will add kerneldoc as you indicated.

I see you took some parts of the patch I did for MediaTek, but I think this is 
not good enough for AMD, since AMD behavior should be different from MediaTek's; 
they have different needs:

MediaTek wants as few IOTLB flushes as possible, even if it results in flushing 
of many irrelevant (unmodified) entries between start and end. That’s the 
reason it can just use iommu_iotlb_gather_update_range().

In contrast, for AMD we do not want to flush too many irrelevant entries, 
specifically if the IOMMU is virtualized. When an IOTLB flush is initiated 
by the VM, the hypervisor needs to scan the IOMMU page-tables for changes and 
synchronize it with the physical IOMMU. You don’t want this range to be too 
big, and that is the reason I needed iommu_iotlb_gather_is_disjoint().

I will add documentation, since clearly this information was not conveyed well 
enough.

Thanks again,
Nadav



Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices

2021-06-04 Thread Bjorn Helgaas
[+cc John, who tested 6bf6c24720d3]

On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote:
> From: Xingang Wang 
> 
> When booting with devicetree, the pci_request_acs() is called after the
> enumeration and initialization of PCI devices, thus the ACS is not
> enabled. And ACS should be enabled when IOMMU is detected for the
> PCI host bridge, so add check for IOMMU before probe of PCI host and call
> pci_request_acs() to make sure ACS will be enabled when enumerating PCI
> devices.

I'm happy to apply this, but I'm a little puzzled about 6bf6c24720d3
("iommu/of: Request ACS from the PCI core when configuring IOMMU
linkage").  It was tested and fixed a problem, but I don't understand
how.

6bf6c24720d3 added the call to pci_request_acs() in
of_iommu_configure() so it currently looks like this:

  of_iommu_configure(dev, ...)
  {
if (dev_is_pci(dev))
  pci_request_acs();

pci_request_acs() sets pci_acs_enable, which tells us to enable ACS
when enumerating PCI devices in the future.  But we only call
pci_request_acs() if we already *have* a PCI device.

So maybe 6bf6c24720d3 fixed a problem for *some* PCI devices, but not
all?  E.g., did we call of_iommu_configure() for one PCI device before
enumerating the rest?

> Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when
> configuring IOMMU linkage")
> Signed-off-by: Xingang Wang 
> ---
>  drivers/iommu/of_iommu.c | 1 -
>  drivers/pci/of.c | 8 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> index a9d2df001149..54a14da242cc 100644
> --- a/drivers/iommu/of_iommu.c
> +++ b/drivers/iommu/of_iommu.c
> @@ -205,7 +205,6 @@ const struct iommu_ops *of_iommu_configure(struct device 
> *dev,
>   .np = master_np,
>   };
>  
> - pci_request_acs();
>   err = pci_for_each_dma_alias(to_pci_dev(dev),
>of_pci_iommu_init, &info);
>   } else {
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index da5b414d585a..2313c3f848b0 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -581,9 +581,15 @@ static int pci_parse_request_of_pci_ranges(struct device 
> *dev,
>  
>  int devm_of_pci_bridge_init(struct device *dev, struct pci_host_bridge 
> *bridge)
>  {
> - if (!dev->of_node)
> + struct device_node *node = dev->of_node;
> +
> + if (!node)
>   return 0;
>  
> + /* Detect IOMMU and make sure ACS will be enabled */
> + if (of_property_read_bool(node, "iommu-map"))
> + pci_request_acs();
> +
>   bridge->swizzle_irq = pci_common_swizzle;
>   bridge->map_irq = of_irq_parse_and_map_pci;
>  
> -- 
> 2.19.1
> 


Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Robin Murphy

On 2021-06-04 18:10, Nadav Amit wrote:
>> On Jun 4, 2021, at 8:38 AM, Joerg Roedel  wrote:
>> 
>> Hi Nadav,
>> 
>> [Adding Robin]
>> 
>> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
>>> Nadav Amit (4):
>>>  iommu/amd: Fix wrong parentheses on page-specific invalidations
>> 
>> This patch is already upstream in v5.13-rc4. Please rebase to that
>> version.
> 
> I guess it would be rc5 by the time I send it.
> 
>>>  iommu/amd: Selective flush on unmap
>>>  iommu/amd: Do not sync on page size changes
>>>  iommu/amd: Do not use flush-queue when NpCache is on
>> 
>> And I think there have been objections from Robin Murphy on Patch 3,
>> have those been worked out?
> 
> I am still waiting for Robin's feedback on my proposed changes. If he does
> not respond soon, I will drop this patch for now.


Apologies, it feels like I've spent most of this week fighting fires,
and a great deal of email got skimmed and mentally filed under "nothing
so wrong that I need to respond immediately"...

FWIW I would have written the simpler patch below, but beyond that I
think it might start descending into bikeshedding - if you still prefer
your more comprehensive refactoring, or something in between, then don't
let my personal preference in style/complexity trade-offs stand in the
way of getting a useful functional change into the AMD driver. Whichever
way, though, I *am* now sold on the idea of having some kerneldoc to
clarify these things.

Thanks,
Robin.

->8-
From: Robin Murphy 
Subject: [PATCH] iommu: Improve iommu_iotlb_gather helpers

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/mtk_iommu.c |  6 +-
 include/linux/iommu.h | 32 
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..cd457487ce81 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -521,12 +521,8 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
 			      struct iommu_iotlb_gather *gather)
 {
 	struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-	unsigned long end = iova + size - 1;
 
-	if (gather->start > iova)
-		gather->start = iova;
-	if (gather->end < end)
-		gather->end = end;
+	iommu_iotlb_gather_add_range(gather, iova, size);
 	return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..5f036e991937 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
 	iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
+						unsigned long iova, size_t size)
+{
+	unsigned long end = iova + size - 1;
+
+	if (gather->start > iova)
+		gather->start = iova;
+	if (gather->end < end)
+		gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
 					       struct iommu_iotlb_gather *gather,
 					       unsigned long iova, size_t size)
--
2.25.1

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jacob Pan
Hi Jason,

On Fri, 4 Jun 2021 13:22:00 -0300, Jason Gunthorpe  wrote:

> > 
> > Yes, in that case we should support both. Give the device driver a
> > chance to handle the IOPF if it can.  
> 
> Huh?
> 
> The device driver does not "handle the IOPF" the device driver might
> inject the IOPF.
You are right, I got confused with the native case where device drivers can
handle the fault, or do something about it.

Thanks,

Jacob


Re: [PATCH v8 00/15] Restricted DMA

2021-06-04 Thread Will Deacon
Hi Claire,

On Thu, May 27, 2021 at 08:58:30PM +0800, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
> 
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
> 
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> 
> [1a] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] 
> https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] 
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> 
> v8:
> - Fix reserved-memory.txt and add the reg property in example.
> - Fix sizeof for of_property_count_elems_of_size in
>   drivers/of/address.c#of_dma_set_restricted_buffer.
> - Apply Will's suggestion to try the OF node having DMA configuration in
>   drivers/of/address.c#of_dma_set_restricted_buffer.
> - Fix typo in the comment of 
> drivers/of/address.c#of_dma_set_restricted_buffer.
> - Add error message for PageHighMem in
>   kernel/dma/swiotlb.c#rmem_swiotlb_device_init and move it to
>   rmem_swiotlb_setup.
> - Fix the message string in rmem_swiotlb_setup.

Thanks for the v8. It works for me out of the box on arm64 under KVM, so:

Tested-by: Will Deacon 

Note that something seems to have gone wrong with the mail threading, so
the last 5 patches ended up as a separate thread for me. Probably worth
posting again with all the patches in one place, if you can.

Cheers,

Will


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 10:27:43AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe  wrote:
> 
> > On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> > 
> > > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > > what hardware features are available before going further with the
> > > vIOMMU setup (is there PASID, PRI, which page table formats are
> > > supported,  
> > 
> > I think David's point was that qemu should be told what vIOMMU it is
> > emulating exactly (right down to what features it has) and then
> > the goal is simply to match what the vIOMMU needs with direct HW
> > support via /dev/ioasid and fall back to SW emulation when not
> > possible.
> > 
> > If qemu wants to have some auto-configuration: 'pass host IOMMU
> > capabilities' similar to the CPU flags then qemu should probe the
> > /dev/ioasid - and maybe we should just return some highly rolled up
> > "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> > doing this.
> > 
> There can be mixed types of physical IOMMUs on the host. So until a
> device is attached, we would not know whether the vIOMMU can match the
> HW support of the device's IOMMU. Perhaps the vIOMMU should check the
> least-common-denominator features before committing.

qemu has to set the vIOMMU at VM startup time, so if it is running in
some "copy host" mode the only thing it can do is evaluate the VFIO
devices that are present at boot and select a vIOMMU from that list.

Probably it would pick the most capable physical IOMMU and software-
emulate the rest.

Platforms really should avoid creating wildly divergent IOMMUs in the
same system if they want to support virtualization effectively.

Jason


Re: [PATCH] iommu/amd: Tidy up DMA ops init

2021-06-04 Thread Robin Murphy

On 2021-06-04 16:26, Joerg Roedel wrote:
> On Thu, Jun 03, 2021 at 02:48:21PM +0100, Robin Murphy wrote:
>> As discussed on the report thread, I think it makes most sense to merge
>> this as a fix for 5.13 and not worry about any backporting.
>> 
>>  drivers/iommu/amd/amd_iommu.h |  2 --
>>  drivers/iommu/amd/init.c  |  5 -
>>  drivers/iommu/amd/iommu.c | 31 +--
>>  3 files changed, 13 insertions(+), 25 deletions(-)
> 
> Applied for v5.13, thanks Robin et al for the quick work on this
> regression.
> 
> Robin, do you by chance have a Fixes tag which I can add?


For the sake of justifying this as "fix" rather than "cleanup", you may 
as well use the flush queue commit cited in the patch log - I maintain 
there's nothing technically wrong with that commit itself, but it is the 
point at which the underlying issue becomes worth fixing due to how they 
interact ;)


Cheers,
Robin.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jacob Pan
Hi Jason,

On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe  wrote:

> On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> 
> > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > what hardware features are available before going further with the
> > vIOMMU setup (is there PASID, PRI, which page table formats are
> > supported,  
> 
> I think David's point was that qemu should be told what vIOMMU it is
> emulating exactly (right down to what features it has) and then
> the goal is simply to match what the vIOMMU needs with direct HW
> support via /dev/ioasid and fall back to SW emulation when not
> possible.
> 
> If qemu wants to have some auto-configuration: 'pass host IOMMU
> capabilities' similar to the CPU flags then qemu should probe the
> /dev/ioasid - and maybe we should just return some highly rolled up
> "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> doing this.
> 
There can be mixed types of physical IOMMUs on the host. So until a
device is attached, we would not know whether the vIOMMU can match the HW
support of the device's IOMMU. Perhaps the vIOMMU should check the
least common denominator of features before committing.

Thanks,

Jacob


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> On 04/06/21 18:03, Jason Gunthorpe wrote:
> > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > I don't want a security proof myself; I want to trust VFIO to make the 
> > > right
> > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > 
> > > Given how KVM is just a device driver inside Linux, VMs should be a 
> > > slightly
> > > more roundabout way to do stuff that is accessible to bare metal; not a 
> > > way
> > > to gain extra privilege.
> > 
> > Okay, fine, lets turn the question on its head then.
> > 
> > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > application can make use of no-snoop optimizations. The ability of KVM
> > to execute wbinvd should be tied to the ability of that IOCTL to run
> > in a normal process context.
> > 
> > So, under what conditions do we want to allow VFIO to give a process
> > elevated access to the CPU:
> 
> Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> which then would be on VFIO and not on KVM.  

At the end of the day we need an ioctl with two arguments:
 - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
 - The KVM FD to control wbinvd support on

Philosophically it doesn't matter too much which subsystem that ioctl
lives in, but we have these obnoxious cross-module dependencies to
consider...

Framing the question, as you have, to be about the process, I think
explains why KVM doesn't really care what is decided, so long as the
process and the VM have equivalent rights.

Alex, how about a more fleshed out suggestion:

 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
it communicates its no-snoop configuration:
 - 0 enable, allow WBINVD
 - 1 automatic disable, block WBINVD if the platform
   IOMMU can police it (what we do today)
 - 2 force disable, do not allow WBINVD ever

vfio_pci may want to take this from an admin configuration knob
someplace. It allows the admin to customize if they want.

If we can figure out a way to autodetect 2 from vfio_pci, all the
better

 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
to access wbinvd so it can make use of the no snoop optimization.

wbinvd is allowed when:
  - A device is joined with mode #0
  - A device is joined with mode #1 and the IOMMU cannot block
no-snoop (today)

 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
is blocked and userspace doesn't request to block no-snoop in the
IOASID then it is a userspace error.

 4) The KVM interface is the very simple enable/disable WBINVD.
Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
to enable WBINVD at KVM.

It is pretty simple from a /dev/ioasid perspective, covers today's
compat requirement, gives some future option to allow the no-snoop
optimization, and gives a new option for qemu to totally block wbinvd
no matter what.

Jason


Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Nadav Amit


> On Jun 4, 2021, at 8:38 AM, Joerg Roedel  wrote:
> 
> Hi Nadav,
> 
> [Adding Robin]
> 
> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
>> Nadav Amit (4):
>>  iommu/amd: Fix wrong parentheses on page-specific invalidations
> 
> This patch is already upstream in v5.13-rc4. Please rebase to that
> version.

I guess it would be rc5 by the time I send it.

> 
>>  iommu/amd: Selective flush on unmap
>>  iommu/amd: Do not sync on page size changes
>>  iommu/amd: Do not use flush-queue when NpCache is on
> 
> And I think there have been objections from Robin Murphy on Patch 3,
> have those been worked out?

I am still waiting for Robin’s feedback on my proposed changes. If he does not 
respond soon, I will drop this patch for now.

Thanks,
Nadav

[PATCH v8 3/4] iommu: rockchip: Add internal ops to handle variants

2021-06-04 Thread Benjamin Gaignard
Add internal ops to be able to handle the incoming v2 variant.
The goal is to keep the overall structure of the framework while
allowing for the evolution of this hardware block.

The ops are global for a SoC because iommu domains are not
attached to a specific device if they are for a virtual device like
drm. Using a global variable shouldn't be a problem since a SoC
usually doesn't embed different versions of the iommu hardware block.
If that happens one day, a WARN_ON will be displayed at probe time.

Signed-off-by: Benjamin Gaignard 
Reviewed-by: Heiko Stuebner 
---
version 7:
 - Set DMA mask
 - Add function to convert dma address to dte

version 6:
 - Remove #include 
 - Remove pt_address_mask field
 - Only use once of_device_get_match_data
 - Return an error if ops don't match

version 5:
 - Use of_device_get_match_data()
 - Add internal ops inside the driver

 drivers/iommu/rockchip-iommu.c | 86 +-
 1 file changed, 64 insertions(+), 22 deletions(-)

diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 7a2932772fdf..bd2cf7f08c71 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -96,6 +96,15 @@ static const char * const rk_iommu_clocks[] = {
"aclk", "iface",
 };
 
+struct rk_iommu_ops {
+   phys_addr_t (*pt_address)(u32 dte);
+   u32 (*mk_dtentries)(dma_addr_t pt_dma);
+   u32 (*mk_ptentries)(phys_addr_t page, int prot);
+   phys_addr_t (*dte_addr_phys)(u32 addr);
+   u32 (*dma_addr_dte)(dma_addr_t dt_dma);
+   u64 dma_bit_mask;
+};
+
 struct rk_iommu {
struct device *dev;
void __iomem **bases;
@@ -116,6 +125,7 @@ struct rk_iommudata {
 };
 
 static struct device *dma_dev;
+static const struct rk_iommu_ops *rk_ops;
 
 static inline void rk_table_flush(struct rk_iommu_domain *dom, dma_addr_t dma,
  unsigned int count)
@@ -215,11 +225,6 @@ static inline u32 rk_mk_dte(dma_addr_t pt_dma)
 #define RK_PTE_PAGE_READABLE  BIT(1)
 #define RK_PTE_PAGE_VALID BIT(0)
 
-static inline phys_addr_t rk_pte_page_address(u32 pte)
-{
-   return (phys_addr_t)pte & RK_PTE_PAGE_ADDRESS_MASK;
-}
-
 static inline bool rk_pte_is_page_valid(u32 pte)
 {
return pte & RK_PTE_PAGE_VALID;
@@ -448,10 +453,10 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu)
 * and verifying that upper 5 nybbles are read back.
 */
for (i = 0; i < iommu->num_mmu; i++) {
-   rk_iommu_write(iommu->bases[i], RK_MMU_DTE_ADDR, DTE_ADDR_DUMMY);
+   dte_addr = rk_ops->pt_address(DTE_ADDR_DUMMY);
+   rk_iommu_write(iommu->bases[i], RK_MMU_DTE_ADDR, dte_addr);
 
-   dte_addr = rk_iommu_read(iommu->bases[i], RK_MMU_DTE_ADDR);
-   if (dte_addr != (DTE_ADDR_DUMMY & RK_DTE_PT_ADDRESS_MASK)) {
+   if (dte_addr != rk_iommu_read(iommu->bases[i], RK_MMU_DTE_ADDR)) {
dev_err(iommu->dev, "Error during raw reset. MMU_DTE_ADDR is not functioning\n");
return -EFAULT;
}
@@ -470,6 +475,16 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu)
return 0;
 }
 
+static inline phys_addr_t rk_dte_addr_phys(u32 addr)
+{
+   return (phys_addr_t)addr;
+}
+
+static inline u32 rk_dma_addr_dte(dma_addr_t dt_dma)
+{
+   return dt_dma;
+}
+
 static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova)
 {
void __iomem *base = iommu->bases[index];
@@ -489,7 +504,7 @@ static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova)
page_offset = rk_iova_page_offset(iova);
 
mmu_dte_addr = rk_iommu_read(base, RK_MMU_DTE_ADDR);
-   mmu_dte_addr_phys = (phys_addr_t)mmu_dte_addr;
+   mmu_dte_addr_phys = rk_ops->dte_addr_phys(mmu_dte_addr);
 
dte_addr_phys = mmu_dte_addr_phys + (4 * dte_index);
dte_addr = phys_to_virt(dte_addr_phys);
@@ -498,14 +513,14 @@ static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova)
if (!rk_dte_is_pt_valid(dte))
goto print_it;
 
-   pte_addr_phys = rk_dte_pt_address(dte) + (pte_index * 4);
+   pte_addr_phys = rk_ops->pt_address(dte) + (pte_index * 4);
pte_addr = phys_to_virt(pte_addr_phys);
pte = *pte_addr;
 
if (!rk_pte_is_page_valid(pte))
goto print_it;
 
-   page_addr_phys = rk_pte_page_address(pte) + page_offset;
+   page_addr_phys = rk_ops->pt_address(pte) + page_offset;
page_flags = pte & RK_PTE_PAGE_FLAGS_MASK;
 
 print_it:
@@ -601,13 +616,13 @@ static phys_addr_t rk_iommu_iova_to_phys(struct iommu_domain *domain,
if (!rk_dte_is_pt_valid(dte))
goto out;
 
-   pt_phys = rk_dte_pt_address(dte);
+   pt_phys = rk_ops->pt_address(dte);
page_table = (u32 *)phys_to_virt(pt_phys);
pte = page_table[rk_iova_pte_index(iova)];
if (!rk_pte_is_page_valid(pte))
goto out;

[PATCH v8 1/4] dt-bindings: iommu: rockchip: Convert IOMMU to DT schema

2021-06-04 Thread Benjamin Gaignard
Convert Rockchip IOMMU to DT schema

Signed-off-by: Benjamin Gaignard 
Reviewed-by: Rob Herring 
Reviewed-by: Heiko Stuebner 
---
 .../bindings/iommu/rockchip,iommu.txt | 38 -
 .../bindings/iommu/rockchip,iommu.yaml| 80 +++
 2 files changed, 80 insertions(+), 38 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.txt
 create mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml

diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt b/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt
deleted file mode 100644
index 6ecefea1c6f9..
--- a/Documentation/devicetree/bindings/iommu/rockchip,iommu.txt
+++ /dev/null
@@ -1,38 +0,0 @@
-Rockchip IOMMU
-==
-
-A Rockchip DRM iommu translates io virtual addresses to physical addresses for
-its master device.  Each slave device is bound to a single master device, and
-shares its clocks, power domain and irq.
-
-Required properties:
-- compatible  : Should be "rockchip,iommu"
-- reg : Address space for the configuration registers
-- interrupts  : Interrupt specifier for the IOMMU instance
-- interrupt-names : Interrupt name for the IOMMU instance
-- #iommu-cells: Should be <0>.  This indicates the iommu is a
-"single-master" device, and needs no additional information
-to associate with its master device.  See:
-Documentation/devicetree/bindings/iommu/iommu.txt
-- clocks  : A list of clocks required for the IOMMU to be accessible by
-the host CPU.
-- clock-names : Should contain the following:
-   "iface" - Main peripheral bus clock (PCLK/HCL) (required)
-   "aclk"  - AXI bus clock (required)
-
-Optional properties:
-- rockchip,disable-mmu-reset : Don't use the mmu reset operation.
-  Some mmu instances may produce unexpected results
-  when the reset operation is used.
-
-Example:
-
-   vopl_mmu: iommu@ff940300 {
-   compatible = "rockchip,iommu";
-   reg = <0xff940300 0x100>;
-   interrupts = ;
-   interrupt-names = "vopl_mmu";
-   clocks = <&cru ACLK_VOP1>, <&cru HCLK_VOP1>;
-   clock-names = "aclk", "iface";
-   #iommu-cells = <0>;
-   };
diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml
new file mode 100644
index ..099fc2578b54
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/rockchip,iommu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip IOMMU
+
+maintainers:
+  - Heiko Stuebner 
+
+description: |+
+  A Rockchip DRM iommu translates io virtual addresses to physical addresses for
+  its master device. Each slave device is bound to a single master device and
+  shares its clocks, power domain and irq.
+
+  For information on assigning IOMMU controller to its peripheral devices,
+  see generic IOMMU bindings.
+
+properties:
+  compatible:
+const: rockchip,iommu
+
+  reg:
+items:
+  - description: configuration registers for MMU instance 0
+  - description: configuration registers for MMU instance 1
+minItems: 1
+maxItems: 2
+
+  interrupts:
+items:
+  - description: interruption for MMU instance 0
+  - description: interruption for MMU instance 1
+minItems: 1
+maxItems: 2
+
+  clocks:
+items:
+  - description: Core clock
+  - description: Interface clock
+
+  clock-names:
+items:
+  - const: aclk
+  - const: iface
+
+  "#iommu-cells":
+const: 0
+
+  rockchip,disable-mmu-reset:
+$ref: /schemas/types.yaml#/definitions/flag
+description: |
+  Do not use the mmu reset operation.
+  Some mmu instances may produce unexpected results
+  when the reset operation is used.
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+  - clock-names
+  - "#iommu-cells"
+
+additionalProperties: false
+
+examples:
+  - |
+#include 
+#include 
+
+vopl_mmu: iommu@ff940300 {
+  compatible = "rockchip,iommu";
+  reg = <0xff940300 0x100>;
+  interrupts = ;
+  clocks = <&cru ACLK_VOP1>, <&cru HCLK_VOP1>;
+  clock-names = "aclk", "iface";
+  #iommu-cells = <0>;
+};
-- 
2.25.1



[PATCH v8 4/4] iommu: rockchip: Add support for iommu v2

2021-06-04 Thread Benjamin Gaignard
This second version of the hardware block has a different bit
mapping for the page table entries.
Add the ops matching this new mapping.
Define a new compatible string to distinguish it from the first version.

Signed-off-by: Benjamin Gaignard 
Reviewed-by: Heiko Stuebner 
---
version 8:
 - Fix compilation issue

version 7:
 - Set dma_bit_mask field.
 - Add rk_dma_addr_dte_v2 function

version 5:
 - Use internal ops to support v2 hardware block
 - Use GENMASK macro.
 - Keep rk_dte_pt_address() and rk_dte_pt_address_v2() separated
   because I believe that is more readable like this.
 - Do not duplicate code.

 drivers/iommu/rockchip-iommu.c | 85 ++
 1 file changed, 85 insertions(+)

diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index bd2cf7f08c71..16dd2bf4a859 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -189,6 +189,33 @@ static inline phys_addr_t rk_dte_pt_address(u32 dte)
return (phys_addr_t)dte & RK_DTE_PT_ADDRESS_MASK;
 }
 
+/*
+ * In v2:
+ * 31:12 - PT address bit 31:0
+ * 11: 8 - PT address bit 35:32
+ *  7: 4 - PT address bit 39:36
+ *  3: 1 - Reserved
+ * 0 - 1 if PT @ PT address is valid
+ */
+#define RK_DTE_PT_ADDRESS_MASK_V2 GENMASK_ULL(31, 4)
+#define DTE_HI_MASK1   GENMASK(11, 8)
+#define DTE_HI_MASK2   GENMASK(7, 4)
+#define DTE_HI_SHIFT1  24 /* shift bit 8 to bit 32 */
+#define DTE_HI_SHIFT2  32 /* shift bit 4 to bit 36 */
+#define PAGE_DESC_HI_MASK1 GENMASK_ULL(39, 36)
+#define PAGE_DESC_HI_MASK2 GENMASK_ULL(35, 32)
+
+static inline phys_addr_t rk_dte_pt_address_v2(u32 dte)
+{
+   u64 dte_v2 = dte;
+
+   dte_v2 = ((dte_v2 & DTE_HI_MASK2) << DTE_HI_SHIFT2) |
+((dte_v2 & DTE_HI_MASK1) << DTE_HI_SHIFT1) |
+(dte_v2 & RK_DTE_PT_ADDRESS_MASK);
+
+   return (phys_addr_t)dte_v2;
+}
+
 static inline bool rk_dte_is_pt_valid(u32 dte)
 {
return dte & RK_DTE_PT_VALID;
@@ -199,6 +226,15 @@ static inline u32 rk_mk_dte(dma_addr_t pt_dma)
return (pt_dma & RK_DTE_PT_ADDRESS_MASK) | RK_DTE_PT_VALID;
 }
 
+static inline u32 rk_mk_dte_v2(dma_addr_t pt_dma)
+{
+   pt_dma = (pt_dma & RK_DTE_PT_ADDRESS_MASK) |
+((pt_dma & PAGE_DESC_HI_MASK1) >> DTE_HI_SHIFT1) |
+(pt_dma & PAGE_DESC_HI_MASK2) >> DTE_HI_SHIFT2;
+
+   return (pt_dma & RK_DTE_PT_ADDRESS_MASK_V2) | RK_DTE_PT_VALID;
+}
+
 /*
  * Each PTE has a Page address, some flags and a valid bit:
  * +-+---+---+-+
@@ -240,6 +276,29 @@ static u32 rk_mk_pte(phys_addr_t page, int prot)
return page | flags | RK_PTE_PAGE_VALID;
 }
 
+/*
+ * In v2:
+ * 31:12 - Page address bit 31:0
+ *  11:9 - Page address bit 34:32
+ *   8:4 - Page address bit 39:35
+ * 3 - Security
+ * 2 - Readable
+ * 1 - Writable
+ * 0 - 1 if Page @ Page address is valid
+ */
+#define RK_PTE_PAGE_READABLE_V2  BIT(2)
+#define RK_PTE_PAGE_WRITABLE_V2  BIT(1)
+
+static u32 rk_mk_pte_v2(phys_addr_t page, int prot)
+{
+   u32 flags = 0;
+
+   flags |= (prot & IOMMU_READ) ? RK_PTE_PAGE_READABLE_V2 : 0;
+   flags |= (prot & IOMMU_WRITE) ? RK_PTE_PAGE_WRITABLE_V2 : 0;
+
+   return rk_mk_dte_v2(page) | flags;
+}
+
 static u32 rk_mk_pte_invalid(u32 pte)
 {
return pte & ~RK_PTE_PAGE_VALID;
@@ -485,6 +544,21 @@ static inline u32 rk_dma_addr_dte(dma_addr_t dt_dma)
return dt_dma;
 }
 
+#define DT_HI_MASK GENMASK_ULL(39, 32)
+#define DT_SHIFT   28
+
+static inline phys_addr_t rk_dte_addr_phys_v2(u32 addr)
+{
+   return (phys_addr_t)(addr & RK_DTE_PT_ADDRESS_MASK) |
+  ((addr & DT_HI_MASK) << DT_SHIFT);
+}
+
+static inline u32 rk_dma_addr_dte_v2(dma_addr_t dt_dma)
+{
+   return (dt_dma & RK_DTE_PT_ADDRESS_MASK) |
+  ((dt_dma & DT_HI_MASK) >> DT_SHIFT);
+}
+
 static void log_iova(struct rk_iommu *iommu, int index, dma_addr_t iova)
 {
void __iomem *base = iommu->bases[index];
@@ -1316,11 +1390,22 @@ static struct rk_iommu_ops iommu_data_ops_v1 = {
.dma_bit_mask = DMA_BIT_MASK(32),
 };
 
+static struct rk_iommu_ops iommu_data_ops_v2 = {
+   .pt_address = &rk_dte_pt_address_v2,
+   .mk_dtentries = &rk_mk_dte_v2,
+   .mk_ptentries = &rk_mk_pte_v2,
+   .dte_addr_phys = &rk_dte_addr_phys_v2,
+   .dma_addr_dte = &rk_dma_addr_dte_v2,
+   .dma_bit_mask = DMA_BIT_MASK(40),
+};
 
 static const struct of_device_id rk_iommu_dt_ids[] = {
{   .compatible = "rockchip,iommu",
.data = &iommu_data_ops_v1,
},
+   {   .compatible = "rockchip,rk3568-iommu",
+   .data = &iommu_data_ops_v2,
+   },
{ /* sentinel */ }
 };
 
-- 
2.25.1



[PATCH v8 2/4] dt-bindings: iommu: rockchip: Add compatible for v2

2021-06-04 Thread Benjamin Gaignard
Add a compatible string for the second version of the IOMMU hardware block.
The RK356x IOMMU can also be linked to a power domain.

Signed-off-by: Benjamin Gaignard 
Reviewed-by: Rob Herring 
Reviewed-by: Heiko Stuebner 
---
 .../devicetree/bindings/iommu/rockchip,iommu.yaml  | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml
index 099fc2578b54..d2e28a9e3545 100644
--- a/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml
+++ b/Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml
@@ -19,7 +19,9 @@ description: |+
 
 properties:
   compatible:
-const: rockchip,iommu
+enum:
+  - rockchip,iommu
+  - rockchip,rk3568-iommu
 
   reg:
 items:
@@ -48,6 +50,9 @@ properties:
   "#iommu-cells":
 const: 0
 
+  power-domains:
+maxItems: 1
+
   rockchip,disable-mmu-reset:
 $ref: /schemas/types.yaml#/definitions/flag
 description: |
-- 
2.25.1



[PATCH v8 0/4] Add IOMMU driver for rk356x

2021-06-04 Thread Benjamin Gaignard
This series adds the IOMMU driver for the rk356x SoC.
Since a new compatible string is needed to distinguish this second version
of the IOMMU hardware block from the first one, it is an opportunity to
convert the binding to DT schema.

version 8:
 - Fix compilation issue.

version 7:
 - Set DMA mask
 - Fix rk_iommu_enable()
 - rebased on v5.13-rc3 tag

version 6:
 - Remove #include 
 - Remove pt_address_mask field
 - Only use once of_device_get_match_data
 - Return an error if ops don't match

version 5:
 - Add internal ops inside the driver to be able to add variants.
 - Add support of v2 variant.
 - Stop using 'version' field
 - Use GENMASK macro

version 4:
 - Add description for reg items
 - Remove useless interrupt-names properties
 - Add description for interrupts items
 - Remove interrupt-names properties from DST files

version 3:
 - Rename compatible with soc prefix
 - Rebase on v5.12 tag

version 2:
 - Fix iommu-cells typo in rk322x.dtsi
 - Change maintainer
 - Change reg maxItems
 - Add power-domains property

Benjamin Gaignard (4):
  dt-bindings: iommu: rockchip: Convert IOMMU to DT schema
  dt-bindings: iommu: rockchip: Add compatible for v2
  iommu: rockchip: Add internal ops to handle variants
  iommu: rockchip: Add support for iommu v2

 .../bindings/iommu/rockchip,iommu.txt |  38 
 .../bindings/iommu/rockchip,iommu.yaml|  85 +
 drivers/iommu/rockchip-iommu.c| 171 +++---
 3 files changed, 234 insertions(+), 60 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.txt
 create mode 100644 Documentation/devicetree/bindings/iommu/rockchip,iommu.yaml

-- 
2.25.1



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 09:22:43AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang  wrote:
> 
> > On 2021/6/4 2:19 AM, Jacob Pan wrote:
> > > Hi Shenming,
> > >
> > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu 
> > > wrote:
> > >  
> > >> On 2021/6/2 1:33, Jason Gunthorpe wrote:  
> > >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> > >>>  
> >  The drivers register per page table fault handlers to /dev/ioasid
> >  which will then register itself to iommu core to listen and route
> >  the per- device I/O page faults.  
> > >>> I'm still confused why drivers need fault handlers at all?  
> > >> Essentially it is the userspace that needs the fault handlers,
> > >> one case is to deliver the faults to the vIOMMU, and another
> > >> case is to enable IOPF on the GPA address space for on-demand
> > >> paging, it seems that both could be specified in/through the
> > >> IOASID_ALLOC ioctl?
> > >>  
> > > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > > registered. There wouldn't be any IO page fault without the binding
> > > anyway.
> > >
> > > I also don't understand why device drivers should register the fault
> > > handler, the fault is detected by the pIOMMU and injected to the
> > > vIOMMU. So I think it should be the IOASID itself register the handler.
> > >  
> > 
> > 
> > As discussed in another thread.
> > 
> > I think the reason is that ATS doesn't forbid the #PF to be reported via 
> > a device-specific way.
> 
> Yes, in that case we should support both. Give the device driver a chance
> to handle the IOPF if it can.

Huh?

The device driver does not "handle the IOPF" the device driver might
inject the IOPF.

Jason

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jacob Pan
Hi Jason,

On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang  wrote:

> On 2021/6/4 2:19 AM, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu 
> > wrote:
> >  
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:  
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>  
>  The drivers register per page table fault handlers to /dev/ioasid
>  which will then register itself to iommu core to listen and route
>  the per- device I/O page faults.  
> >>> I'm still confused why drivers need fault handlers at all?  
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>  
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> > anyway.
> >
> > I also don't understand why device drivers should register the fault
> > handler, the fault is detected by the pIOMMU and injected to the
> > vIOMMU. So I think it should be the IOASID itself register the handler.
> >  
> 
> 
> As discussed in another thread.
> 
> I think the reason is that ATS doesn't forbid the #PF to be reported via 
> a device-specific way.
> 
Yes, in that case we should support both. Give the device driver a chance
to handle the IOPF if it can.

> Thanks
> 
> 
> >  
> >> Thanks,
> >> Shenming
> >>  
> >
> > Thanks,
> >
> > Jacob
> >  
> 


Thanks,

Jacob

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Paolo Bonzini

On 04/06/21 18:03, Jason Gunthorpe wrote:

On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:

I don't want a security proof myself; I want to trust VFIO to make the right
judgment and I'm happy to defer to it (via the KVM-VFIO device).

Given how KVM is just a device driver inside Linux, VMs should be a slightly
more roundabout way to do stuff that is accessible to bare metal; not a way
to gain extra privilege.


Okay, fine, lets turn the question on its head then.

VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
application can make use of no-snoop optimizations. The ability of KVM
to execute wbinvd should be tied to the ability of that IOCTL to run
in a normal process context.

So, under what conditions do we want to allow VFIO to give a process
elevated access to the CPU:


Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e. 
#2+#3 would be worse than what we have today), but IIUC the proposal 
(was it yours or Kevin's?) was to keep #2 and add #1 with an 
enable/disable ioctl, which then would be on VFIO and not on KVM.  I 
assumed Alex was more or less okay with it, given he included me in the 
discussion.


If later y'all switch to "it's always okay to issue the enable/disable 
ioctl", I guess the rationale would be documented in the commit message.


Paolo


   1) User has access to a device that can issue no-snoop TLPS
   2) User has access to an IOMMU that can not block no-snoop (today)
   3) Require CAP_SYS_RAW_IO
   4) Anyone


Jason





Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:50, Jason Gunthorpe wrote:
> > > Extending the scenarios where WBINVD is not a nop is not a problem for me.
> > > If possible I wouldn't mind keeping the existing kvm-vfio connection via 
> > > the
> > > device, if only because then the decision remains in the VFIO camp (whose
> > > judgment I trust more than mine on this kind of issue).
> > Really the question to answer is what "security proof" do you want
> > before the wbinvd can be enabled
> 
> I don't want a security proof myself; I want to trust VFIO to make the right
> judgment and I'm happy to defer to it (via the KVM-VFIO device).
> 
> Given how KVM is just a device driver inside Linux, VMs should be a slightly
> more roundabout way to do stuff that is accessible to bare metal; not a way
> to gain extra privilege.

Okay, fine, lets turn the question on its head then.

VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
application can make use of no-snoop optimizations. The ability of KVM
to execute wbinvd should be tied to the ability of that IOCTL to run
in a normal process context.

So, under what conditions do we want to allow VFIO to give a process
elevated access to the CPU:

> >   1) User has access to a device that can issue no-snoop TLPS
> >   2) User has access to an IOMMU that can not block no-snoop (today)
> >   3) Require CAP_SYS_RAW_IO
> >   4) Anyone

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Paolo Bonzini

On 04/06/21 17:50, Jason Gunthorpe wrote:

Extending the scenarios where WBINVD is not a nop is not a problem for me.
If possible I wouldn't mind keeping the existing kvm-vfio connection via the
device, if only because then the decision remains in the VFIO camp (whose
judgment I trust more than mine on this kind of issue).

Really the question to answer is what "security proof" do you want
before the wbinvd can be enabled


I don't want a security proof myself; I want to trust VFIO to make the 
right judgment and I'm happy to defer to it (via the KVM-VFIO device).


Given how KVM is just a device driver inside Linux, VMs should be a 
slightly more roundabout way to do stuff that is accessible to bare 
metal; not a way to gain extra privilege.


Paolo


  1) User has access to a device that can issue no-snoop TLPS
  2) User has access to an IOMMU that can not block no-snoop (today)
  3) Require CAP_SYS_RAW_IO
  4) Anyone

#1 is an improvement because it allows userspace to enable wbinvd and
no-snoop optimizations based on user choice

#2 is where we are today and wbinvd effectively becomes a fixed
platform choice. Userspace has no say

#3 is "there is a problem, but not so serious, root is powerful
enough to override"




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 05:40:34PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:26, Alex Williamson wrote:
> > Let's make sure the KVM folks are part of this decision; a re-cap for
> > them, KVM currently automatically enables wbinvd emulation when
> > potentially non-coherent devices are present which is determined solely
> > based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
> > to essentially force no-snoop transactions from a device to be cache
> > coherent.  This synchronization is triggered via the kvm-vfio device,
> > where QEMU creates the device and adds/removes vfio group fd
> > descriptors as an additionally layer to prevent the user from enabling
> > wbinvd emulation on a whim.
> > 
> > IIRC, this latter association was considered a security/DoS issue to
> > prevent a malicious guest/userspace from creating a disproportionate
> > system load.
> > 
> > Where would KVM stand on allowing more direct userspace control of
> > wbinvd behavior?  Would arbitrary control be acceptable or should we
> > continue to require it only in association with a device requiring it for
> > correct operation?
> 
> Extending the scenarios where WBINVD is not a nop is not a problem for me.
> If possible I wouldn't mind keeping the existing kvm-vfio connection via the
> device, if only because then the decision remains in the VFIO camp (whose
> judgment I trust more than mine on this kind of issue).

Really the question to answer is what "security proof" do you want
before the wbinvd can be enabled

 1) User has access to a device that can issue no-snoop TLPs
 2) User has access to an IOMMU that can not block no-snoop (today)
 3) Require CAP_SYS_RAW_IO
 4) Anyone

#1 is an improvement because it allows userspace to enable wbinvd and
no-snoop optimizations based on user choice

#2 is where we are today and wbinvd effectively becomes a fixed
platform choice. Userspace has no say

#3 is "there is a problem, but not so serious, root is powerful
   enough to override"

#4 is "there is no problem here"

Jason


Re: [PATCH 1/2] iommu: Remove unused of_get_dma_window()

2021-06-04 Thread Joerg Roedel
On Thu, May 27, 2021 at 02:37:09PM -0500, Rob Herring wrote:
>  drivers/iommu/of_iommu.c | 68 
>  include/linux/of_iommu.h | 17 ++
>  2 files changed, 3 insertions(+), 82 deletions(-)

Applied both, thanks.


Re: Different type iommus integrated in a SoC

2021-06-04 Thread joro
On Thu, Jun 03, 2021 at 01:05:43PM +0100, Robin Murphy wrote:
> Hooray! I've been forecasting this for years, but the cases we regularly hit
> with internal FPGA prototyping (or the secret unused MMU-400 I found on
> RK3288) have never really been a strong enough argument to stand behind.
> 
> Based on what I remember from looking into this a few years ago, converting
> *most* of the API to per-device ops (now via dev->iommu) is trivial; the
> main challenge will be getting the per-device data bootstrapped in
> iommu_probe_device(), which would probably need to rely on the fwspec and/or
> list of registered IOMMU instances.
> 
> The other notable thing which will need to change is the domain allocation
> interface, but in practice I think everyone who calls iommu_domain_alloc()
> today is in fact doing so for a specific device, so I don't think it's as
> big a problem as it might first appear.

Yeah, I think for that we have to give up on the promise that a domain
can be assigned to _any_ device. But this promise doesn't even hold true
now when there are several IOMMUs of the same type but with different
feature sets in a system.

So I'll happily review patches enabling multi-IOMMU SoCs :)

Regards,

Joerg


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Paolo Bonzini

On 04/06/21 17:26, Alex Williamson wrote:

Let's make sure the KVM folks are part of this decision; a re-cap for
them, KVM currently automatically enables wbinvd emulation when
potentially non-coherent devices are present which is determined solely
based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
to essentially force no-snoop transactions from a device to be cache
coherent.  This synchronization is triggered via the kvm-vfio device,
where QEMU creates the device and adds/removes vfio group fd
descriptors as an additional layer to prevent the user from enabling
wbinvd emulation on a whim.

IIRC, this latter association was considered a security/DoS issue to
prevent a malicious guest/userspace from creating a disproportionate
system load.

Where would KVM stand on allowing more direct userspace control of
wbinvd behavior?  Would arbitrary control be acceptable or should we
continue to require it only in association with a device requiring it for
correct operation?


Extending the scenarios where WBINVD is not a nop is not a problem for 
me.  If possible I wouldn't mind keeping the existing kvm-vfio 
connection via the device, if only because then the decision remains in 
the VFIO camp (whose judgment I trust more than mine on this kind of issue).


For example, would it make sense if *VFIO* (not KVM) gets an API that 
says "I am going to do incoherent DMA"?  Then that API causes WBINVD to 
become not-a-nop even on otherwise coherent platforms.  (Would this make 
sense at all without a hypervisor that indirectly lets userspace execute 
WBINVD?  Perhaps VFIO would benefit from a WBINVD ioctl too).


Paolo



Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Joerg Roedel
Hi Nadav,

[Adding Robin]

On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
> Nadav Amit (4):
>   iommu/amd: Fix wrong parentheses on page-specific invalidations

This patch is already upstream in v5.13-rc4. Please rebase to that
version.

>   iommu/amd: Selective flush on unmap
>   iommu/amd: Do not sync on page size changes
>   iommu/amd: Do not use flush-queue when NpCache is on

And I think there have been objections from Robin Murphy on Patch 3,
have those been worked out?

Regards,

Joerg


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Alex Williamson
On Fri, 4 Jun 2021 09:19:50 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Friday, June 4, 2021 4:42 AM
> >   
> > > 'qemu --allow-no-snoop' makes more sense to me  
> > 
> > I'd be tempted to attach it to the -device vfio-pci option, it's
> > specific drivers for specific devices that are going to want this and
> > those devices may not be permanently attached to the VM.  But I see in
> > the other thread you're trying to optimize IOMMU page table sharing.
> > 
> > There's a usability question in either case though and I'm not sure how
> > to get around it other than QEMU or the kernel knowing a list of
> > devices (explicit IDs or vendor+class) to select per device defaults.
> >   
> 
> "-device vfio-pci" is a per-device option, which implies that the
> no-snoop choice is given to the admin, so there is no need to maintain
> a fixed device list in Qemu?

I think we want to look at where we put it to have the best default
user experience.  For example the QEMU vfio-pci device option could use
on/off/auto semantics where auto is the default and QEMU maintains a
list of IDs or vendor/class configurations where we've determined the
"optimal" auto configuration.  Management tools could provide an
override, but we're imposing some pretty technical requirements for a
management tool to be able to come up with good per device defaults.
Seems like we should consolidate that technical decision in one place.
Thanks,

Alex



Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices

2021-06-04 Thread Joerg Roedel
On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote:
> From: Xingang Wang 
> 
> When booting with devicetree, the pci_request_acs() is called after the
> enumeration and initialization of PCI devices, thus ACS is not
> enabled. ACS should be enabled when an IOMMU is detected for the
> PCI host bridge, so add a check for an IOMMU before probing the PCI host and call
> pci_request_acs() to make sure ACS will be enabled when enumerating PCI
> devices.
> 
> Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when
> configuring IOMMU linkage")
> Signed-off-by: Xingang Wang 
> ---
>  drivers/iommu/of_iommu.c | 1 -
>  drivers/pci/of.c | 8 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)

Should probably go through the PCI tree, so

Acked-by: Joerg Roedel 



Re: [RESEND PATCH v3] iommu/iova: put free_iova_mem() outside of spinlock iova_rbtree_lock

2021-06-04 Thread Joerg Roedel
On Mon, May 10, 2021 at 07:53:02PM +0800, chenxiang wrote:
> From: Xiang Chen 
> 
> It is not necessary to call free_iova_mem() inside the iova_rbtree_lock
> critical section; doing so only leads to more contention on the spinlock.
> Moving it out gives a small performance improvement. Also rename
> private_free_iova() to remove_iova(), because the function no longer
> frees the iova after this change.
> 
> Signed-off-by: Xiang Chen 
> Reviewed-by: John Garry 
> ---
>  drivers/iommu/iova.c | 18 +++---
>  1 file changed, 11 insertions(+), 7 deletions(-)

Applied, thanks.


Re: [PATCH v3] iommu/dma: Fix IOVA reserve dma ranges

2021-06-04 Thread Joerg Roedel
On Mon, Sep 14, 2020 at 12:53:19PM +0530, Srinath Mannam wrote:
> Fix IOVA reserve failure in the case when address of first memory region
> listed in dma-ranges is equal to 0x0.
> 
> Fixes: aadad097cd46f ("iommu/dma: Reserve IOVA for PCIe inaccessible DMA 
> address")
> Signed-off-by: Srinath Mannam 
> ---
> Changes from v2:
>Modify error message with useful information based on Bjorn's comments.
> 
> Changes from v1:
>Removed unnecessary changes based on Robin's review comments.
> 
>  drivers/iommu/dma-iommu.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)

Applied, thanks.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Alex Williamson
[Cc +Paolo]

On Fri, 4 Jun 2021 09:28:30 -0300
Jason Gunthorpe  wrote:

> On Fri, Jun 04, 2021 at 08:38:26AM +, Tian, Kevin wrote:
> > > I think more to drive the replacement design; if we can't figure out
> > > how to do something other than backwards compatibility trickery in the
> > > kernel, it's probably going to bite us.  Thanks,  
> > 
> > I'm a bit lost on the desired flow in your minds. Here is one flow based
> > on my understanding of this discussion. Please comment whether it
> > matches your thinking:
> > 
> > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> > 
> > 1) Qemu binds dev1 to ioasid_fd;
> > 
> > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
> >  CACHE info i.e. whether underlying IOMMU can enforce snoop;
> > 
> > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> > needs to figure out whether dev1 wants to do no-snoop. This might
> > > be based on a fixed vendor/class list or specified by the user;
> > 
> > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
> >  flag is specified to decide the page table format, which is supposed
> >  to match dev1;  
> 
> > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
> >  point, specify snoop/no-snoop again. If not supported by related 
> >  iommu or different from what gpa_ioasid has, attach fails.  
> 
> Why do we need to specify it again?

My thought as well.

> If the IOASID was created with the "block no-snoop" flag then it is
> blocked in that IOASID, and that blocking sets the page table format.
> 
> The only question is if we can successfully attach a device to the
> page table, or not.
> 
> The KVM interface is a bit tricky because Alex said this is partially
> security, wbinvd is only enabled if someone has a FD to a device that
> can support no-snoop. 
> 
> Personally I think this got way too complicated, the KVM interface
> should simply be
> 
> ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> 
> and let qemu sort it out based on command flags, detection, whatever.
> 
> 'ioasidfd, device_label' is the security proof that Alex asked
> for. This needs to be some device in the ioasidfd that declares it is
> capable of no-snoop. E.g. vfio_pci would always declare it is capable
> of no-snoop.
> 
> No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the
> IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
> is just a kernel-harmless userspace bug.
> 
> Then user space can decide which of the various axes it wants to
> optimize for.

Let's make sure the KVM folks are part of this decision; a re-cap for
them, KVM currently automatically enables wbinvd emulation when
potentially non-coherent devices are present which is determined solely
based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
to essentially force no-snoop transactions from a device to be cache
coherent.  This synchronization is triggered via the kvm-vfio device,
where QEMU creates the device and adds/removes vfio group fd
descriptors as an additional layer to prevent the user from enabling
wbinvd emulation on a whim.

IIRC, this latter association was considered a security/DoS issue to
prevent a malicious guest/userspace from creating a disproportionate
system load.

Where would KVM stand on allowing more direct userspace control of
wbinvd behavior?  Would arbitrary control be acceptable or should we
continue to require it only in association with a device requiring it for
correct operation?

A wrinkle in "correct operation" is that while the IOMMU may be able to
force no-snoop transactions to be coherent, in the scenario described
in the previous reply, the user may intend to use non-coherent DMA
regardless of the IOMMU capabilities due to their own optimization
policy.  There's a whole spectrum here, including aspects we can't
determine around the device driver's intentions to use non-coherent
transactions, the user's policy in trading hypervisor overhead for
cache coherence overhead, etc.  Thanks,

Alex



Re: [PATCH] iommu/amd: Tidy up DMA ops init

2021-06-04 Thread Joerg Roedel
On Thu, Jun 03, 2021 at 02:48:21PM +0100, Robin Murphy wrote:
> As discussed on the report thread, I think it makes most sense to merge
> this as a fix for 5.13 and not worry about any backporting.
> 
>  drivers/iommu/amd/amd_iommu.h |  2 --
>  drivers/iommu/amd/init.c  |  5 -
>  drivers/iommu/amd/iommu.c | 31 +--
>  3 files changed, 13 insertions(+), 25 deletions(-)

Applied for v5.13, thanks Robin et al for the quick work on this
regression.

Robin, do you by chance have a Fixes tag which I can add?

Thanks,

Joerg


Re: [PATCH v3 2/6] ACPI: Move IOMMU setup code out of IORT

2021-06-04 Thread Joerg Roedel
On Thu, Jun 03, 2021 at 09:26:39AM +0200, Jean-Philippe Brucker wrote:
> These are only defined when CONFIG_IOMMU_API is set. IORT uses them inside
> an #ifdef, I can do the same. Maybe moving these two functions to a new
> drivers/acpi/iommu.c would be nicer, though.

Not sure what the ACPI maintainers and reviewers prefer, but I would
just #ifdef the functions and provide stubs in the #else path if
necessary.

Regards,

Joerg


Re: [PATCH] iommu/rockchip: Remove redundant DMA syncs

2021-06-04 Thread Joerg Roedel
On Fri, May 21, 2021 at 02:49:39PM +0100, Robin Murphy wrote:
> When we allocate new pagetable pages, we don't modify them between the
> initial dma_map_single() call and the dma_sync_single_for_device() call
> via rk_iommu_flush(), so the latter is entirely pointless.
> 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/rockchip-iommu.c | 3 ---
>  1 file changed, 3 deletions(-)

Applied, thanks Robin.


Re: [PATCH v7 0/4] Add IOMMU driver for rk356x

2021-06-04 Thread Joerg Roedel
On Fri, Jun 04, 2021 at 04:54:58PM +0200, Joerg Roedel wrote:
> On Tue, May 25, 2021 at 02:15:47PM +0200, Benjamin Gaignard wrote:
> > Benjamin Gaignard (4):
> >   dt-bindings: iommu: rockchip: Convert IOMMU to DT schema
> >   dt-bindings: iommu: rockchip: Add compatible for v2
> >   iommu: rockchip: Add internal ops to handle variants
> 
> Applied patches 1-3, thanks.

Hmm, no, unapplied. Your separate patch 4 came before the kbuild-bot
reports. Can you fix those please and submit a v8?

Thanks,

Joerg


Re: [PATCH v7 0/4] Add IOMMU driver for rk356x

2021-06-04 Thread Joerg Roedel
On Tue, May 25, 2021 at 02:15:47PM +0200, Benjamin Gaignard wrote:
> Benjamin Gaignard (4):
>   dt-bindings: iommu: rockchip: Convert IOMMU to DT schema
>   dt-bindings: iommu: rockchip: Add compatible for v2
>   iommu: rockchip: Add internal ops to handle variants

Applied patches 1-3, thanks.


Re: [PATCH] iommu/amd: Remove redundant assignment of err

2021-06-04 Thread Joerg Roedel
On Wed, May 19, 2021 at 11:37:27AM +0800, Shaokun Zhang wrote:
> 'err' will be initialized later, so clean up the redundant initialization.
> 
> Cc: Joerg Roedel  
> Signed-off-by: Shaokun Zhang 

Applied, thanks.


Re: [RESEND PATCH v2] iommu/amd: Fix extended features logging

2021-06-04 Thread Joerg Roedel
On Tue, May 04, 2021 at 01:22:20PM +0300, Alexander Monakov wrote:
> Fixes: 9a295ff0ffc9 ("iommu/amd: Print extended features in one line to fix 
> divergent log levels")
> Link: 
> https://lore.kernel.org/lkml/alpine.lnx.2.20.13.2104112326460.11...@monopod.intra.ispras.ru
> Signed-off-by: Alexander Monakov 
> Cc: Paul Menzel 
> Cc: Joerg Roedel 
> Cc: Suravee Suthikulpanit 
> Cc: iommu@lists.linux-foundation.org
> ---
>  drivers/iommu/amd/init.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied, thanks. And sorry for the delay.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 06:08:28AM +, Tian, Kevin wrote:

> In Qemu case the problem is that it doesn't know the list of devices
> that will be attached to an IOASID when it's created. This is a guest-
> side knowledge which is conveyed one device at a time to Qemu 
> though vIOMMU.

At least for the guest side it is a lot simpler because the vIOMMU
being emulated will define nearly everything. 

qemu will just have to ask the kernel for whatever it is the guest is
doing. If the kernel can't do it then qemu has to SW emulate.

The no-snoop block may be the only thing that is under qemu's control
because it is transparent to the guest.

This will probably become clearer as people start to define what the
get_info should return.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 02.06.21 19:24, Jason Gunthorpe wrote:
> 
> Hi,
> 
> >> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
> >> to other APIs / devices. Why can't the fd be acquired by the
> >> consumer APIs (eg. kvm, vfio, etc) ?
> >
> > /dev/ioasid would be similar to /dev/vfio, and everything already
> > deals with exposing /dev/vfio and /dev/vfio/N together
> >
> > I don't see it as a problem, just more work.
> 
> One of the problems I'm seeing is in container environments: when
> passing in a vfio device, we now also need to pass in /dev/ioasid,
> thus increasing the complexity in container setup (or orchestration).

Containers already needed to do this today. Container orchestration is
hard.

> And in such scenarios you usually want to pass in one specific device,
> not all of the same class, and usually orchestration shall pick the
> next free one.
> 
> Can we make sure that a process having full access to /dev/ioasid
> while only supposed to have access to specific consumer devices, can't do
> any harm (eg. influencing other containers that might use a different
> consumer device) ?

Yes, /dev/ioasid shouldn't do anything unless you have a device to
connect it with. In this way it is probably safe to stuff it into
every container.

> > Having FDs spawn other FDs is pretty ugly, it defeats the "everything
> > is a file" model of UNIX.
> 
> Unfortunately, this is already defeated in many other places :(
> (I'd even claim that ioctls already break it :p)

I think you are reaching a bit :)

> It seems your approach also breaks this, since we now need to open two
> files in order to talk to one device.

It is two devices, thus two files.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 08:38:26AM +, Tian, Kevin wrote:
> > I think more to drive the replacement design; if we can't figure out
> > how to do something other than backwards compatibility trickery in the
> > kernel, it's probably going to bite us.  Thanks,
> 
> I'm a bit lost on the desired flow in your minds. Here is one flow based
> on my understanding of this discussion. Please comment whether it
> matches your thinking:
> 
> 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> 
> 1) Qemu binds dev1 to ioasid_fd;
> 
> 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
>  CACHE info i.e. whether underlying IOMMU can enforce snoop;
> 
> 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> needs to figure out whether dev1 wants to do no-snoop. This might
> be based on a fixed vendor/class list or specified by the user;
> 
> 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
>  flag is specified to decide the page table format, which is supposed
>  to match dev1;

> 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
>  point, specify snoop/no-snoop again. If not supported by related 
>  iommu or different from what gpa_ioasid has, attach fails.

Why do we need to specify it again?

If the IOASID was created with the "block no-snoop" flag then it is
blocked in that IOASID, and that blocking sets the page table format.

The only question is if we can successfully attach a device to the
page table, or not.

The KVM interface is a bit tricky because Alex said this is partially
security, wbinvd is only enabled if someone has a FD to a device that
can support no-snoop. 

Personally I think this got way too complicated, the KVM interface
should simply be

ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)

and let qemu sort it out based on command flags, detection, whatever.

'ioasidfd, device_label' is the security proof that Alex asked
for. This needs to be some device in the ioasidfd that declares it is
capable of no-snoop. E.g. vfio_pci would always declare it is capable
of no-snoop.

No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the
IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
is just a kernel-harmless userspace bug.

Then user space can decide which of the various axes it wants to
optimize for.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:

> Could you clarify "vfio_driver"?  

This is the thing providing the vfio_device_ops function pointers.

So vfio-pci can't know anything about this (although your no-snoop
control probing idea makes sense to me)

But vfio_mlx5_pci can know

So can mdev_idxd

And kvmgt

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 9:05 PM
> > 
> > > >
> > > > 3) Device accepts any PASIDs from the guest. No
> > > >vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > 4) Device accepts any PASID from the guest and has an
> > > >internal vPASID/pPASID translation (enhanced vfio_pci)
> > >
> > > what is enhanced vfio_pci? In my writing this is for mdev
> > > which doesn't support ENQCMD
> > 
> > This is a vfio_pci that mediates some element of the device interface
> > to communicate the vPASID/pPASID table to the device, using Max's
> > series for vfio_pci drivers to inject itself into VFIO.
> > 
> > For instance a device might send a message through the PF that the VF
> > has a certain vPASID/pPASID translation table. This would be useful
> > for devices that cannot use ENQCMD but still want to support migration
> > and thus need vPASID.
> 
> I still don't quite get. If it's a PCI device why is PASID translation 
> required?
> Just delegate the per-RID PASID space to user as type-3; then migrating the
> vPASID space is straightforward.

This is only possible if we get rid of the global pPASID allocation
(honestly is my preference as it makes the HW a lot simpler)

Without vPASID the migration would need pPASIDs on the RID that are
guaranteed free.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:

> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIOMMU
> setup (is there PASID, PRI, which page table formats are supported,

I think David's point was that qemu should be told what vIOMMU it is
emulating exactly (right down to what features it has) and then
the goal is simply to match what the vIOMMU needs with direct HW
support via /dev/ioasid and fall back to SW emulation when not
possible.

If qemu wants to have some auto-configuration: 'pass host IOMMU
capabilities' similar to the CPU flags then qemu should probe the
/dev/ioasid - and maybe we should just return some highly rolled up
"this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
doing this.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
> > nor do any virtio drivers implement the required platform specific
> > cache flushing to make no-snoop TLPs work.
> 
> I don't get why virtio drivers needs to do that. I think DMA API should hide
> those arch/platform specific stuffs from us.

It is not arch/platform stuff. If the device uses no-snoop then a
very platform specific recovery is required in the device driver.

It is not part of the normal DMA API, it is side APIs like
flush_agp_cache() or wbinvd() that are used by GPU drivers only.

If drivers/virtio doesn't explicitly call these things it doesn't
support no-snoop - hence no VDPA device can ever use no-snoop.

Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
nothing to do with no-snoop.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Enrico Weigelt, metux IT consult

On 02.06.21 19:24, Jason Gunthorpe wrote:

Hi,

>> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
>> to other APIs / devices. Why can't the fd be acquired by the
>> consumer APIs (eg. kvm, vfio, etc) ?
>
> /dev/ioasid would be similar to /dev/vfio, and everything already
> deals with exposing /dev/vfio and /dev/vfio/N together
>
> I don't see it as a problem, just more work.

One of the problems I'm seeing is in container environments: when
passing in a vfio device, we now also need to pass in /dev/ioasid,
thus increasing the complexity in container setup (or orchestration).

And in such scenarios you usually want to pass in one specific device,
not all of the same class, and usually orchestration shall pick the
next free one.

Can we make sure that a process having full access to /dev/ioasid
while only supposed to have access to specific consumer devices, can't do
any harm (eg. influencing other containers that might use a different
consumer device) ?

Note that we don't have device namespaces yet (device isolation still
has to be done w/ complicated bpf magic). I'm already working on that,
but even "simple" things like loopdev allocation turn out to be not
entirely easy.

> Having FDs spawn other FDs is pretty ugly, it defeats the "everything
> is a file" model of UNIX.

Unfortunately, this is already defeated in many other places :(
(I'd even claim that ioctls already break it :p)

It seems your approach also breaks this, since we now need to open two
files in order to talk to one device.

By the way: my idea does keep the "everything's a file" concept - we
just have a file that allows opening "sub-files". Well, it would be
better if devices could also have directory semantics.


--mtx

---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jean-Philippe Brucker
On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > But it would certainly be possible for a system to have two
> > > different host bridges with two different IOMMUs with different
> > > pagetable formats.  Until you know which devices (and therefore
> > > which host bridge) you're talking about, you don't know what formats
> > > of pagetable to accept.  And if you have devices from *both* bridges
> > > you can't bind a page table at all - you could theoretically support
> > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > in both formats, but it would be pretty reasonable not to support
> > > that.
> > 
> > The basic process for a user space owned pgtable mode would be:
> > 
> >  1) qemu has to figure out what format of pgtable to use
> > 
> > Presumably it uses query functions using the device label.
> 
> No... in the qemu case it would always select the page table format
> that it needs to present to the guest.  That's part of the
> guest-visible platform that's selected by qemu's configuration.
> 
> There's no negotiation here: either the kernel can supply what qemu
> needs to pass to the guest, or it can't.  If it can't qemu, will have
> to either emulate in SW (if possible, probably using a kernel-managed
> IOASID to back it) or fail outright.
> 
> > The
> > kernel code should look at the entire device path through all the
> > IOMMU HW to determine what is possible.
> > 
> > Or it already knows because the VM's vIOMMU is running in some
> > fixed page table format, or the VM's vIOMMU already told it, or
> > something.
> 
> Again, I think you have the order a bit backwards.  The user selects
> the capabilities that the vIOMMU will present to the guest as part of
> the qemu configuration.  Qemu then requests that of the host kernel,
> and either the host kernel supplies it, qemu emulates it in SW, or
> qemu fails to start.

Hm, how fine a capability are we talking about?  If it's just "give me
VT-d capabilities" or "give me Arm capabilities" that would work, but
probably isn't useful. Anything finer will be awkward because userspace
will have to try combinations of capabilities to see what sticks, and
supporting new hardware will drop compatibility for older ones.

For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
completely changes the capabilities offered to the guest (some v2
implementations support nesting page tables, but never PASID nor PRI
unlike v3.) The same vIOMMU could support either, presenting different
capabilities to the guest, even multiple page table formats if we wanted
to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
needs to know early on what the hardware is precisely. Then some new page
table format shows up and, although the vIOMMU can support that in
addition to older ones, QEMU will have to pick a single one, that it
assumes the guest knows how to drive?

I think once it binds a device to an IOASID fd, QEMU will want to probe
what hardware features are available before going further with the vIOMMU
setup (is there PASID, PRI, which page table formats are supported,
address size, page granule, etc). Obtaining precise information about the
hardware would be less awkward than trying different configurations until
one succeeds. Binding an additional device would then fail if its pIOMMU
doesn't support exactly the features supported for the first device,
because we don't know which ones the guest will choose. QEMU will have to
open a new IOASID fd for that device.

Thanks,
Jean

> 
> Guest visible properties of the platform never (or *should* never)
> depend implicitly on host capabilities - it's impossible to sanely
> support migration in such an environment.
> 
> >  2) qemu creates an IOASID and based on #1 and says 'I want this format'
> 
> Right.
> 
> >  3) qemu binds the IOASID to the device. 
> > 
> > If qemu gets it wrong then it just fails.
> 
> Right, though it may be fall back to (partial) software emulation.  In
> practice that would mean using a kernel-managed IOASID and walking the
> guest IO pagetables itself to mirror them into the host kernel.
> 
> >  4) For the next device qemu would have to figure out if it can re-use
> > an existing IOASID based on the required properties.
> 
> Nope.  Again, what devices share an IO address space is a guest
> visible part of the platform.  If the host kernel can't supply that,
> then qemu must not start (or fail the hotplug if the new device is
> being hotplugged).


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, June 4, 2021 4:42 AM
> 
> > 'qemu --allow-no-snoop' makes more sense to me
> 
> I'd be tempted to attach it to the -device vfio-pci option, it's
> specific drivers for specific devices that are going to want this and
> those devices may not be permanently attached to the VM.  But I see in
> the other thread you're trying to optimize IOMMU page table sharing.
> 
> There's a usability question in either case though and I'm not sure how
> to get around it other than QEMU or the kernel knowing a list of
> devices (explicit IDs or vendor+class) to select per device defaults.
> 

"-device vfio-pci" is a per-device option, which implies that the
no-snoop choice is given to the admin then no need to maintain 
a fixed device list in Qemu?

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Friday, June 4, 2021 4:18 PM
> 
> On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote:
> > > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > > specify a device label. This label will be recorded in /dev/iommu to
> > > > serve per-device invalidation request from and report per-device
> > > > fault data to the user.
> > >
> > > I wonder which of the user providing a 64 bit cookie or the kernel
> > > returning a small IDA is the best choice here? Both have merits
> > > depending on what qemu needs..
> >
> > Yes, either way can work. I don't have a strong preference. Jean?
> 
> I don't see an issue with either solution, maybe it will show up while
> prototyping. First one uses IDs that do mean something for someone, and
> userspace may inject faults slightly faster since it doesn't need an
> ID->vRID lookup, so that's my preference.

ok, will go for the first option in v2.

> 
> > > > In addition, vPASID (if provided by user) will
> > > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > > is conducted properly. e.g. invalidation request from user carries
> > > > a vPASID which must be converted into pPASID before calling iommu
> > > > driver. Vice versa for raw fault data which carries pPASID while the
> > > > user expects a vPASID.
> > >
> > > I don't think the PASID should be returned at all. It should return
> > > the IOASID number in the FD and/or a u64 cookie associated with that
> > > IOASID. Userspace should figure out what the IOASID & device
> > > combination means.
> >
> > This is true for Intel. But what about ARM which has only one IOASID
> > (pasid table) per device to represent all guest I/O page tables?
> 
> In that case vPASID = pPASID though. The vPASID allocated by the guest is
> the same from the vIOMMU inval to the pIOMMU inval. I don't think host
> kernel or userspace need to alter it.
> 

yes. So, responding to Jason's earlier comment, we do need to return
the PASID (although no conversion is required) to userspace in this
case. 😊

Thanks
Kevin

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Friday, June 4, 2021 5:44 AM
> 
> > Based on that observation we can say as soon as the user wants to use
> > an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> > share the IO page table with IOMMUs that do support DMA_PTE_SNP.

Page table sharing between incompatible IOMMUs is not a critical
thing. I prefer disallowing sharing in such cases as the starting point,
i.e. the user needs to create separate IOASIDs for such devices.

> 
> If your goal is to prioritize IO page table sharing, sure.  But because
> we cannot atomically transition from one to the other, each device is
> stuck with the pages tables it has, so the history of the VM becomes a
> factor in the performance characteristics.
> 
> For example if device {A} is backed by an IOMMU capable of blocking
> no-snoop and device {B} is backed by an IOMMU which cannot block
> no-snoop, then booting VM1 with {A,B} and later removing device {B}
> would result in ongoing wbinvd emulation versus a VM2 only booted with
> {A}.
> 
> Type1 would use separate IO page tables (domains/ioasids) for these such
> that VM1 and VM2 have the same characteristics at the end.
> 
> Does this become user defined policy in the IOASID model?  There's
> quite a mess of exposing sufficient GET_INFO for an IOASID for the user
> to know such properties of the IOMMU, plus maybe we need mapping flags
> equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
> IOASID that could generate IOMMU faults, etc.

IOMMU_CACHE is a fixed attribute given an IOMMU. So it's better to
convey this info to userspace via GET_INFO for a device_label, before 
creating any IOASID. But overall I agree that careful thinking is required
about how to organize this info reporting (per-fd, per-device, per-ioasid)
to userspace.

> 
> > > > It doesn't solve the problem to connect kvm to AP and kvmgt though
> > >
> > > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > > the KVM fd to each device.  I think some devices might currently fail
> > > their open callback if that linkage isn't already available though, so
> > > it's not clear when that should happen, ie. it can't currently be a
> > > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > > proposal requires some availability of the vfio device fd without any
> > > setup, so presumably that won't yet call the driver open callback.
> > > Maybe that's part of the attach phase now... I'm not sure, it's not
> > > clear when the vfio device uAPI starts being available in the process
> > > of setting up the ioasid.  Thanks,
> >
> > At a certain point we maybe just have to stick to backward compat, I
> > think. Though it is useful to think about green field alternates to
> > try to guide the backward compat design..
> 
> I think more to drive the replacement design; if we can't figure out
> how to do something other than backwards compatibility trickery in the
> kernel, it's probably going to bite us.  Thanks,
> 

I'm a bit lost on the desired flow in your minds. Here is one flow based
on my understanding of this discussion. Please comment whether it
matches your thinking:

0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;

1) Qemu binds dev1 to ioasid_fd;

2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry
 IOMMU_CACHE info, i.e. whether the underlying IOMMU can enforce snoop;

3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
needs to figure out whether dev1 wants to do no-snoop. This might
be based on a fixed vendor/class list or specified by the user;

4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
 flag is specified to decide the page table format, which is supposed
 to match dev1;

5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this 
 point, specify snoop/no-snoop again. If not supported by the related
 IOMMU or different from what gpa_ioasid has, the attach fails.

6) Call KVM to update the snoop requirement via KVM_UPDATE_IOASID_FD.
This triggers ioasidfd_for_each_ioasid();

Later when dev2 is attached to gpa_ioasid, the same flow is followed. This
implies that KVM_UPDATE_IOASID_FD is called only when a new IOASID is
created or an existing IOASID is destroyed, because all devices under an
IOASID should have the same snoop requirement.

Thanks
Kevin
 


Re: [PATCH v10 1/3] iommu: Enhance IOMMU default DMA mode build options

2021-06-04 Thread John Garry



  
  	  If unsure, say N here.

+choice
+   prompt "IOMMU default DMA mode"
+   depends on IOMMU_API

config INTEL_IOMMU
 depends on PCI_MSI && ACPI && (X86 || IA64)

config AMD_IOMMU
 depends on X86_64 && PCI && ACPI && HAVE_CMPXCHG_DOUBLE

config ARM_SMMU_V3
 depends on ARM64

"depends on ARM_SMMU_V3 || INTEL_IOMMU || AMD_IOMMU" may need to be added, 
because it doesn't work on other platforms.

or "depends on X86 || IA64 || X86_64 || ARM64"


I suppose so. But I don't think that any other archs use the value from 
iommu_dma_strict anyway.





+
+   default IOMMU_DEFAULT_STRICT
+   help
+ This option allows an IOMMU DMA mode to be chosen at build time, to
+ override the default DMA mode of each ARCH, removing the need to
+ pass in kernel parameters through the command line. It is still possible
+ to provide ARCH-specific or common boot options to override this
+ option.
+
+ If unsure, keep the default.
+
+config IOMMU_DEFAULT_LAZY
+   bool "lazy"
+   help
+ Support lazy mode, where for every IOMMU DMA unmap operation, the
+ flush operation of IOTLB and the free operation of IOVA are deferred.
+ They are only guaranteed to be done before the related IOVA will be
+ reused.
+
+config IOMMU_DEFAULT_STRICT
+   bool "strict"
+   help
+ For every IOMMU DMA unmap operation, the flush operation of IOTLB and
+ the free operation of IOVA are guaranteed to be done in the unmap
+ function.
+
+ This mode is safer than the two above, but it may be slower in some
+ high performance scenarios.
+
+endchoice
+
  config OF_IOMMU
def_bool y
depends on OF && IOMMU_API
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 966426a96520..177b0dafc535 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -29,7 +29,8 @@ static struct kset *iommu_group_kset;
  static DEFINE_IDA(iommu_group_ida);
  
  static unsigned int iommu_def_domain_type __read_mostly;

-static bool iommu_dma_strict __read_mostly = true;
+static bool iommu_dma_strict __read_mostly =
+   IS_ENABLED(CONFIG_IOMMU_DEFAULT_STRICT);


Currently, a maximum of 100 columns is allowed in a row.


Sure, but some people still prefer limiting to 80, and the coding-style
doc mentions a preference for 80.

But if you really think I should change, then that's ok.

Thanks,
John


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jean-Philippe Brucker
On Wed, Jun 02, 2021 at 01:25:00AM +, Tian, Kevin wrote:
> > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > specify a device label. This label will be recorded in /dev/iommu to
> > > serve per-device invalidation request from and report per-device
> > > fault data to the user.
> > 
> > I wonder which of the user providing a 64 bit cookie or the kernel
> > returning a small IDA is the best choice here? Both have merits
> > depending on what qemu needs..
> 
> Yes, either way can work. I don't have a strong preference. Jean?

I don't see an issue with either solution, maybe it will show up while
prototyping. First one uses IDs that do mean something for someone, and
userspace may inject faults slightly faster since it doesn't need an
ID->vRID lookup, so that's my preference.

> > > In addition, vPASID (if provided by user) will
> > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > is conducted properly. e.g. invalidation request from user carries
> > > a vPASID which must be converted into pPASID before calling iommu
> > > driver. Vice versa for raw fault data which carries pPASID while the
> > > user expects a vPASID.
> > 
> > I don't think the PASID should be returned at all. It should return
> > the IOASID number in the FD and/or a u64 cookie associated with that
> > IOASID. Userspace should figure out what the IOASID & device
> > combination means.
> 
> This is true for Intel. But what about ARM which has only one IOASID
> (pasid table) per device to represent all guest I/O page tables?

In that case vPASID = pPASID though. The vPASID allocated by the guest is
the same from the vIOMMU inval to the pIOMMU inval. I don't think host
kernel or userspace need to alter it.

Thanks,
Jean


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, June 3, 2021 8:41 PM
> 
> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
> 
> It is not device specific, it is driver specific
> 
> As I said before, the question is if the IOASID itself can enforce
> snoop, or not. AND if the device will issue no-snoop or not.

Sure. My earlier comment was based on the assumption that all IOASIDs
attached to a device should inherit the same snoop/no-snoop fact. But it
looks like it doesn't prevent a device driver from setting PTE_SNP only
for selected I/O page tables, according to whether isoch agents are involved.

A user space driver could figure out per-IOASID requirements itself.

A guest device driver can indirectly convey this information through 
vIOMMU.

Registering {IOASID_FD, IOASID} to KVM has another merit, as we also
need it to update CPU PASID mapping for ENQCMD. We can define
one interface for both requirements. 😊

Thanks
Kevin