RE: [PATCH v1 1/2] vfio/pci: Expose PCIe PASID capability to guest

2020-03-30 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Tuesday, March 31, 2020 2:39 PM
> To: Liu, Yi L ; alex.william...@redhat.com;
> Subject: RE: [PATCH v1 1/2] vfio/pci: Expose PCIe PASID capability to guest
> 
> > From: Liu, Yi L 
> > Sent: Sunday, March 22, 2020 8:33 PM
> >
> > From: Liu Yi L 
> >
> > This patch exposes the PCIe PASID capability to the guest. The existing
> > vfio_pci driver hides it from the guest by setting the capability length
> > to 0 in pci_ext_cap_length[].
> >
> > This capability is required for vSVA enabling on pass-through PCIe
> > devices.
> 
> should this be [PATCH 2/2], after you have the emulation in place?

oh, yes, I can re-sequence it.

> and it might be worth noting that PRI is already exposed, to avoid
> confusion for someone like me wondering why two capabilities are
> emulated in this series while only one is being exposed. 😊

got it. it would be helpful. thanks.

Regards,
Yi Liu

RE: [PATCH v1 1/2] vfio/pci: Expose PCIe PASID capability to guest

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:33 PM
> 
> From: Liu Yi L 
> 
> This patch exposes the PCIe PASID capability to the guest. The existing
> vfio_pci driver hides it from the guest by setting the capability length
> to 0 in pci_ext_cap_length[].
> 
> This capability is required for vSVA enabling on pass-through PCIe
> devices.

should this be [PATCH 2/2], after you have the emulation in place?

and it might be worth noting that PRI is already exposed, to avoid
confusion for someone like me wondering why two capabilities are
emulated in this series while only one is being exposed. 😊

> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c
> b/drivers/vfio/pci/vfio_pci_config.c
> index 90c0b80..4b9af99 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -95,7 +95,7 @@ static const u16
> pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
>   [PCI_EXT_CAP_ID_LTR]=   PCI_EXT_CAP_LTR_SIZEOF,
>   [PCI_EXT_CAP_ID_SECPCI] =   0,  /* not yet */
>   [PCI_EXT_CAP_ID_PMUX]   =   0,  /* not yet */
> - [PCI_EXT_CAP_ID_PASID]  =   0,  /* not yet */
> + [PCI_EXT_CAP_ID_PASID]  =   PCI_EXT_CAP_PASID_SIZEOF,
>  };
> 
>  /*
> --
> 2.7.4
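
For context on the hiding mechanism, the table entry is the only gate. A
paraphrased, simplified excerpt of the ecap scan in
drivers/vfio/pci/vfio_pci_config.c (the real vfio_ecap_init() also relinks
the capability chain; details vary by kernel version):

	len = pci_ext_cap_length[ecap];
	if (!len) {
		/* a zero entry means "not handled yet": the extended
		 * capability header is virtualized away, so the guest's
		 * ecap walk never sees it */
		pci_dbg(pdev, "hiding ecap %#x@%#x\n", ecap, epos);
	}

so flipping the PASID entry from 0 to PCI_EXT_CAP_PASID_SIZEOF is the
whole switch this patch makes.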


RE: [PATCH v1 0/2] vfio/pci: expose device's PASID capability to VMs

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:33 PM
> 
> From: Liu Yi L 
> 
> Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
> Intel platforms, allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> To enable SVA, a device needs to have the PASID capability, which is a
> key capability for SVA. This patchset exposes the device's PASID
> capability to the guest instead of hiding it.
> 
> The second patch emulates the PASID capability for VFs (Virtual
> Functions), since VFs don't implement this capability per the PCIe spec.
> It emulates the capability and exposes it to the VM if the capability is
> enabled in the PF (Physical Function).
> 
> However, there is an open issue with the PASID emulation: if the PF
> driver disables the PASID capability at runtime, that may be a problem.
> e.g. the PF should not disable the PASID capability while a guest is
> using this capability on any VF related to this PF. To solve this, we may
> need to introduce a generic communication framework between the vfio-pci
> driver and PF drivers. Please feel free to give your suggestions on it.

I'm not sure how this is addressed on bare metal today, i.e. between normal
kernel PF and VF drivers. I looked at the pasid enable/disable code in
intel-iommu.c. There is no check on the PF/VF dependency so far. The cap is
toggled when attaching/detaching the PF to/from its domain. Let's see how
the IOMMU guys respond, and if there is a way for a VF driver to block the
PF driver from disabling the pasid cap while it's being actively used by
the VF driver, then we may leverage the same trick in VFIO when the
emulation is provided to the guest.
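
For reference, the bare-metal toggle looks roughly like this (paraphrasing
the intel-iommu.c attach path; details vary by kernel version):

	static void iommu_enable_dev_iotlb(struct device_domain_info *info)
	{
		struct pci_dev *pdev = to_pci_dev(info->dev);

		/* enables PASID on the PF with no check whether any VF
		 * (or VF user) still depends on it */
		if (info->pasid_supported &&
		    !pci_enable_pasid(pdev, info->pasid_supported & ~1))
			info->pasid_enabled = 1;
	}

and the detach path calls pci_disable_pasid() just as unconditionally.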

Thanks
Kevin

> 
> Regards,
> Yi Liu
> 
> Changelog:
>   - RFC v1 -> Patch v1:
> Add CONFIG_PCI_ATS #ifdef control to avoid compiling error.
> 
> Liu Yi L (2):
>   vfio/pci: Expose PCIe PASID capability to guest
>   vfio/pci: Emulate PASID/PRI capability for VFs
> 
>  drivers/vfio/pci/vfio_pci_config.c | 327 -
>  1 file changed, 324 insertions(+), 3 deletions(-)
> 
> --
> 2.7.4



RE: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 12:08 AM
> 
> On Mon, 30 Mar 2020 05:40:40 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 28, 2020 7:54 AM
> > >
> > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > Christoph Hellwig  wrote:
> > >
> > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin wrote:
> > > > > If those API calls are inter-dependent for composing a feature
> > > > > (e.g. SVA), shouldn't we need a way to check them together
> > > > > before exposing the feature to the guest, e.g. through a
> > > > > iommu_get_uapi_capabilities interface?
> > > >
> > > > Yes, that makes sense.  The important bit is to have a capability
> > > > flags and not version numbers.
> > >
> > > The challenge is that there are two consumers in the kernel for
> > > this. 1. VFIO only look for compatibility, and size of each data
> > > struct such that it can copy_from_user.
> > >
> > > 2. IOMMU driver, the "real consumer" of the content.
> > >
> > > For 2, I agree and we do plan to use the capability flags to check
> > > content and maintain backward compatibility etc.
> > >
> > > For VFIO, it is difficult to do size look up based on capability
> > > flags.
> >
> > Can you elaborate the difficulty in VFIO? If, as Christoph Hellwig
> > pointed out, version numbers are already avoided everywhere, it is
> > interesting to know whether this work is a real exception
> > or just requires a different mindset.
> >
> From VFIO's p.o.v. the IOMMU UAPI data is opaque; it only needs to do two
> things:
> 1. is the UAPI compatible?
> 2. what is the size to copy?
> 
> If you look at the version number, this is really a "version as size"
> lookup, as provided by the helper function in this patch. An example
> is the newly introduced clone3 syscall.
> https://lwn.net/Articles/792628/
> In clone3, a new version must have a new size. The slight difference here
> is that, unlike clone3, we have multiple data structures instead of a
> single struct clone_args {}. And each struct has flags to enumerate its
> contents besides size.

Thanks for providing that link. However, clone3 doesn't include a version
field to do a "version as size" lookup. Instead, as you said, it includes
a size parameter, which sounds like option 3 (argsz) listed below.
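
For readers who haven't looked at clone3, a minimal sketch of its
size-as-version scheme (illustrative userspace code, assuming recent
kernel/glibc headers; not from this thread):

	#define _GNU_SOURCE
	#include <linux/sched.h>   /* struct clone_args */
	#include <signal.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		struct clone_args args = { .exit_signal = SIGCHLD };

		/* the explicit size tells the kernel which revision of the
		 * struct the caller was built against; a kernel that doesn't
		 * know the extra fields rejects a larger size with -E2BIG
		 * unless the unknown tail is all zero */
		pid_t pid = syscall(SYS_clone3, &args, sizeof(args));

		return pid < 0 ? 1 : 0;
	}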

> 
> Besides breaching data abstraction, if VFIO has to check IOMMU flags to
> determine the sizes, there are too many combinations to handle.
> 
> We also separate the responsibilities into two parts
> 1. compatibility - version, size by VFIO
> 2. sanity check - capability flags - by IOMMU

I feel the argsz+flags approach can perfectly meet the above requirement.
Userspace sets the size and flags for whatever capabilities it uses, and
VFIO simply copies the parameters by size and passes them to the IOMMU for
further sanity checking. Of course the assumption is that we do provide
an interface for userspace to enumerate all supported capabilities.

Is there anything that I overlooked here? I suppose there might be
some difficulties that block you from going the argsz way...
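
To make that concrete, a minimal sketch of the argsz pattern as VFIO
already uses it for its other ioctls (all names below are made up for
illustration; nothing from this series):

	struct example_iommu_op {
		__u32 argsz;	/* userspace: sizeof() the struct it built */
		__u32 flags;	/* capability bits it intends to use */
		__u8  data[];	/* opaque payload handed to the IOMMU */
	};

	static long example_ioctl(void __user *uarg)
	{
		struct example_iommu_op hdr;
		unsigned long minsz = offsetofend(struct example_iommu_op, flags);

		if (copy_from_user(&hdr, uarg, minsz))
			return -EFAULT;
		if (hdr.argsz < minsz)
			return -EINVAL;	/* caller too old to even carry flags */
		/* copy hdr.argsz bytes total; the IOMMU driver then sanity
		 * checks hdr.flags against what it advertised */
		return 0;
	}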

Thanks
Kevin

> 
> I think the latter matches what Christoph's comments. So we are in
> agreement at the IOMMU level :)
> 
> For example:
> During guest PASID bind, IOMMU driver operates on the data passed from
> VFIO and check format & flags to take different code path.
> 
> #define IOMMU_PASID_FORMAT_INTEL_VTD  1
>   __u32 format;
> #define IOMMU_SVA_GPASID_VAL  (1 << 0) /* guest PASID valid */
>   __u64 flags;
> 
> Jacob
> 
> > btw the most relevant discussion which I can find out now is here:
> > https://lkml.org/lkml/2020/2/3/1126
> >
> > It mentioned 3 options for handling extension:
> > --
> > 1. Disallow adding new members to each structure other than reuse
> > padding bits or adding union members at the end.
> > 2. Allow extension of the structures beyond union, but union size has
> > to be fixed with reserved spaces
> > 3. Adopt VFIO argsz scheme, I don't think we need version for each
> > struct anymore. argsz implies the version that user is using assuming
> > UAPI data is extension only.
> > --
> >
> > the first two are both version-based. Looks most guys agreed with
> > option-1 (in this v2), but Alex didn't give his opinion at the
> > moment. The last response from him was the raise of option-3 using
> > argsz to avoid version. So, we also need hear from him. Alex?
> >
> > Thanks
> > Kevin
> 
> [Jacob Pan]


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 10:37 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 4:32 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> >
> > are->is
> 
> thanks.
> 
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the
> > > PCIe PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> >
> > "address space" -> "address spaces"
> >
> > "binding the" -> "binding each"
> 
> will correct both.
> 
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model,
> > > which means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> > >
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> >
> > "Additionally, because such"
> >
> > > requests are intended to be invoked by QEMU or other applications which
> >
> > simplify to "intended to be invoked from userspace"
> 
> got it.
> 
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> >
> > devices -> device
> 
> got it.
> 
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. In the reverse order, vfio_mm is freed when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another
> > > patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Signed-off-by: Liu Yi L 
> > > Signed-off-by: Yi Sun 
> > > Signed-off-by: Jacob Pan 
> > > ---
> > >  drivers/vfio/vfio.c             | 130 
> > >  drivers/vfio/vfio_iommu_type1.c | 104 
> > >  include/linux/vfio.h            |  20 +++
> > >  include/uapi/linux/vfio.h       |  41 +
> > >  4 files changed, 295 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index c848262..d13b483 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #define DRIVER_VERSION   "0.3"
> > >  #define DRIVER_AUTHOR    "Alex Williamson "
> > > @@ -46,6 +47,8 @@ static 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Anthony Yznaga



On 3/30/20 1:42 PM, Alexander Graf wrote:
>
>
> On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:
>>
>>
>>
>> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
>>> On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
>>>>
>>>> On 03/26/20 at 05:29pm, Alexander Graf wrote:
>>>>> The swiotlb is a very convenient fallback mechanism for bounce
>>>>> buffering of DMAable data. It is usually used for the compatibility
>>>>> case where devices can only DMA to a "low region".
>>>>>
>>>>> However, in some scenarios this "low region" may be bound even more
>>>>> heavily. For example, there are embedded systems where only an SRAM
>>>>> region is shared between device and CPU. There are also heterogeneous
>>>>> computing scenarios where only a subset of RAM is cache coherent
>>>>> between the components of the system. There are partitioning
>>>>> hypervisors, where a "control VM" that implements device emulation
>>>>> has limited view into a partition's memory for DMA capabilities due
>>>>> to safety concerns.
>>>>>
>>>>> This patch adds a command line driven mechanism to move all DMA
>>>>> memory into a predefined shared memory region which may or may not be
>>>>> part of the physical address layout of the Operating System.
>>>>>
>>>>> Ideally, the typical path to set this configuration would be through
>>>>> Device Tree or ACPI, but neither of the two mechanisms is
>>>>> standardized yet. Also, in the x86 MicroVM use case, we have neither
>>>>> ACPI nor Device Tree, but instead configure the system purely through
>>>>> kernel command line options.
>>>>>
>>>>> I'm sure other people will find the functionality useful going
>>>>> forward though and extend it to be triggered by DT/ACPI in the
>>>>> future.
>>>>
>>>> Hmm, we have a use case for kdump; this may be useful.  For example,
>>>> swiotlb is enabled by default if AMD SME/SEV is active, and in the
>>>> kdump kernel we have to increase the crashkernel reserved size for the
>>>> extra swiotlb requirement.  I wonder if we can just reuse the old
>>>> kernel's swiotlb region and pass the addr to the kdump kernel.
>>>>
>>>
>>> Yes, definitely helpful for kdump kernel. This can help reduce the
>>> crashkernel value.
>>>
>>> Previously I was thinking about something similar: play around with the
>>> e820 entries passed to kdump and let it place the swiotlb in the wanted
>>> region. Simply remapping it like in this patch looks much cleaner.
>>>
>>> If this patch is acceptable, one more patch is needed to expose the
>>> swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
>>> the second kernel.
>>
>> We seem to be passing a lot of data to kexec.. Perhaps we need something
>> of a unified way, since we seem to have a lot of things to pass - disabling
>> the IOMMU, the ACPI RSDT address, and now this.
>>
>> CC-ing Anthony who is working on something - would you by any chance
>> have a doc on this?
>
>
> I see in general 2 use cases here:
>
>
> 1) Allow for a generic mechanism to have the full system, individual buses,
> devices or functions of a device go through a particular, self-contained
> bounce buffer.
>
> This sounds like the holy grail to a lot of problems. It would solve typical 
> embedded scenarios where you only have a shared SRAM. It solves the safety 
> case (to some extent) where you need to ensure that one device interaction 
> doesn't conflict with another device interaction. It also solves the problem 
> I've tried to solve with the patch here.
>
> It's unfortunately a lot harder than the patch I sent, so it will take me 
> some time to come up with a working patch set.. I suppose starting with a DT 
> binding only is sensible. Worst case, x86 does also support DT ...
>
> (And yes, I'm always happy to review patches if someone else beats me to it)

Not precisely what is described here, but I am working on a somewhat generic 
mechanism for preserving pages across kexec based on this old RFC patch set: 
https://lkml.org/lkml/2013/7/1/211.  I expect to post patches soon.

Anthony

>
>
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump
>
> I see little direct relation to SEV here. The only reason SEV makes it more 
> relevant, is that you need to have an SWIOTLB region available with SEV while 
> without you could live with a disabled IOMMU.
>
> However, I can definitely understand how you would want to have a way to tell 
> the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its memory 
> for its own SWIOTLB. That way, you don't have to reserve another 64MB of RAM 
> for kdump.
>
> What I'm curious on is whether we need to be as elaborate. Can't we just pass 
> the old SWIOTLB as free memory to the new kexec'ed kernel and everything else 
> will fall into place? All that would take is a bit of shuffling on the e820 
> table pass-through to the kexec'ed kernel, no?
>
>
> Thanks,
>
> Alex
>
>
>
>

RE: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 4:52 AM
> 
> On Sat, 28 Mar 2020 08:02:01 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Saturday, March 21, 2020 7:28 AM
> > >
> > > When supporting guest SVA with emulated IOMMU, the guest PASID
> > > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > > will result in PASID cache flush which will be passed down to
> > > the host as bind guest PASID calls.
> > >
> > > For the SL page tables, it will be harvested from device's
> > > default domain (request w/o PASID), or aux domain in case of
> > > mediated device.
> > >
> > > .-------------.  .---------------------------.
> > > |   vIOMMU    |  | Guest process CR3, FL only|
> > > |             |  '---------------------------'
> > > .----------------/
> > > | PASID Entry |--- PASID cache flush -
> > > '-------------'   |
> > > |             |   V
> > > |             |  CR3 in GPA
> > > '-------------'
> > > Guest
> > > ------| Shadow |--------------------------|------
> > >       v        v                          v
> > > Host
> > > .-------------.  .----------------------.
> > > |   pIOMMU    |  | Bind FL for GVA-GPA  |
> > > |             |  '----------------------'
> > > .----------------/  |
> > > | PASID Entry | V (Nested xlate)
> > > '----------------\.------------------------------.
> > > |             |   |SL for GPA-HPA, default domain|
> > > |             |   '------------------------------'
> > > '-------------'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > >
> > > Signed-off-by: Jacob Pan 
> > > Signed-off-by: Liu, Yi L 
> > > ---
> > >  drivers/iommu/intel-iommu.c |   4 +
> > >  drivers/iommu/intel-svm.c   | 224 
> > >  include/linux/intel-iommu.h |   8 +-
> > >  include/linux/intel-svm.h   |  17 
> > >  4 files changed, 252 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > index e599b2537b1c..b1477cd423dd 100644
> > > --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> > >   .dev_disable_feat   = intel_iommu_dev_disable_feat,
> > >   .is_attach_deferred = intel_iommu_is_attach_deferred,
> > >   .pgsize_bitmap      = INTEL_IOMMU_PGSIZES,
> > > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > > + .sva_bind_gpasid= intel_svm_bind_gpasid,
> > > + .sva_unbind_gpasid  = intel_svm_unbind_gpasid,
> > > +#endif
> > >  };
> > >
> > >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > > index d7f2a5358900..47c0deb5ae56 100644
> > > --- a/drivers/iommu/intel-svm.c
> > > +++ b/drivers/iommu/intel-svm.c
> > > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> > >   list_for_each_entry((sdev), &(svm)->devs, list) \
> > >   if ((d) != (sdev)->dev) {} else
> > >
> > > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > > + struct device *dev,
> > > + struct iommu_gpasid_bind_data *data)
> > > +{
> > > + struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > > + struct dmar_domain *ddomain;
> >
> > what about the full name e.g. dmar_domain? though a bit longer
> > but clearer than ddomain.
> >
> Sure, I don't have preference.
> 
> > > + struct intel_svm_dev *sdev;
> > > + struct intel_svm *svm;
> > > + int ret = 0;
> > > +
> > > + if (WARN_ON(!iommu) || !data)
> > > + return -EINVAL;
> > > +
> > > + if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > > + data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > > + return -EINVAL;
> > > +
> > > + if (dev_is_pci(dev)) {
> > > + /* VT-d supports devices with full 20 bit PASIDs
> > > only */
> > > + if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > > + return -EINVAL;
> > > + } else {
> > > + return -ENOTSUPP;
> > > + }
> > > +
> > > + /*
> > > +  * We only check host PASID range, we have no knowledge to
> > > check
> > > +  * guest PASID range nor do we use the guest PASID.
> > > +  */
> > > + if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > > + return -EINVAL;
> > > +
> > > + ddomain = to_dmar_domain(domain);
> > > +
> > > + /* Sanity check paging mode support match between host and
> > > guest */
> > > + if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > > + !cap_5lp_support(iommu->cap)) {
> > > + pr_err("Cannot support 5 level paging requested by guest!\n");
> > > + return -EINVAL;
> > > + }
> >
> > -ENOTSUPP?
> I was thinking from this API's p.o.v. that the input is invalid, since
> both cap and addr_width are derived from input arguments.

ok, suppose the userspace alread

RE: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function

2020-03-30 Thread Tian, Kevin
> From: Jacob Pan 
> Sent: Tuesday, March 31, 2020 2:22 AM
> 
> On Sun, 29 Mar 2020 16:03:36 +0800
> Lu Baolu  wrote:
> 
> > On 2020/3/27 20:21, Tian, Kevin wrote:
> > >> From: Jacob Pan 
> > >> Sent: Saturday, March 21, 2020 7:28 AM
> > >>
> > >> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.
> > >
> > > now the spec is already at rev3.1 😊
> >
> > Updated.
> >
> > >
> > >> With PASID granular translation type set to 0x11b, the translation
> > >> result from the first level (FL) is also subject to a second level (SL)
> > >> page table translation. This mode is used for SVA virtualization,
> > >> where FL performs guest virtual to guest physical translation and
> > >> SL performs guest physical to host physical translation.
> > >>
> > >> This patch adds a helper function for setting up nested translation
> > >> where second level comes from a domain and first level comes from
> > >> a guest PGD.
> > >>
> > >> Signed-off-by: Jacob Pan 
> > >> Signed-off-by: Liu, Yi L 
> > >> ---
> > >>   drivers/iommu/intel-pasid.c | 240 +++-
> > >>   drivers/iommu/intel-pasid.h |  12 +++
> > >>   include/linux/intel-iommu.h |   3 +
> > >>   3 files changed, 252 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > >> index 9bdb7ee228b6..10c7856afc6b 100644
> > >> --- a/drivers/iommu/intel-pasid.c
> > >> +++ b/drivers/iommu/intel-pasid.c
> > >> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
> > >>  pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> > >>   }
> > >>
> > >> +/*
> > >> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Cache Disable (CD) field (Bit 89)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_cd(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_emte(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_eafe(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pcd(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> > >> + * of a scalable mode PASID entry.
> > >> + */
> > >> +static inline void
> > >> +pasid_set_pwt(struct pasid_entry *pe)
> > >> +{
> > >> +pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> > >> +}
> > >> +
> > >>   static void
> > >>   pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> > >>  u16 did, int pasid)
> > >> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct
> > >> intel_iommu *iommu,
> > >>  pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >>
> > >>  /* Setup Present and PASID Granular Transfer Type: */
> > >> -pasid_set_translation_type(pte, 1);
> > >> +pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
> > >>  pasid_set_present(pte);
> > >>  pasid_flush_caches(iommu, pte, pasid, did);
> > >>
> > >> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct
> > >> intel_iommu *iommu,
> > >>  pasid_set_domain_id(pte, did);
> > >>  pasid_set_slptr(pte, pgd_val);
> > >>  pasid_set_address_width(pte, agaw);
> > >> -pasid_set_translation_type(pte, 2);
> > >> +pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
> > >>  pasid_set_fault_enable(pte);
> > >>  pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> > >>
> > >> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct
> > >> intel_iommu *iommu,
> > >>  pasi

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-30 Thread Tian, Kevin
> From: Auger Eric 
> Sent: Monday, March 30, 2020 12:05 AM
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virito-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submit to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Ashok Raj 
> >> Signed-off-by: Liu, Yi L 
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182 
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>    aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> >> + * VT-d granularity. Invalidation is typically included in the unmap
> >> + * operation as a result of DMA or VFIO unmap. However, for assigned
> >> + * devices guest owns the first level page tables. Invalidations of
> >> + * translation caches in the guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second
> >> + * level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int
> >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /*
> >> +   * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +   * page selective (address granularity)
> >> +   */
> >> +  {0, 1, 1},
> >> +  /* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +  {1, 1, 0},
> >
> > Is this combination correct? when single PASID is being specified, it is
> > essentially a page-selective invalidation since you need provide Address
> > and Size.
> >
> >> +  /* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache invalidation
> > is interpreted by vIOMMU for bind and unbind operations. I don't think
> > we should accept any PASID cache invalidation from userspace or guest.
> I tend to agree here.
> >
> >> +  {1, 1, 0}
> >> +};
> >> +
> >> +const static int
> >> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /* PASID based IOTLB */
> >> +  {0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >> +  /* PASID based dev TLBs */
> >> +  {QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> >> +  /* PASID cache */
> >> +  {QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> >> +};
> >> +
> >> +static inline int to_vtd_granularity(int type, int granu, int *vtd_granu)
> >> +{
> >> +  if (type >= IOMMU_CACHE_INV_TYPE_NR || granu >=
> >> IOMMU_INV_GRANU_NR ||
> >> +  !inv_type_granu_map[type][granu])
> >> +  return -EINVAL;
> >> +
> >> +  *vtd_granu = inv_type_granu_table[type][granu];
> >> +
> >
> > btw do we really need both map and table here? Can't we just
> > use one table with unsupported granularity marked as a special
> > value?
> I asked the same question some time ago. If I remember correctly the
> issue is while a granu can be supported in inv_type_granu_map, the
> associated value in inv_type_granu_table can be 0. This typically
> matches both values of G field (0 or 1) in the inv

RE: [PATCH V10 08/11] iommu/vt-d: Add svm/sva invalidate function

2020-03-30 Thread Tian, Kevin
> From: Auger Eric 
> Sent: Sunday, March 29, 2020 11:34 PM
> 
> Hi,
> 
> On 3/28/20 11:01 AM, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> When Shared Virtual Address (SVA) is enabled for a guest OS via
> >> vIOMMU, we need to provide invalidation support at IOMMU API and
> driver
> >> level. This patch adds Intel VT-d specific function to implement
> >> iommu passdown invalidate API for shared virtual address.
> >>
> >> The use case is for supporting caching structure invalidation
> >> of assigned SVM capable devices. Emulated IOMMU exposes queue
> >
> > emulated IOMMU -> vIOMMU, since virtio-iommu could use the
> > interface as well.
> >
> >> invalidation capability and passes down all descriptors from the guest
> >> to the physical IOMMU.
> >>
> >> The assumption is that guest to host device ID mapping should be
> >> resolved prior to calling IOMMU driver. Based on the device handle,
> >> host IOMMU driver can replace certain fields before submit to the
> >> invalidation queue.
> >>
> >> ---
> >> v7 review fixed in v10
> >> ---
> >>
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Ashok Raj 
> >> Signed-off-by: Liu, Yi L 
> >> ---
> >>  drivers/iommu/intel-iommu.c | 182 
> >>  1 file changed, 182 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> >> index b1477cd423dd..a76afb0fd51a 100644
> >> --- a/drivers/iommu/intel-iommu.c
> >> +++ b/drivers/iommu/intel-iommu.c
> >> @@ -5619,6 +5619,187 @@ static void intel_iommu_aux_detach_device(struct iommu_domain *domain,
> >>    aux_domain_remove_dev(to_dmar_domain(domain), dev);
> >>  }
> >>
> >> +/*
> >> + * 2D array for converting and sanitizing IOMMU generic TLB granularity to
> >> + * VT-d granularity. Invalidation is typically included in the unmap
> >> + * operation as a result of DMA or VFIO unmap. However, for assigned
> >> + * devices guest owns the first level page tables. Invalidations of
> >> + * translation caches in the guest are trapped and passed down to the host.
> >> + *
> >> + * vIOMMU in the guest will only expose first level page tables, therefore
> >> + * we do not include IOTLB granularity for request without PASID (second
> >> + * level).
> >
> > I would revise above as "We do not support IOTLB granularity for request
> > without PASID (second level), therefore any vIOMMU implementation that
> > exposes the SVA capability to the guest should only expose the first level
> > page tables, implying all invalidation requests from the guest will include
> > a valid PASID"
> >
> >> + *
> >> + * For example, to find the VT-d granularity encoding for IOTLB
> >> + * type and page selective granularity within PASID:
> >> + * X: indexed by iommu cache type
> >> + * Y: indexed by enum iommu_inv_granularity
> >> + * [IOMMU_CACHE_INV_TYPE_IOTLB][IOMMU_INV_GRANU_ADDR]
> >> + *
> >> + * Granu_map array indicates validity of the table. 1: valid, 0: invalid
> >> + *
> >> + */
> >> +const static int
> >> inv_type_granu_map[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /*
> >> +   * PASID based IOTLB invalidation: PASID selective (per PASID),
> >> +   * page selective (address granularity)
> >> +   */
> >> +  {0, 1, 1},
> >> +  /* PASID based dev TLBs, only support all PASIDs or single PASID */
> >> +  {1, 1, 0},
> >
> > Is this combination correct? when single PASID is being specified, it is
> > essentially a page-selective invalidation since you need provide Address
> > and Size.
> Isn't it the same when G=1? Still the addr/size is used. Doesn't it

I thought addr/size is not used when G=1, but I might be wrong. I'm
checking with our VT-d spec owner.

> correspond to IOMMU_INV_GRANU_ADDR with
> IOMMU_INV_ADDR_FLAGS_PASID flag
> unset?
> 
> so {0, 0, 1}?

I have one more open:

How does userspace know which invalidation type/granularity is supported?
I didn't see such capability reporting in Yi's VFIO vSVA patch set. Do we
want the user and kernel to assume the same capability set if it is
architectural? However, the kernel could also do some optimization, e.g.
hide the devtlb invalidation capability, given that the kernel already
invalidates the devtlb automatically when serving an iotlb invalidation...
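
Just to make the ask concrete, capability reporting could reuse the
existing VFIO info-chain pattern (struct vfio_info_cap_header is the real
VFIO uapi header; everything else below is a hypothetical sketch, not
anything in Yi's series):

	struct vfio_iommu_type1_info_cap_inv {
		struct vfio_info_cap_header header;
		/* hypothetical: one bitmap of supported granularities per
		 * IOMMU_CACHE_INV_TYPE_*, mirroring the 2D map above */
		__u32 granu_supported[IOMMU_CACHE_INV_TYPE_NR];
	};

Userspace would walk the capability chain returned by
VFIO_IOMMU_GET_INFO and only issue the invalidation types/granularities
it finds advertised there.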

Thanks
Kevin

> 
> Thanks
> 
> Eric
> 
> >
> >> +  /* PASID cache */
> >
> > PASID cache is fully managed by the host. Guest PASID cache invalidation
> > is interpreted by vIOMMU for bind and unbind operations. I don't think
> > we should accept any PASID cache invalidation from userspace or guest.
> >
> >> +  {1, 1, 0}
> >> +};
> >> +
> >> +const static int
> >> inv_type_granu_table[IOMMU_CACHE_INV_TYPE_NR][IOMMU_INV_GRANU_NR] = {
> >> +  /* PASID based IOTLB */
> >> +  {0, QI_GRAN_NONG_PASID, QI_GRAN_PSI_PASID},
> >> +  /* PASID based dev TLBs */
> >> +  {QI_DEV_IOTLB_GRAN_ALL, QI_DEV_IOTLB_GRAN_PASID_SEL, 0},
> >> +  /* PASID cache */
> >> +  {QI_PC_ALL_PASIDS, QI_PC_PASID_SEL, 0},
> >> +

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Baoquan He
On 03/30/20 at 10:42pm, Alexander Graf wrote:
> 
> 
> On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:
> > 
> > 
> > 
> > On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> > > On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> > > > 
> > > > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > > The swiotlb is a very convenient fallback mechanism for bounce 
> > > > > buffering of
> > > > > DMAable data. It is usually used for the compatibility case where 
> > > > > devices
> > > > > can only DMA to a "low region".
> > > > > 
> > > > > However, in some scenarios this "low region" may be bound even more
> > > > > heavily. For example, there are embedded systems where only an SRAM
> > > > > region
> > > > > is shared between device and CPU. There are also heterogeneous 
> > > > > computing
> > > > > scenarios where only a subset of RAM is cache coherent between the
> > > > > components of the system. There are partitioning hypervisors, where
> > > > > a "control VM" that implements device emulation has limited view into 
> > > > > a
> > > > > partition's memory for DMA capabilities due to safety concerns.
> > > > > 
> > > > > This patch adds a command line driven mechanism to move all DMA 
> > > > > memory into
> > > > > a predefined shared memory region which may or may not be part of the
> > > > > physical address layout of the Operating System.
> > > > > 
> > > > > Ideally, the typical path to set this configuration would be through 
> > > > > Device
> > > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. 
> > > > > Also,
> > > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > > > instead configure the system purely through kernel command line 
> > > > > options.
> > > > > 
> > > > > I'm sure other people will find the functionality useful going forward
> > > > > though and extend it to be triggered by DT/ACPI in the future.
> > > > 
> > > > Hmm, we have a use case for kdump; this may be useful.  For example
> > > > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > > > kernel we have to increase the crashkernel reserved size for the extra
> > > > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > > > swiotlb region and pass the addr to kdump kernel.
> > > > 
> > > 
> > > Yes, definitely helpful for kdump kernel. This can help reduce the
> > > crashkernel value.
> > > 
> > > Previously I was thinking about something similar: play around with the
> > > e820 entries passed to kdump and let it place the swiotlb in the wanted
> > > region. Simply remapping it like in this patch looks much cleaner.
> > > 
> > > If this patch is acceptable, one more patch is needed to expose the
> > > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > > second kernel.
> > 
> > We seem to be passing a lot of data to kexec.. Perhaps we need something
> > of a unified way, since we seem to have a lot of things to pass - disabling
> > the IOMMU, the ACPI RSDT address, and now this.
> > 
> > CC-ing Anthony who is working on something - would you by any chance
> > have a doc on this?
> 
> 
> I see in general 2 use cases here:
> 
> 
> 1) Allow for a generic mechanism to have the full system, individual buses,
> devices or functions of a device go through a particular, self-contained
> bounce buffer.
> 
> This sounds like the holy grail to a lot of problems. It would solve typical
> embedded scenarios where you only have a shared SRAM. It solves the safety
> case (to some extent) where you need to ensure that one device interaction
> doesn't conflict with another device interaction. It also solves the problem
> I've tried to solve with the patch here.
> 
> It's unfortunately a lot harder than the patch I sent, so it will take me
> some time to come up with a working patch set.. I suppose starting with a DT
> binding only is sensible. Worst case, x86 does also support DT ...
> 
> (And yes, I'm always happy to review patches if someone else beats me to it)
> 
> 
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump
> 
> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.
> 
> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
> 
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Swiotlb memory has to be contiguous. We can't guarantee that region
won't be touched by kernel allocation before swiotlb init. Then we may
not have a chance to get a contiguous re

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
Hi,

[snip]
> 2) Reuse the SWIOTLB from the previous boot on kexec/kdump

We should only care about kdump

> 
> I see little direct relation to SEV here. The only reason SEV makes it more
> relevant, is that you need to have an SWIOTLB region available with SEV
> while without you could live with a disabled IOMMU.


Here is some comment in arch/x86/kernel/pci-swiotlb.c, it is enforced
for some reason.
/*
 * If SME is active then swiotlb will be set to 1 so that bounce
 * buffers are allocated and used for devices that do not support
 * the addressing range required for the encryption mask.
 */
if (sme_active())
swiotlb = 1;

> 
> However, I can definitely understand how you would want to have a way to
> tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse its
> memory for its own SWIOTLB. That way, you don't have to reserve another 64MB
> of RAM for kdump.
> 
> What I'm curious on is whether we need to be as elaborate. Can't we just
> pass the old SWIOTLB as free memory to the new kexec'ed kernel and
> everything else will fall into place? All that would take is a bit of
> shuffling on the e820 table pass-through to the kexec'ed kernel, no?

Maybe either of the two is fine.  But we may need to ensure the swiotlb
area is reused explicitly in some way.  Take the crashkernel=X,high case:
the major part is in the above-4G region, and a small piece is in low
memory. Then during boot, kernel/driver initialization could allocate out
of the low memory, the remaining low part for swiotlb could be not big
enough, and the swiotlb allocation would finally fail.
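
For example (an illustrative configuration, not taken from a real report),
the existing parameters already let an admin size the low chunk explicitly:

	crashkernel=512M,high crashkernel=128M,low

which reserves the major part above 4G and an explicit low chunk, so even
if early kernel/driver allocations eat into low memory, enough remains for
the swiotlb that must live there.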

Thanks
Dave



Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Dave Young
On 03/30/20 at 09:40am, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> > On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> > >
> > > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > > The swiotlb is a very convenient fallback mechanism for bounce 
> > > > buffering of
> > > > DMAable data. It is usually used for the compatibility case where 
> > > > devices
> > > > can only DMA to a "low region".
> > > >
> > > > However, in some scenarios this "low region" may be bound even more
> > > > heavily. For example, there are embedded systems where only an SRAM
> > > > region
> > > > is shared between device and CPU. There are also heterogeneous computing
> > > > scenarios where only a subset of RAM is cache coherent between the
> > > > components of the system. There are partitioning hypervisors, where
> > > > a "control VM" that implements device emulation has limited view into a
> > > > partition's memory for DMA capabilities due to safety concerns.
> > > >
> > > > This patch adds a command line driven mechanism to move all DMA memory 
> > > > into
> > > > a predefined shared memory region which may or may not be part of the
> > > > physical address layout of the Operating System.
> > > >
> > > > Ideally, the typical path to set this configuration would be through 
> > > > Device
> > > > Tree or ACPI, but neither of the two mechanisms is standardized yet. 
> > > > Also,
> > > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > > instead configure the system purely through kernel command line options.
> > > >
> > > > I'm sure other people will find the functionality useful going forward
> > > > though and extend it to be triggered by DT/ACPI in the future.
> > >
> > > Hmm, we have a use case for kdump; this may be useful.  For example
> > > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > > kernel we have to increase the crashkernel reserved size for the extra
> > > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > > swiotlb region and pass the addr to kdump kernel.
> > >
> > 
> > Yes, definitely helpful for kdump kernel. This can help reduce the
> > crashkernel value.
> > 
> > Previously I was thinking about something similar: play around with the
> > e820 entries passed to kdump and let it place the swiotlb in the wanted
> > region. Simply remapping it like in this patch looks much cleaner.
> > 
> > If this patch is acceptable, one more patch is needed to expose the
> > swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> > second kernel.
> 
> We seem to be passing a lot of data to kexec.. Perhaps we need something
> of a unified way, since we seem to have a lot of things to pass - disabling
> the IOMMU, the ACPI RSDT address, and now this.

The acpi_rsdp kernel cmdline is not useful anymore. Its initial purpose
was kexec-rebooting on EFI systems, but we now support EFI boot across
kexec reboot, so it is no longer needed.

Thanks
Dave



Re: [PATCH v2] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.

2020-03-30 Thread Ashish Kalra
Hello Konrad,

On Tue, Mar 03, 2020 at 12:03:53PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Feb 04, 2020 at 07:35:00PM +, Ashish Kalra wrote:
> > Hello Konrad,
> > 
> > Looking fwd. to your feedback regarding support of other memory
> > encryption architectures such as Power, S390, etc.
> > 
> > Thanks,
> > Ashish
> > 
> > On Fri, Jan 24, 2020 at 11:00:08PM +, Ashish Kalra wrote:
> > > On Tue, Jan 21, 2020 at 03:54:03PM -0500, Konrad Rzeszutek Wilk wrote:
> > > > > 
> > > > > Additional memory calculations based on # of PCI devices and
> > > > > their memory ranges will make it more complicated with so
> > > > > many other permutations and combinations to explore, it is
> > > > > essential to keep this patch as simple as possible by 
> > > > > adjusting the bounce buffer size simply by determining it
> > > > > from the amount of provisioned guest memory.
> > > >> 
> > > >> Please rework the patch to:
> > > >> 
> > > >>  - Use a log solution instead of the multiplication.
> > > >>Feel free to cap it at a sensible value.
> > > 
> > > Ok.
> > > 
> > > >> 
> > > >>  - Also the code depends on SWIOTLB calling in to the
> > > >>adjust_swiotlb_default_size which looks wrong.
> > > >> 
> > > >>You should not adjust io_tlb_nslabs from swiotlb_size_or_default.
> > > 
> > > >>That function's purpose is to report a value.
> > > >> 
> > > >>  - Make io_tlb_nslabs be visible outside of the SWIOTLB code.
> > > >> 
> > > >>  - Can you utilize the IOMMU_INIT APIs and have your own detect which 
> > > >> would
> > > >>modify the io_tlb_nslabs (and set swiotbl=1?).
> > > 
> > > This seems to be a nice option, but then IOMMU_INIT APIs are
> > > x86-specific and this swiotlb buffer size adjustment is also needed
> > > for other memory encryption architectures like Power, S390, etc.
> 
> Oh dear. That I hadn't considered.
> > > 
> > > >> 
> > > >>Actually you seem to be piggybacking on pci_swiotlb_detect_4gb - so
> > > >>perhaps add in this code ? Albeit it really should be in it's own
> > > >>file, not in arch/x86/kernel/pci-swiotlb.c
> > > 
> > > Actually, we piggyback on pci_swiotlb_detect_override which sets
> > > swiotlb=1 as x86_64_start_kernel() and invocation of sme_early_init()
> > > forces swiotlb on, but again this is all x86 architecture specific.
> 
> Then it looks like the best bet is to do it from within swiotlb_init?
> We really can't do it from swiotlb_size_or_default - that function
> should just return a value and nothing else.
> 

Actually, we need to do it in swiotlb_size_or_default(), as this gets
called by reserve_crashkernel_low() in arch/x86/kernel/setup.c and is used
to reserve low crashkernel memory. If we adjust the swiotlb size later in
swiotlb_init(), which gets called after reserve_crashkernel_low(), then
any swiotlb size change/expansion will conflict/overlap with the low
memory reserved for crashkernel.
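
For reference, this is the ordering dependency (paraphrasing
arch/x86/kernel/setup.c of that era; details vary by kernel version):

	static int __init reserve_crashkernel_low(void)
	{
		unsigned long long low_size;

		/* runs early, before swiotlb_init(), and bakes the swiotlb
		 * size into the low crashkernel reservation */
		low_size = max(swiotlb_size_or_default() + (8UL << 20),
			       256UL << 20);
		/* ... reserve low_size below 4G ... */
		return 0;
	}

so any SEV adjustment has to be visible from swiotlb_size_or_default()
itself in order to be reflected in that reservation.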

Thanks,
Ashish


Re: [PATCH V10 07/11] iommu/vt-d: Support flushing more translation cache types

2020-03-30 Thread Jacob Pan
On Fri, 27 Mar 2020 15:46:23 +0100
Auger Eric  wrote:

> Hi Jacob,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > When Shared Virtual Memory is exposed to a guest via vIOMMU,
> > scalable IOTLB invalidation may be passed down from outside IOMMU
> > subsystems. This patch adds invalidation functions that can be used
> > for additional translation cache types.
> > 
> > Signed-off-by: Jacob Pan 
> > 
> > ---
> > v9 -> v10:
> > Fix off by 1 in pasid device iotlb flush
> > 
> > Address v7 missed review from Eric
> > 
> > ---
> > ---
> >  drivers/iommu/dmar.c        | 36 
> >  drivers/iommu/intel-pasid.c |  3 ++-
> >  include/linux/intel-iommu.h | 20 
> >  3 files changed, 54 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index f77dae7ba7d4..4d6b7b5b37ee 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1421,6 +1421,42 @@ void qi_flush_piotlb(struct intel_iommu *iommu, u16 did, u32 pasid, u64 addr,
> >     qi_submit_sync(&desc, iommu);
> >  }
> >  
> > +/* PASID-based device IOTLB Invalidate */
> > +void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> > +   u32 pasid, u16 qdep, u64 addr, unsigned size_order, u64 granu)
> > +{
> > +   unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1);
> > +   struct qi_desc desc = {.qw2 = 0, .qw3 = 0};
> > +
> > +   desc.qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) |
> > +   QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE |
> > +   QI_DEV_IOTLB_PFSID(pfsid);
> > +   desc.qw1 = QI_DEV_EIOTLB_GLOB(granu);
> > +
> > +   /*
> > +* If S bit is 0, we only flush a single page. If S bit is set,
> > +* the least significant zero bit indicates the invalidation address
> > +* range. VT-d spec 6.5.2.6.
> > +* e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB.
> > +* size order = 0 is PAGE_SIZE 4KB
> > +* Max Invs Pending (MIP) is set to 0 for now until we have DIT in
> > +* ECAP.
> > +*/
> > +   desc.qw1 |= addr & ~mask;
> > +   if (size_order)
> > +   desc.qw1 |= QI_DEV_EIOTLB_SIZE;
> > +
> > +   qi_submit_sync(&desc, iommu);
> > +}
> > +
> > +void qi_flush_pasid_cache(struct intel_iommu *iommu, u16 did, u64 granu, int pasid)
> > +{
> > +   struct qi_desc desc = {.qw1 = 0, .qw2 = 0, .qw3 = 0};
> > +
> > +   desc.qw0 = QI_PC_PASID(pasid) | QI_PC_DID(did) | QI_PC_GRAN(granu) | QI_PC_TYPE;
> > +   qi_submit_sync(&desc, iommu);
> > +}
> > +
> >  /*
> >   * Disable Queued Invalidation interface.
> >   */
> > diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> > index 10c7856afc6b..9f6d07410722 100644
> > --- a/drivers/iommu/intel-pasid.c
> > +++ b/drivers/iommu/intel-pasid.c
> > @@ -435,7 +435,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> > {
> > struct qi_desc desc;
> >  
> > -   desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
> > +   desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
> > +   QI_PC_PASID(pasid) | QI_PC_TYPE;  
> Just a nit, this fix is not documented in the commit message.
> 
Thanks, I just sent out this fix separately. Will remove this from the
set.
https://lkml.org/lkml/2020/3/30/1065

> Besides
> Reviewed-by: Eric Auger 
> 
> Thanks
> 
> Eric
> 
> > desc.qw1 = 0;
> > desc.qw2 = 0;
> > desc.qw3 = 0;
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 85b05120940e..43539713b3b3
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -334,7 +334,7 @@ enum {
> >  #define QI_IOTLB_GRAN(gran)    (((u64)gran) >> (DMA_TLB_FLUSH_GRANU_OFFSET-4))
> >  #define QI_IOTLB_ADDR(addr)    (((u64)addr) & VTD_PAGE_MASK)
> >  #define QI_IOTLB_IH(ih)        (((u64)ih) << 6)
> > -#define QI_IOTLB_AM(am)        (((u8)am))
> > +#define QI_IOTLB_AM(am)        (((u8)am) & 0x3f)
> >  #define QI_CC_FM(fm)   (((u64)fm) << 48)
> >  #define QI_CC_SID(sid) (((u64)sid) << 32)
> > @@ -353,16 +353,21 @@ enum {
> >  #define QI_PC_DID(did) (((u64)did) << 16)
> >  #define QI_PC_GRAN(gran)   (((u64)gran) << 4)
> >  
> > -#define QI_PC_ALL_PASIDS   (QI_PC_TYPE | QI_PC_GRAN(0))
> > -#define QI_PC_PASID_SEL    (QI_PC_TYPE | QI_PC_GRAN(1))
> > +/* PASID cache invalidation granu */
> > +#define QI_PC_ALL_PASIDS   0
> > +#define QI_PC_PASID_SEL    1
> >  
> >  #define QI_EIOTLB_ADDR(addr)   ((u64)(addr) & VTD_PAGE_MASK)
> >  #define QI_EIOTLB_IH(ih)   (((u64)ih) << 6)
> > -#define QI_EIOTLB_AM(am)   (((u64)am))
> > +#define QI_EIOTLB_AM(am)   (((u64)am) & 0x3f)
> >  #define QI_EIOTLB_PASID(pasid) (((u64)pasid) << 32)
> >  #define QI_EIOTLB_DID(did) (((u64)did) << 16)
> >  #define QI_EIOTLB_GRAN(gran)   (((u64)gran) << 4)
> >  
> > +/* QI Dev-IOTLB inv granu */
> > +#define QI_DEV_IOTLB_GRAN_ALL  

[PATCH] iommu/vt-d: Fix PASID cache flush

2020-03-30 Thread Jacob Pan
PASID cache type and shift of granularity bits are missing in
the current code.

Fixes: 6f7db75e1c46 ("iommu/vt-d: Add second level page table interface")

Cc: Eric Auger 
Signed-off-by: Jacob Pan 
---
 drivers/iommu/intel-pasid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 22b30f10b396..57d05b0fbafc 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -365,7 +365,8 @@ pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
 {
struct qi_desc desc;
 
-   desc.qw0 = QI_PC_DID(did) | QI_PC_PASID_SEL | QI_PC_PASID(pasid);
+   desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
+  QI_PC_PASID(pasid) | QI_PC_TYPE;
desc.qw1 = 0;
desc.qw2 = 0;
desc.qw3 = 0;
-- 
2.7.4
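
For reference, the before/after below (built only from the definitions quoted
in this thread) shows why both QI_PC_TYPE and the QI_PC_GRAN() shift are
needed at the call site once the granularity macros become raw values:

	/* Old macros had the type and shift baked in:
	 *   #define QI_PC_ALL_PASIDS  (QI_PC_TYPE | QI_PC_GRAN(0))
	 *   #define QI_PC_PASID_SEL   (QI_PC_TYPE | QI_PC_GRAN(1))
	 * New macros are raw granularity values:
	 *   #define QI_PC_ALL_PASIDS  0
	 *   #define QI_PC_PASID_SEL   1
	 * so the fixed call site must build the descriptor explicitly:
	 */
	desc.qw0 = QI_PC_DID(did) | QI_PC_GRAN(QI_PC_PASID_SEL) |
		   QI_PC_PASID(pasid) | QI_PC_TYPE;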



Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support

2020-03-30 Thread Jacob Pan
On Sun, 29 Mar 2020 15:40:22 +0200
Auger Eric  wrote:

> Hi,
> 
> On 3/21/20 12:27 AM, Jacob Pan wrote:
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
> > 
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> > 
> > .-------------.  .---------------------------.
> > |   vIOMMU    |  | Guest process CR3, FL only|
> > |             |  '---------------------------'
> > .----------------/
> > | PASID Entry |--- PASID cache flush -
> > '-------------'                       |
> > |             |                       V
> > |             |                CR3 in GPA
> > '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> > .-------------.  .----------------------.
> > |   pIOMMU    |  | Bind FL for GVA-GPA  |
> > |             |  '----------------------'
> > .----------------/  |
> > | PASID Entry |     V (Nested xlate)
> > '----------------\.------------------------------.
> > |             |   |SL for GPA-HPA, default domain|
> > |             |   '------------------------------'
> > '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan 
> > Signed-off-by: Liu, Yi L 
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 224
> >  include/linux/intel-iommu.h |   8 +-
> >  include/linux/intel-svm.h   |  17
> >  4 files changed, 252 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index e599b2537b1c..b1477cd423dd 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> > 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> > 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> > 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> >  
> >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..47c0deb5ae56 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry((sdev), &(svm)->devs, list) \
> > if ((d) != (sdev)->dev) {} else
> >  
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev,
> > +   struct iommu_gpasid_bind_data *data)
> > +{
> > +   struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +   struct dmar_domain *ddomain;
> > +   struct intel_svm_dev *sdev;
> > +   struct intel_svm *svm;
> > +   int ret = 0;
> > +
> > +   if (WARN_ON(!iommu) || !data)
> > +   return -EINVAL;
> > +
> > +   if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +   data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +   return -EINVAL;
> > +
> > +   if (dev_is_pci(dev)) {
> > +		/* VT-d supports devices with full 20 bit PASIDs only */
> > +   if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +   return -EINVAL;
> > +   } else {
> > +   return -ENOTSUPP;
> > +   }
> > +
> > +   /*
> > +	 * We only check host PASID range, we have no knowledge to check
> > +	 * guest PASID range nor do we use the guest PASID.  
> nit : "nor do we use the guest PASID". Well the guest PASID FLAG is
> checked below and if set, svm->gpasid is set ;-)
Yes, it is a little contradictory, I will remove the use.

I meant we don't really use the gpasid for real work in the host driver :)

> > +*/
> > +   if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > +   return -EINVAL;
> > +
> > +   ddomain = to_dmar_domain(domain);
> > +
> > +	/* Sanity check paging mode support match between host and guest */
> > +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > +	    !cap_5lp_support(iommu->cap)) {
> > +		pr_err("Cannot support 5 level paging requested by guest!\n");
> > +		return -EINVAL;  
> nit: This check also is done in intel_pasid_setup_nested with an extra
> check:
Good catch, I will remove this.

> + switch (addr_width) {
> + case ADDR_WIDTH_5LEVEL:
> + if (cpu_feature_enabled(X86_FEATURE_LA57) &&
> + cap_5lp_support(iommu->cap)) {
> 
> > +   }
> > +
> >
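
For readers following the flow, here is a sketch of the bind payload this
path consumes. Field and flag names follow the uapi header posted with this
series; the values are purely illustrative:

	struct iommu_gpasid_bind_data data = {
		.version    = IOMMU_GPASID_BIND_VERSION_1,
		.format     = IOMMU_PASID_FORMAT_INTEL_VTD,
		.flags      = IOMMU_SVA_GPASID_VAL,  /* guest PASID below is valid */
		.hpasid     = hpasid,                /* host-allocated PASID */
		.gpasid     = gpasid,                /* guest's view of the PASID */
		.addr_width = ADDR_WIDTH_5LEVEL,     /* must match cap_5lp_support() */
		.gpgd       = guest_cr3_gpa,         /* FL page table root, in GPA */
	};

	ret = intel_svm_bind_gpasid(domain, dev, &data);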

Re: [PATCH v2 1/3] dt-bindings: remoteproc: qcom: Add iommus property

2020-03-30 Thread Rob Herring
On Tue, 17 Mar 2020 20:39:08 +0530, Sibi Sankar wrote:
> Add iommus property to allow Q6 modem to boot on platforms which do
> not have trustZone.
> 
> Signed-off-by: Sibi Sankar 
> ---
>  Documentation/devicetree/bindings/remoteproc/qcom,q6v5.txt | 3 +++
>  1 file changed, 3 insertions(+)
> 

Acked-by: Rob Herring 


Re: [PATCH V10 06/11] iommu/vt-d: Add bind guest PASID support

2020-03-30 Thread Jacob Pan
On Sat, 28 Mar 2020 08:02:01 +
"Tian, Kevin"  wrote:

> > From: Jacob Pan 
> > Sent: Saturday, March 21, 2020 7:28 AM
> > 
> > When supporting guest SVA with emulated IOMMU, the guest PASID
> > table is shadowed in VMM. Updates to guest vIOMMU PASID table
> > will result in PASID cache flush which will be passed down to
> > the host as bind guest PASID calls.
> > 
> > For the SL page tables, it will be harvested from device's
> > default domain (request w/o PASID), or aux domain in case of
> > mediated device.
> > 
> > .-------------.  .---------------------------.
> > |   vIOMMU    |  | Guest process CR3, FL only|
> > |             |  '---------------------------'
> > .----------------/
> > | PASID Entry |--- PASID cache flush -
> > '-------------'                       |
> > |             |                       V
> > |             |                CR3 in GPA
> > '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >       v        v                          v
> > Host
> > .-------------.  .----------------------.
> > |   pIOMMU    |  | Bind FL for GVA-GPA  |
> > |             |  '----------------------'
> > .----------------/  |
> > | PASID Entry |     V (Nested xlate)
> > '----------------\.------------------------------.
> > |             |   |SL for GPA-HPA, default domain|
> > |             |   '------------------------------'
> > '-------------'
> > Where:
> >  - FL = First level/stage one page tables
> >  - SL = Second level/stage two page tables
> > 
> > Signed-off-by: Jacob Pan 
> > Signed-off-by: Liu, Yi L 
> > ---
> >  drivers/iommu/intel-iommu.c |   4 +
> >  drivers/iommu/intel-svm.c   | 224
> >  include/linux/intel-iommu.h |   8 +-
> >  include/linux/intel-svm.h   |  17
> >  4 files changed, 252 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index e599b2537b1c..b1477cd423dd 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -6203,6 +6203,10 @@ const struct iommu_ops intel_iommu_ops = {
> > 	.dev_disable_feat	= intel_iommu_dev_disable_feat,
> > 	.is_attach_deferred	= intel_iommu_is_attach_deferred,
> > 	.pgsize_bitmap		= INTEL_IOMMU_PGSIZES,
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +	.sva_bind_gpasid	= intel_svm_bind_gpasid,
> > +	.sva_unbind_gpasid	= intel_svm_unbind_gpasid,
> > +#endif
> >  };
> > 
> >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> > index d7f2a5358900..47c0deb5ae56 100644
> > --- a/drivers/iommu/intel-svm.c
> > +++ b/drivers/iommu/intel-svm.c
> > @@ -226,6 +226,230 @@ static LIST_HEAD(global_svm_list);
> > list_for_each_entry((sdev), &(svm)->devs, list) \
> > if ((d) != (sdev)->dev) {} else
> > 
> > +int intel_svm_bind_gpasid(struct iommu_domain *domain,
> > +   struct device *dev,
> > +   struct iommu_gpasid_bind_data *data)
> > +{
> > +   struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
> > +   struct dmar_domain *ddomain;  
> 
> what about the full name e.g. dmar_domain? though a bit longer
> but clearer than ddomain.
> 
Sure, I don't have a preference.

> > +   struct intel_svm_dev *sdev;
> > +   struct intel_svm *svm;
> > +   int ret = 0;
> > +
> > +   if (WARN_ON(!iommu) || !data)
> > +   return -EINVAL;
> > +
> > +   if (data->version != IOMMU_GPASID_BIND_VERSION_1 ||
> > +   data->format != IOMMU_PASID_FORMAT_INTEL_VTD)
> > +   return -EINVAL;
> > +
> > +   if (dev_is_pci(dev)) {
> > +		/* VT-d supports devices with full 20 bit PASIDs only */
> > +   if (pci_max_pasids(to_pci_dev(dev)) != PASID_MAX)
> > +   return -EINVAL;
> > +   } else {
> > +   return -ENOTSUPP;
> > +   }
> > +
> > +   /*
> > +	 * We only check host PASID range, we have no knowledge to check
> > +	 * guest PASID range nor do we use the guest PASID.
> > +	 */
> > +   if (data->hpasid <= 0 || data->hpasid >= PASID_MAX)
> > +   return -EINVAL;
> > +
> > +   ddomain = to_dmar_domain(domain);
> > +
> > +	/* Sanity check paging mode support match between host and guest */
> > +	if (data->addr_width == ADDR_WIDTH_5LEVEL &&
> > +	    !cap_5lp_support(iommu->cap)) {
> > +		pr_err("Cannot support 5 level paging requested by guest!\n");
> > +		return -EINVAL;
> > +	}  
> 
> -ENOTSUPP?
I was thinking that, from this API's p.o.v., the input is invalid, since
both cap and addr_width are derived from the input arguments.

> 
> > +
> > +   mutex_lock(&pasid_mutex);
> > +   svm = ioasid_find(NULL, data->hpasid, NULL);
> > +   if (IS_ERR(svm)) {
> > +   ret = PTR_ERR(svm);
> > +   goto out;
> > +   }
> > +
> > +   if (svm) {
> > +   /*
> > +* 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Alexander Graf via iommu




On 30.03.20 15:40, Konrad Rzeszutek Wilk wrote:




On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:

On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:


On 03/26/20 at 05:29pm, Alexander Graf wrote:

The swiotlb is a very convenient fallback mechanism for bounce buffering of
DMAable data. It is usually used for the compatibility case where devices
can only DMA to a "low region".

However, in some scenarios this "low region" may be bound even more
heavily. For example, there are embedded system where only an SRAM region
is shared between device and CPU. There are also heterogeneous computing
scenarios where only a subset of RAM is cache coherent between the
components of the system. There are partitioning hypervisors, where
a "control VM" that implements device emulation has limited view into a
partition's memory for DMA capabilities due to safety concerns.

This patch adds a command line driven mechanism to move all DMA memory into
a predefined shared memory region which may or may not be part of the
physical address layout of the Operating System.

Ideally, the typical path to set this configuration would be through Device
Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
instead configure the system purely through kernel command line options.

I'm sure other people will find the functionality useful going forward
though and extend it to be triggered by DT/ACPI in the future.


Hmm, we have a use case for kdump, this may be useful.  For example
swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
kernel we have to increase the crashkernel reserved size for the extra
swiotlb requirement.  I wonder if we can just reuse the old kernel's
swiotlb region and pass the addr to kdump kernel.



Yes, definitely helpful for kdump kernel. This can help reduce the
crashkernel value.

Previously I was thinking about something similar: play around with the
e820 entries passed to kdump and let it place the swiotlb in the wanted region.
Simply remap it like in this patch looks much cleaner.

If this patch is acceptable, one more patch is needed to expose the
swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
second kernel.


We seem to be passsing a lot of data to kexec.. Perhaps something
of a unified way since we seem to have a lot of things to pass - disabling
IOMMU, ACPI RSDT address, and then this.

CC-ing Anthony who is working on something - would you by any chance
have a doc on this?



I see in general 2 use cases here:


1) Allow for a generic mechanism to have the full system, individual 
buses, devices or functions of a device go through a particular, 
self-contained bounce buffer.


This sounds like the holy grail to a lot of problems. It would solve 
typical embedded scenarios where you only have a shared SRAM. It solves 
the safety case (to some extent) where you need to ensure that one 
device interaction doesn't conflict with another device interaction. It 
also solves the problem I've tried to solve with the patch here.


It's unfortunately a lot harder than the patch I sent, so it will take 
me some time to come up with a working patch set. I suppose starting 
with a DT binding only is sensible. Worst case, x86 does also support DT ...


(And yes, I'm always happy to review patches if someone else beats me to it)


2) Reuse the SWIOTLB from the previous boot on kexec/kdump

I see little direct relation to SEV here. The only reason SEV makes it 
more relevant is that you need to have an SWIOTLB region available with 
SEV, while without it you could live with a disabled IOMMU.


However, I can definitely understand how you would want to have a way to 
tell the new kexec'ed kernel where the old SWIOTLB was, so it can reuse 
its memory for its own SWIOTLB. That way, you don't have to reserve 
another 64MB of RAM for kdump.


What I'm curious on is whether we need to be as elaborate. Can't we just 
pass the old SWIOTLB as free memory to the new kexec'ed kernel and 
everything else will fall into place? All that would take is a bit of 
shuffling on the e820 table pass-through to the kexec'ed kernel, no?



Thanks,

Alex
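
For reference, the core of the remap approach discussed in this thread fits
in a few lines. This is a sketch only: the helper name is made up,
swiotlb_init_with_tbl() is the then-current init interface, and a real
implementation needs failure handling and bookkeeping:

	/*
	 * Place the bounce buffer at a fixed physical address, which may
	 * sit outside the memory map the OS booted with.
	 */
	static int __init swiotlb_init_fixed(phys_addr_t addr, size_t size)
	{
		void *vaddr;

		/* Not necessarily RAM we own, so remap instead of allocating. */
		vaddr = memremap(addr, size, MEMREMAP_WB);
		if (!vaddr)
			return -ENOMEM;

		return swiotlb_init_with_tbl(vaddr, size >> IO_TLB_SHIFT, 1);
	}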









[PATCH 7/7] x86/process: Clear PASID state for a newly forked/cloned thread

2020-03-30 Thread Fenghua Yu
The PASID state has to be cleared on forks, since the child has a
different address space. The PASID is also cleared for thread clone. While
it would be correct to inherit the PASID in this case, it is unknown
whether the new task will use ENQCMD. Giving it the PASID "just in case"
would have the downside of increased context switch overhead to setting
the PASID MSR.

Since #GP faults have to be handled on any threads that were created before
the PASID was assigned to the mm of the process, newly created threads
might as well be treated in a consistent way.

Suggested-by: Thomas Gleixner 
Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/kernel/process.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 87de18c64cf5..cefdc8f7fc13 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -122,6 +122,16 @@ static int set_new_tls(struct task_struct *p, unsigned long tls)
return do_set_thread_area_64(p, ARCH_SET_FS, tls);
 }
 
+/* Clear PASID MSR/state for the forked/cloned thread. */
+static void clear_task_pasid(struct task_struct *task)
+{
+   /*
+* Clear the xfeatures bit in the PASID state so that the MSR will be
+* initialized to its init state (0) by XRSTORS.
+*/
+   task->thread.fpu.state.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
+}
+
 int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
 {
@@ -175,6 +185,9 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif
 
+   if (static_cpu_has(X86_FEATURE_ENQCMD))
+   clear_task_pasid(p);
+
/* Set a new TLS for the child thread? */
if (clone_flags & CLONE_SETTLS)
ret = set_new_tls(p, tls);
-- 
2.19.1



[PATCH 4/7] x86/msr-index: Define IA32_PASID MSR

2020-03-30 Thread Fenghua Yu
The IA32_PASID MSR (0xd93) contains the Process Address Space Identifier
(PASID), a 20-bit value. Bit 31 must be set to indicate the value
programmed in the MSR is valid. Hardware uses PASID to identify which
process submits the work and directs responses to the right process.

Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/include/asm/msr-index.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d5e517d1c3dd..ebda24839dc5 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -224,6 +224,9 @@
 #define MSR_IA32_LASTINTFROMIP 0x01dd
 #define MSR_IA32_LASTINTTOIP   0x01de
 
+#define MSR_IA32_PASID 0x0d93
+#define MSR_IA32_PASID_VALID   BIT_ULL(31)
+
 /* DEBUGCTLMSR bits (others vary by model): */
 #define DEBUGCTLMSR_LBR			(1UL <<  0) /* last branch recording */
 #define DEBUGCTLMSR_BTF_SHIFT  1
-- 
2.19.1
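
Concretely, making a PASID usable means writing the 20-bit value with bit 31
set, as the #GP fixup in patch 6 of this series does (the helper name here is
illustrative):

	/* Make 'pasid' (a 20-bit value) valid for the current task. */
	static void pasid_msr_set(int pasid)
	{
		wrmsrl(MSR_IA32_PASID, (u64)pasid | MSR_IA32_PASID_VALID);
	}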



[PATCH 0/7] x86: tag application address space for devices

2020-03-30 Thread Fenghua Yu
Typical hardware devices require a driver stack to translate application
buffers to hardware addresses, and a kernel-user transition to notify the
hardware of new work. What if both the translation and transition overhead
could be eliminated? This is what Shared Virtual Addressing (SVA) and
ENQCMD-enabled hardware like the Data Streaming Accelerator (DSA) aim to achieve.
Applications map portals in their local-address-space and directly submit
work to them using a new instruction.

This series implements management of a new MSR (MSR_IA32_PASID). This new
MSR allows an application address space to be associated with what the PCIe
spec calls a Process Address Space ID (PASID). This PASID tag is carried
along with all requests between applications and devices and allows devices
to interact with the process address space.

SVA and ENQCMD enabled device drivers will use this series in the future.
For example, it will be used by the phase 2 DSA driver which will be
released with SVA and ENQCMD support as explained in:
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

This series only provides simple and basic support for the MSR as follows:
1. Explain different various technical terms used in the series (patch 1).
2. Enumerate support for ENQCMD in the processor (patch 2).
3. Handle FPU PASID state and the MSR during context switch (patches 3-4).
4. Allocate and free PASID for a process (patch 5).
5. Fix up the PASID MSR in the #GP handler when one thread in a process
   executes ENQCMD for the first time (patch 6).
6. Clear PASID state for forked and cloned thread (patch 7).

And this patch series needs support from supervisor states patch set:
https://lore.kernel.org/lkml/20200328164307.17497-1-yu-cheng...@intel.com/

The v3 supervisor states series, this patch series, and DSA phase 2 series
(to be released shortly in idxd driver) can be cloned from:
https://github.com/intel/idxd-driver.git idxd-stage2

References:
1. Detailed information on the ENQCMD/ENQCMDS instructions and the
IA32_PASID MSR can be found in Intel Architecture Instruction Set
Extensions and Future Features Programming Reference:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

2. Detailed information on DSA can be found in DSA specification:
https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification

Ashok Raj (1):
  docs: x86: Add a documentation for ENQCMD

Fenghua Yu (5):
  x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructions
  x86/msr-index: Define IA32_PASID MSR
  x86/mmu: Allocate/free PASID
  x86/traps: Fix up invalid PASID
  x86/process: Clear PASID state for a newly forked/cloned thread

Yu-cheng Yu (1):
  x86/fpu/xstate: Add supervisor PASID state for ENQCMD feature

 Documentation/x86/enqcmd.rst   | 185 +
 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/fpu/types.h   |  10 ++
 arch/x86/include/asm/fpu/xstate.h  |   2 +-
 arch/x86/include/asm/iommu.h   |   3 +
 arch/x86/include/asm/mmu.h |   4 +
 arch/x86/include/asm/mmu_context.h |  14 +++
 arch/x86/include/asm/msr-index.h   |   3 +
 arch/x86/kernel/cpu/cpuid-deps.c   |   1 +
 arch/x86/kernel/fpu/xstate.c   |   4 +
 arch/x86/kernel/process.c  |  13 ++
 arch/x86/kernel/traps.c|  17 +++
 drivers/iommu/intel-svm.c  | 119 +--
 13 files changed, 367 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/x86/enqcmd.rst

-- 
2.19.1
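
To make the submission model concrete, below is a user-space sketch of the
ENQCMD path this cover letter describes. It is illustrative only: the 64-byte
descriptor layout and the mmap()'ed portal are device-specific, the asm
assumes a toolchain that knows the enqcmd mnemonic, and ZF=1 after ENQCMD
means the device did not accept the command. The first ENQCMD in a process
would take the #GP fixup from patch 6.

#include <stdint.h>

/* Hypothetical 64-byte work descriptor; the real layout is device-defined. */
struct work_desc {
	uint8_t bytes[64];
} __attribute__((aligned(64)));

/* Returns 0 if the device accepted the command, 1 if it should be retried. */
static inline int enqcmd_submit(void *portal, const struct work_desc *desc)
{
	unsigned char retry;

	asm volatile("enqcmd (%[desc]), %[portal]"
		     : "=@ccz" (retry)
		     : [portal] "r" (portal), [desc] "r" (desc)
		     : "memory");
	return retry;
}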



[PATCH 3/7] x86/fpu/xstate: Add supervisor PASID state for ENQCMD feature

2020-03-30 Thread Fenghua Yu
From: Yu-cheng Yu 

The IA32_PASID MSR is used when a task submits work via the ENQCMD
instruction. The per task MSR is stored in the task's supervisor FPU
PASID state and is context switched by XSAVES/XRSTORS.

Signed-off-by: Yu-cheng Yu 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/include/asm/fpu/types.h  | 10 ++
 arch/x86/include/asm/fpu/xstate.h |  2 +-
 arch/x86/kernel/fpu/xstate.c  |  4 
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f098f6cab94b..00f8efd4c07d 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -114,6 +114,7 @@ enum xfeature {
XFEATURE_Hi16_ZMM,
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
+   XFEATURE_PASID,
 
XFEATURE_MAX,
 };
@@ -128,6 +129,7 @@ enum xfeature {
 #define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
 #define XFEATURE_MASK_PT   (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_PASID(1 << XFEATURE_PASID)
 
 #define XFEATURE_MASK_FPSSE(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512   (XFEATURE_MASK_OPMASK \
@@ -229,6 +231,14 @@ struct pkru_state {
u32 pad;
 } __packed;
 
+/*
+ * State component 10 is supervisor state used for context-switching the
+ * PASID state.
+ */
+struct ia32_pasid_state {
+   u64 pasid;
+} __packed;
+
 struct xstate_header {
u64 xfeatures;
u64 xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 422d8369012a..ab9833c57aaa 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -33,7 +33,7 @@
  XFEATURE_MASK_BNDCSR)
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
 
 /*
  * Unsupported supervisor features. When a supervisor feature in this mask is
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 7d0a9f878b26..8724675532de 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -37,6 +37,7 @@ static const char *xfeature_names[] =
"AVX-512 ZMM_Hi256" ,
"Processor Trace (unused)"  ,
"Protection Keys User registers",
+   "PASID state",
"unknown xstate feature",
 };
 
@@ -51,6 +52,7 @@ static short xsave_cpuid_features[] __initdata = {
X86_FEATURE_AVX512F,
X86_FEATURE_INTEL_PT,
X86_FEATURE_PKU,
+   X86_FEATURE_ENQCMD,
 };
 
 /*
@@ -316,6 +318,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
+   print_xstate_feature(XFEATURE_MASK_PASID);
 }
 
 /*
@@ -590,6 +593,7 @@ static void check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU,  struct pkru_state);
+   XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
 
/*
 * Make *SURE* to add any feature numbers in below if
-- 
2.19.1



[PATCH 5/7] x86/mmu: Allocate/free PASID

2020-03-30 Thread Fenghua Yu
PASID is shared by all threads in a process. So the logical place to keep
track of it is in the "mm". Add the field to the architecture specific
mm_context_t structure.

A PASID is allocated for an "mm" the first time any thread attaches
to an SVM capable device. Later device attaches (whether to the same
device or another SVM device) will re-use the same PASID.

The PASID is freed when the process exits (so no need to keep
reference counts on how many SVM devices are sharing the PASID).

Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/include/asm/iommu.h   |  2 +
 arch/x86/include/asm/mmu.h |  4 ++
 arch/x86/include/asm/mmu_context.h | 14 +
 drivers/iommu/intel-svm.c  | 82 +++---
 4 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index bf1ed2ddc74b..ed41259fe7ac 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
return -EINVAL;
 }
 
+void __free_pasid(struct mm_struct *mm);
+
 #endif /* _ASM_X86_IOMMU_H */
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index bdeae9291e5c..137bf51f19e6 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -50,6 +50,10 @@ typedef struct {
u16 pkey_allocation_map;
s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+   int pasid;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b538d9ddee9c..1c020c7955e6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -129,9 +130,22 @@ static inline int init_new_context(struct task_struct *tsk,
init_new_context_ldt(mm);
return 0;
 }
+
+static inline void free_pasid(struct mm_struct *mm)
+{
+   if (!IS_ENABLED(CONFIG_INTEL_IOMMU_SVM))
+   return;
+
+   if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+   return;
+
+   __free_pasid(mm);
+}
+
 static inline void destroy_context(struct mm_struct *mm)
 {
destroy_context_ldt(mm);
+   free_pasid(mm);
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index d7f2a5358900..da718a49e91e 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,45 @@ static LIST_HEAD(global_svm_list);
list_for_each_entry((sdev), &(svm)->devs, list) \
if ((d) != (sdev)->dev) {} else
 
+/*
+ * If this mm already has a PASID we can use it. Otherwise allocate a new one.
+ * Let the caller know if we did an allocation via 'new_pasid'.
+ */
+static int alloc_pasid(struct intel_svm *svm, struct mm_struct *mm,
+  int pasid_max,  bool *new_pasid, int flags)
+{
+   int pasid;
+
+   /*
+* Reuse the PASID if the mm already has a PASID and a private
+* PASID is not requested.
+*/
+   if (mm && mm->context.pasid && !(flags & SVM_FLAG_PRIVATE_PASID)) {
+   /*
+* Once a PASID is allocated for this mm, the PASID
+* stays with the mm until the mm is dropped. Reuse
+* the PASID which has been already allocated for the
+* mm instead of allocating a new one.
+*/
+   ioasid_set_data(mm->context.pasid, svm);
+   *new_pasid = false;
+
+   return mm->context.pasid;
+   }
+
+   /*
+* Allocate a new pasid. Do not use PASID 0, reserved for RID to
+* PASID.
+*/
+   pasid = ioasid_alloc(NULL, PASID_MIN, pasid_max - 1, svm);
+   if (pasid == INVALID_IOASID)
+   return -ENOSPC;
+
+   *new_pasid = true;
+
+   return pasid;
+}
+
 int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
 {
struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
@@ -324,6 +363,8 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
init_rcu_head(&sdev->rcu);
 
if (!svm) {
+   bool new_pasid;
+
svm = kzalloc(sizeof(*svm), GFP_KERNEL);
if (!svm) {
ret = -ENOMEM;
@@ -335,15 +376,13 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
if (pasid_max > intel_pasid_max_id)
pasid_max = intel_pasid_max_id;
 
-   /* Do not use PASID 0, reserved for RID to PASID */
-   svm->pasid = ioasid_alloc(NULL, PASID_MIN,
- pasid_max - 1, svm);
- 

[PATCH 6/7] x86/traps: Fix up invalid PASID

2020-03-30 Thread Fenghua Yu
A #GP fault is generated when the ENQCMD instruction is executed without
a valid PASID value programmed in. The #GP fault handler will initialize
the current thread's PASID MSR.

The following heuristic is used to avoid decoding the user instructions
to determine the precise reason for the #GP fault:
1) If the mm for the process has not been allocated a PASID, this #GP
   cannot be fixed.
2) If the PASID MSR is already initialized, then the #GP was for some
   other reason
3) Try initializing the PASID MSR and returning. If the #GP was from
   an ENQCMD this will fix it. If not, the #GP fault will be repeated
   and we will hit case "2".

Suggested-by: Thomas Gleixner 
Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/include/asm/iommu.h |  1 +
 arch/x86/kernel/traps.c  | 17 +
 drivers/iommu/intel-svm.c| 37 
 3 files changed, 55 insertions(+)

diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index ed41259fe7ac..e9365a5d6f7d 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -27,5 +27,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
 }
 
 void __free_pasid(struct mm_struct *mm);
+bool __fixup_pasid_exception(void);
 
 #endif /* _ASM_X86_IOMMU_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6ef00eb6fbb9..369b5ba94635 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_X86_64
 #include 
@@ -488,6 +489,16 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
return GP_CANONICAL;
 }
 
+static bool fixup_pasid_exception(void)
+{
+   if (!IS_ENABLED(CONFIG_INTEL_IOMMU_SVM))
+   return false;
+   if (!static_cpu_has(X86_FEATURE_ENQCMD))
+   return false;
+
+   return __fixup_pasid_exception();
+}
+
 #define GPFSTR "general protection fault"
 
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code)
@@ -499,6 +510,12 @@ dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code)
int ret;
 
RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+   if (user_mode(regs) && fixup_pasid_exception()) {
+   cond_local_irq_enable(regs);
+   return;
+   }
+
cond_local_irq_enable(regs);
 
if (static_cpu_has(X86_FEATURE_UMIP)) {
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index da718a49e91e..5ed39a022adb 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -759,3 +759,40 @@ void __free_pasid(struct mm_struct *mm)
 */
ioasid_free(pasid);
 }
+
+/*
+ * Fix up the PASID MSR if possible.
+ *
+ * But if the #GP was due to another reason, a second #GP might be triggered
+ * to force proper behavior.
+ */
+bool __fixup_pasid_exception(void)
+{
+   struct mm_struct *mm;
+   bool ret = true;
+   u64 pasid_msr;
+   int pasid;
+
+   mm = get_task_mm(current);
+   /* This #GP was triggered from user mode. So mm cannot be NULL. */
+   pasid = mm->context.pasid;
+   /* Ensure this process has been bound to a PASID. */
+   if (!pasid) {
+   ret = false;
+   goto out;
+   }
+
+   /* Check to see if the PASID MSR has already been set for this task. */
+   rdmsrl(MSR_IA32_PASID, pasid_msr);
+   if (pasid_msr & MSR_IA32_PASID_VALID) {
+   ret = false;
+   goto out;
+   }
+
+   /* Fix up the MSR. */
+   wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
+out:
+   mmput(mm);
+
+   return ret;
+}
-- 
2.19.1



[PATCH 2/7] x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructions

2020-03-30 Thread Fenghua Yu
A user space application can execute the ENQCMD instruction to submit work
to a device. The kernel executes the ENQCMDS instruction to submit work to
a device.

There is a lot of other enabling needed for the instructions to actually
be usable in user space and the kernel, and that enabling is coming later
in the series and in device drivers.

The CPU feature flag is shown as "enqcmd" in /proc/cpuinfo.

Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/cpuid-deps.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index f3327cb56edf..d12ee3be1b93 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -349,6 +349,7 @@
 #define X86_FEATURE_CLDEMOTE   (16*32+25) /* CLDEMOTE instruction */
 #define X86_FEATURE_MOVDIRI(16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B  (16*32+28) /* MOVDIR64B instruction */
+#define X86_FEATURE_ENQCMD	(16*32+29) /* ENQCMD and ENQCMDS instructions */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 3cbe24ca80ab..3a02707c1f4d 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -69,6 +69,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_CQM_MBM_TOTAL,X86_FEATURE_CQM_LLC   },
{ X86_FEATURE_CQM_MBM_LOCAL,X86_FEATURE_CQM_LLC   },
{ X86_FEATURE_AVX512_BF16,  X86_FEATURE_AVX512VL  },
+   { X86_FEATURE_ENQCMD,   X86_FEATURE_XSAVES},
{}
 };
 
-- 
2.19.1



[PATCH 1/7] docs: x86: Add a documentation for ENQCMD

2020-03-30 Thread Fenghua Yu
From: Ashok Raj 

ENQCMD and Data Streaming Accelerator (DSA) and all of their associated
features are a complicated stack with lots of interconnected pieces.
This documentation provides a big picture overview for all of the
features.

Signed-off-by: Ashok Raj 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Reviewed-by: Tony Luck 
---
 Documentation/x86/enqcmd.rst | 185 +++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/x86/enqcmd.rst

diff --git a/Documentation/x86/enqcmd.rst b/Documentation/x86/enqcmd.rst
new file mode 100644
index ..414ef7d24028
--- /dev/null
+++ b/Documentation/x86/enqcmd.rst
@@ -0,0 +1,185 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Improved Device Interaction Overview
+
+== Background ==
+
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses avoiding the need for software to translate virtual
+addresses to physical addresses. ENQCMD is a new instruction on Intel
+platforms that allows user applications to directly notify hardware of new
+work, much like doorbells are used in some hardware, but carries a payload
+that carries the PASID and some additional device specific commands
+along with it.
+
+== Address Space Tagging ==
+
+A new MSR (MSR_IA32_PASID) allows an application address space to be
+associated with what the PCIe spec calls a Process Address Space ID
+(PASID). This PASID tag is carried along with all requests between
+applications and devices and allows devices to interact with the process
+address space.
+
+This MSR is managed with the XSAVE feature set as "supervisor state".
+
+== PASID Management ==
+
+The kernel must allocate a PASID on behalf of each process and program it
+into the new MSR to communicate the process identity to platform hardware.
+ENQCMD uses the PASID stored in this MSR to tag requests from this process.
+Requests for DMA from the device are also tagged with the same PASID. The
+platform IOMMU uses the PASID in the transaction to perform address
+translation. The IOMMU APIs set up the corresponding PASID entry in the
+IOMMU with the process address space used by the CPU (e.g. CR3 on x86).
+
+The MSR must be configured on each logical CPU before any application
+thread can interact with a device.  Threads that belong to the same
+process share the same page tables, thus the same MSR value.
+
+The PASID allocation and MSR programming may occur long after a process and
+its threads have been created. If a thread uses ENQCMD without the MSR
+first being populated, it will #GP.  The kernel will fix up the #GP by
+writing the process-wide PASID into the thread that took the #GP. A single
+process PASID can be used simultaneously with multiple devices since they
+all share the same address space.
+
+New threads could inherit the MSR value from the parent. But this would
+involve additional state management for those threads which may never use
+ENQCMD. Clearing the MSR at thread creation permits all threads to have a
+consistent behavior; the PASID is only programmed when the thread calls
+ENQCMD for the first time.
+
+Although ENQCMD can be executed in the kernel, there isn't any usage yet.
+Currently the #GP handler doesn't fix up a #GP triggered from ENQCMD
+executed in the kernel.
+
+== Relationships ==
+
+ * Each process has many threads, but only one PASID
+ * Devices have a limited number (~10's to 1000's) of hardware
+   workqueues and each portal maps down to a single workqueue.
+   The device driver manages allocating hardware workqueues.
+ * A single mmap() maps a single hardware workqueue as a "portal"
+ * For each device with which a process interacts, there must be
+   one or more mmap()'d portals.
+ * Many threads within a process can share a single portal to access
+   a single device.
+ * Multiple processes can separately mmap() the same portal, in
+   which case they still share one device hardware workqueue.
+ * The single process-wide PASID is used by all threads to interact
+   with all devices.  There is not, for instance, a PASID for each
+   thread or each thread<->device pair.
+
+== FAQ ==
+
+* What is SVA/SVM?
+
+Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
+work in the same address space. In short, sharing the address space. Some
+call it Shared Virtual Memory (SVM), but the Linux community wanted to
+avoid confusion with POSIX Shared Memory and Secure Virtual Machines,
+terms already in circulation.
+
+* What is a PASID?
+
+A Process Address Space ID (PASID) is a PCIe-defined TLP Prefix. A PASID is
+a 20 bit number allocated and managed by the OS. PASID is included in all
+transactions between the platform and the device.
+
+* How are shared work queues different?
+
+Traditionally, to allow user space applications to interact with hardware,
+a separate instance is required per process. For example, consider
+doorbells as a mechanism of informing hardware about work to process. Each
+doorbell is r

Re: [PATCH] iommu/arm-smmu: Demote error messages to debug in shutdown callback

2020-03-30 Thread Doug Anderson
Hi,

On Sat, Mar 28, 2020 at 12:35 AM Sai Prakash Ranjan
 wrote:
>
> > Of course the fact that in practice we'll *always* see the warning
> > because there's no way to tear down the default DMA domains, and even
> > if all devices *have* been nicely quiesced there's no way to tell, is
> > certainly less than ideal. Like I say, it's not entirely clear-cut
> > either way...
> >
>
> Thanks for these examples, good to know these scenarios in case we come
> across these.
> However, if we see these error/warning messages appear every time, then
> what will be the credibility of these messages? We will just ignore them
> when the issues you mention actually appear, because we see them on every
> reboot or shutdown.

I would agree that if these messages are expected to be seen every
time, there's no way to fix them, and they're not indicative of any
problem then something should be done.  Seeing something printed at
"dev_error" level with an exclamation point (!) at the end makes me
feel like this is something that needs immediate action on my part.

If we really can't do better but feel that the messages need to be
there, at least make them dev_info and less scary like:

  arm-smmu 1500.iommu: turning off; DMA should be quiesced before now

...that would still give you a hint in the logs that if you saw a DMA
transaction after the message that it was a bug but also wouldn't
sound scary to someone who wasn't seeing any other problems.

-Doug


Re: [PATCH V10 05/11] iommu/vt-d: Add nested translation helper function

2020-03-30 Thread Jacob Pan
On Sun, 29 Mar 2020 16:03:36 +0800
Lu Baolu  wrote:

> On 2020/3/27 20:21, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> Nested translation mode is supported in VT-d 3.0 Spec.CH 3.8.  
> > 
> > now the spec is already at rev3.1 😊  
> 
> Updated.
> 
> >   
> >> With PASID granular translation type set to 0x11b, translation
> >> result from the first level(FL) also subject to a second level(SL)
> >> page table translation. This mode is used for SVA virtualization,
> >> where FL performs guest virtual to guest physical translation and
> >> SL performs guest physical to host physical translation.
> >>
> >> This patch adds a helper function for setting up nested translation
> >> where second level comes from a domain and first level comes from
> >> a guest PGD.
> >>
> >> Signed-off-by: Jacob Pan 
> >> Signed-off-by: Liu, Yi L 
> >> ---
> >>   drivers/iommu/intel-pasid.c | 240 +++-
> >>   drivers/iommu/intel-pasid.h |  12 +++
> >>   include/linux/intel-iommu.h |   3 +
> >>   3 files changed, 252 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> >> index 9bdb7ee228b6..10c7856afc6b 100644
> >> --- a/drivers/iommu/intel-pasid.c
> >> +++ b/drivers/iommu/intel-pasid.c
> >> @@ -359,6 +359,76 @@ pasid_set_flpm(struct pasid_entry *pe, u64 value)
> >> 	pasid_set_bits(&pe->val[2], GENMASK_ULL(3, 2), value << 2);
> >>   }
> >>
> >> +/*
> >> + * Setup the Extended Memory Type(EMT) field (Bits 91-93)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_emt(struct pasid_entry *pe, u64 value)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(29, 27), value << 27);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Page Attribute Table (PAT) field (Bits 96-127)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pat(struct pasid_entry *pe, u64 value)
> >> +{
> >> +	pasid_set_bits(&pe->val[1], GENMASK_ULL(63, 32), value << 32);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Cache Disable (CD) field (Bit 89)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_cd(struct pasid_entry *pe)
> >> +{
> >> +  pasid_set_bits(&pe->val[1], 1 << 25, 1 << 25);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Extended Memory Type Enable (EMTE) field (Bit 90)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_emte(struct pasid_entry *pe)
> >> +{
> >> +  pasid_set_bits(&pe->val[1], 1 << 26, 1 << 26);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Extended Access Flag Enable (EAFE) field (Bit 135)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_eafe(struct pasid_entry *pe)
> >> +{
> >> +  pasid_set_bits(&pe->val[2], 1 << 7, 1 << 7);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Page-level Cache Disable (PCD) field (Bit 95)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pcd(struct pasid_entry *pe)
> >> +{
> >> +  pasid_set_bits(&pe->val[1], 1 << 31, 1 << 31);
> >> +}
> >> +
> >> +/*
> >> + * Setup the Page-level Write-Through (PWT)) field (Bit 94)
> >> + * of a scalable mode PASID entry.
> >> + */
> >> +static inline void
> >> +pasid_set_pwt(struct pasid_entry *pe)
> >> +{
> >> +  pasid_set_bits(&pe->val[1], 1 << 30, 1 << 30);
> >> +}
> >> +
> >>   static void
> >>   pasid_cache_invalidation_with_pasid(struct intel_iommu *iommu,
> >>u16 did, int pasid)
> >> @@ -492,7 +562,7 @@ int intel_pasid_setup_first_level(struct
> >> intel_iommu *iommu,
> >>pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >>/* Setup Present and PASID Granular Transfer Type: */
> >> -  pasid_set_translation_type(pte, 1);
> >> +  pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
> >>pasid_set_present(pte);
> >>pasid_flush_caches(iommu, pte, pasid, did);
> >>
> >> @@ -564,7 +634,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> >>pasid_set_domain_id(pte, did);
> >>pasid_set_slptr(pte, pgd_val);
> >>pasid_set_address_width(pte, agaw);
> >> -  pasid_set_translation_type(pte, 2);
> >> +  pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
> >>pasid_set_fault_enable(pte);
> >>pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >> @@ -598,7 +668,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> >>pasid_clear_entry(pte);
> >>pasid_set_domain_id(pte, did);
> >>pasid_set_address_width(pte, iommu->agaw);
> >> -  pasid_set_translation_type(pte, 4);
> >> +  pasid_set_translation_type(pte, PASID_ENTRY_PGTT_PT);
> >>pasid_set_fault_enable(pte);
> >>pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
> >>
> >> @@ -612,3 +682,167 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
> >>
> >>return 0;
> >>   }
> >> +
> >> 
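
All of the pasid_set_*() setters quoted above funnel into a single
read-modify-write helper. For reference, a minimal sketch that is consistent
with the quoted call sites (the actual helper lives in intel-pasid.c):

	/* Update only the bits selected by 'mask' in one 64-bit word of
	 * a PASID table entry, leaving the other bits untouched. */
	static inline void pasid_set_bits(u64 *ptr, u64 mask, u64 bits)
	{
		u64 old;

		old = READ_ONCE(*ptr);
		WRITE_ONCE(*ptr, (old & ~mask) | bits);
	}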

Re: [PATCH V10 03/11] iommu/vt-d: Add a helper function to skip agaw

2020-03-30 Thread Jacob Pan
On Sun, 29 Mar 2020 15:20:55 +0800
Lu Baolu  wrote:

> On 2020/3/27 19:53, Tian, Kevin wrote:
> >> From: Jacob Pan 
> >> Sent: Saturday, March 21, 2020 7:28 AM
> >>
> >> Signed-off-by: Jacob Pan   
> > 
> > could you elaborate in which scenario this helper function is
> > required?  
> 
> I added below commit message:
> 
>  An Intel iommu domain uses 5-level page table by default. If the
>  iommu that the domain tries to attach supports less page levels,
>  the top level page tables should be skipped. Add a helper to do
>  this so that it could be used in other places.
> 
Thanks Baolu,
I will also add this to my v11, it might save you some time :)


> Best regards,
> baolu
> 
> > 
> >> ---
> >>   drivers/iommu/intel-pasid.c | 22 ++
> >>   1 file changed, 22 insertions(+)
> >>
> >> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> >> index 22b30f10b396..191508c7c03e 100644
> >> --- a/drivers/iommu/intel-pasid.c
> >> +++ b/drivers/iommu/intel-pasid.c
> >> @@ -500,6 +500,28 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
> >>   }
> >>
> >>   /*
> >> + * Skip top levels of page tables for iommu which has less agaw
> >> + * than default. Unnecessary for PT mode.
> >> + */
> >> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> >> +struct intel_iommu *iommu,
> >> +struct dma_pte **pgd)
> >> +{
> >> +  int agaw;
> >> +
> >> +  for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> >> +  *pgd = phys_to_virt(dma_pte_addr(*pgd));
> >> +  if (!dma_pte_present(*pgd)) {
> >> +  return -EINVAL;
> >> +  }
> >> +  }
> >> +	pr_debug_ratelimited("%s: pgd: %llx, agaw %d d_agaw %d\n",
> >> +			     __func__, (u64)*pgd, iommu->agaw, domain->agaw);
> >> +
> >> +  return agaw;
> >> +}
> >> +
> >> +/*
> >>  * Set up the scalable mode pasid entry for second only translation type.
> >>  */
> >>   int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> >> --
> >> 2.7.4  
> >   

[Jacob Pan]
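
For context, a sketch of how the second-level setup path would consume this
helper; the surrounding names (pte, domain, iommu) follow the quoted code
and error handling is abbreviated:

	struct dma_pte *pgd = domain->pgd;
	int agaw;

	/* Walk down until the table height matches what this IOMMU supports. */
	agaw = iommu_skip_agaw(domain, iommu, &pgd);
	if (agaw < 0)
		return -EINVAL;

	pasid_set_slptr(pte, virt_to_phys(pgd));
	pasid_set_address_width(pte, agaw);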


Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-03-30 Thread Jacob Pan
On Mon, 30 Mar 2020 05:40:40 +
"Tian, Kevin"  wrote:

> > From: Jacob Pan 
> > Sent: Saturday, March 28, 2020 7:54 AM
> > 
> > On Fri, 27 Mar 2020 00:47:02 -0700
> > Christoph Hellwig  wrote:
> >   
> > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin wrote:  
> > > > If those API calls are inter-dependent for composing a feature
> > > > (e.g. SVA), shouldn't we need a way to check them together
> > > > before exposing the feature to the guest, e.g. through a
> > > > iommu_get_uapi_capabilities interface?  
> > >
> > > Yes, that makes sense.  The important bit is to have a capability
> > > flags and not version numbers.  
> > 
> > The challenge is that there are two consumers in the kernel for
> > this. 1. VFIO only look for compatibility, and size of each data
> > struct such that it can copy_from_user.
> > 
> > 2. IOMMU driver, the "real consumer" of the content.
> > 
> > For 2, I agree and we do plan to use the capability flags to check
> > content and maintain backward compatibility etc.
> > 
> > For VFIO, it is difficult to do size look up based on capability
> > flags. 
> 
> Can you elaborate the difficulty in VFIO? if, as Christoph Hellwig
> pointed out, version number is already avoided everywhere, it is 
> interesting to know whether this work becomes a real exception
> or just requires a different mindset.
> 
From VFIO's p.o.v. the IOMMU UAPI data is opaque; it only needs to do two
things:
1. is the UAPI compatible?
2. what is the size to copy?

If you look at the version number, this is really a "version as size"
lookup, as provided by the helper function in this patch. An example
can be the newly introduced clone3 syscall.
https://lwn.net/Articles/792628/
In clone3, new version must have new size. The slight difference here
is that, unlike clone3, we have multiple data structures instead of a
single struct clone_args {}. And each struct has flags to enumerate its
contents besides size.

Besides breaching the data abstraction, if VFIO had to check IOMMU flags to
determine the sizes, it would have to handle many combinations.

We also separate the responsibilities into two parts
1. compatibility - version, size by VFIO
2. sanity check - capability flags - by IOMMU

I think the latter matches what Christoph's comments. So we are in
agreement at the IOMMU level :)

For example:
During guest PASID bind, IOMMU driver operates on the data passed from
VFIO and check format & flags to take different code path.

#define IOMMU_PASID_FORMAT_INTEL_VTD1
__u32 format;
#define IOMMU_SVA_GPASID_VAL(1 << 0) /* guest PASID valid */
__u64 flags;

Jacob

> btw the most relevant discussion which I can find out now is here:
>   https://lkml.org/lkml/2020/2/3/1126
> 
> It mentioned 3 options for handling extension:
> --
> 1. Disallow adding new members to each structure other than reuse
> padding bits or adding union members at the end.
> 2. Allow extension of the structures beyond union, but union size has
> to be fixed with reserved spaces
> 3. Adopt VFIO argsz scheme, I don't think we need version for each
> struct anymore. argsz implies the version that user is using assuming
> UAPI data is extension only.
> --
> 
> the first two are both version-based. Looks most guys agreed with 
> option-1 (in this v2), but Alex didn't give his opinion at the
> moment. The last response from him was the raise of option-3 using
> argsz to avoid version. So, we also need hear from him. Alex?
> 
> Thanks
> Kevin

[Jacob Pan]
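
A minimal sketch of the "version as size" lookup described above, with
hypothetical table contents; VFIO resolves only compatibility and copy size,
while content sanity checks stay behind the IOMMU driver's capability flags:

	/* Hypothetical: bytes to copy_from_user() for each published version. */
	static const size_t iommu_uapi_bind_size[] = {
		[1] = sizeof(struct iommu_gpasid_bind_data),	/* version 1 */
	};

	static long iommu_uapi_get_data_size(unsigned int version)
	{
		if (!version || version >= ARRAY_SIZE(iommu_uapi_bind_size))
			return -EINVAL;
		return iommu_uapi_bind_size[version];
	}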


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-30 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Monday, March 30, 2020 4:32 PM
> To: Liu, Yi L ; alex.william...@redhat.com;
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu, Yi L 
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L 
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> 
> are->is

thanks.

> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> 
> "address space" -> "address spaces"
> 
> "binding the" -> "binding each"

will correct both.

> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> 
> "Additionally, because such"
> 
> > requests are intended to be invoked by QEMU or other applications which
> 
> simplify to "intended to be invoked from userspace"

got it.

> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> 
> devices -> device

got it.

> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Yi Sun 
> > Signed-off-by: Jacob Pan 
> > ---
> >  drivers/vfio/vfio.c | 130
> > 
> >  drivers/vfio/vfio_iommu_type1.c | 104
> > 
> >  include/linux/vfio.h|  20 +++
> >  include/uapi/linux/vfio.h   |  41 +
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define DRIVER_VERSION "0.3"
> >  #define DRIVER_AUTHOR  "Alex Williamson
> > "
> > @@ -46,6 +47,8 @@ static struct vfio {
> > struct mutexgroup_lock;
> > struct cdev group_cdev;
> > dev_t   group_devt;
> > +   struct list_headvfio_mm_list;
> > +   struct mutexvfio_mm_lock;
> > 	wait_queue_head_t	release_q;
> > } vfio;

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Konrad Rzeszutek Wilk
On Mon, Mar 30, 2020 at 02:06:01PM +0800, Kairui Song wrote:
> On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
> >
> > On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > > The swiotlb is a very convenient fallback mechanism for bounce buffering 
> > > of
> > > DMAable data. It is usually used for the compatibility case where devices
> > > can only DMA to a "low region".
> > >
> > > However, in some scenarios this "low region" may be bound even more
> > > heavily. For example, there are embedded system where only an SRAM region
> > > is shared between device and CPU. There are also heterogeneous computing
> > > scenarios where only a subset of RAM is cache coherent between the
> > > components of the system. There are partitioning hypervisors, where
> > > a "control VM" that implements device emulation has limited view into a
> > > partition's memory for DMA capabilities due to safety concerns.
> > >
> > > This patch adds a command line driven mechanism to move all DMA memory 
> > > into
> > > a predefined shared memory region which may or may not be part of the
> > > physical address layout of the Operating System.
> > >
> > > Ideally, the typical path to set this configuration would be through 
> > > Device
> > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > instead configure the system purely through kernel command line options.
> > >
> > > I'm sure other people will find the functionality useful going forward
> > > though and extend it to be triggered by DT/ACPI in the future.
> >
> > Hmm, we have a use case for kdump, this maybe useful.  For example
> > swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> > kernel we have to increase the crashkernel reserved size for the extra
> > swiotlb requirement.  I wonder if we can just reuse the old kernel's
> > swiotlb region and pass the addr to kdump kernel.
> >
> 
> Yes, definitely helpful for kdump kernel. This can help reduce the
> crashkernel value.
> 
> Previously I was thinking about something similar, play around the
> e820 entry passed to kdump and let it place swiotlb in wanted region.
> Simply remap it like in this patch looks much cleaner.
> 
> If this patch is acceptable, one more patch is needed to expose the
> swiotlb in iomem, so kexec-tools can pass the right kernel cmdline to
> second kernel.

We seem to be passing a lot of data to kexec. Perhaps we need something
of a unified way, since we have quite a few things to pass - disabling the
IOMMU, the ACPI RSDT address, and now this.
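Thinking aloud, such a unified handover could be a single versioned blob
rather than N separate command-line flags; a purely hypothetical sketch
(nothing like this exists today, names are made up):

/* hypothetical: one versioned structure handed over across kexec */
struct kexec_handover {
	__u32	version;
	__u32	flags;		/* e.g. "IOMMU disabled" */
	__u64	acpi_rsdp_addr;
	__u64	swiotlb_addr;
	__u64	swiotlb_size;
};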

CC-ing Anthony who is working on something - would you by any chance
have a doc on this?

Thanks!
> 
> > >
> > > Signed-off-by: Alexander Graf 
> > > ---
> > >  Documentation/admin-guide/kernel-parameters.txt |  3 +-
> > >  Documentation/x86/x86_64/boot-options.rst   |  4 ++-
> > >  kernel/dma/swiotlb.c| 46 
> > > +++--
> > >  3 files changed, 49 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > > b/Documentation/admin-guide/kernel-parameters.txt
> > > index c07815d230bc..d085d55c3cbe 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -4785,11 +4785,12 @@
> > >   it if 0 is given (See 
> > > Documentation/admin-guide/cgroup-v1/memory.rst)
> > >
> > >   swiotlb=[ARM,IA-64,PPC,MIPS,X86]
> > > - Format: {  | force | noforce }
> > > + Format: {  | force | noforce | addr= > > addr> }
> > >-- Number of I/O TLB slabs
> > >   force -- force using of bounce buffers even if they
> > >wouldn't be automatically used by the 
> > > kernel
> > >   noforce -- Never use bounce buffers (for debugging)
> > > + addr= -- Try to allocate SWIOTLB at 
> > > defined address
> > >
> > >   switches=   [HW,M68k]
> > >
> > > diff --git a/Documentation/x86/x86_64/boot-options.rst 
> > > b/Documentation/x86/x86_64/boot-options.rst
> > > index 2b98efb5ba7f..ca46c57b68c9 100644
> > > --- a/Documentation/x86/x86_64/boot-options.rst
> > > +++ b/Documentation/x86/x86_64/boot-options.rst
> > > @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART 
> > > hardware IOMMU:
> > >  iommu options only relevant to the software bounce buffering (SWIOTLB) 
> > > IOMMU
> > >  implementation:
> > >
> > > -swiotlb=[,force]
> > > +swiotlb=[,force][,addr=]
> > >
> > >  Prereserve that many 128K pages for the software IO bounce 
> > > buffering.
> > >force
> > >  Force all IO through the software TLB.
> > > +  addr=
> > > +Try to allocate SWIOTLB at defined address
> > >
> > >  Settings for the IBM Calgary hardware IOMMU currently found in IBM
> > > 

Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Mark Rutland
On Thu, Mar 26, 2020 at 06:11:31PM +0100, Alexander Graf wrote:
> On 26.03.20 18:05, Christoph Hellwig wrote:
> > 
> > On Thu, Mar 26, 2020 at 05:29:22PM +0100, Alexander Graf wrote:
> > > The swiotlb is a very convenient fallback mechanism for bounce buffering 
> > > of
> > > DMAable data. It is usually used for the compatibility case where devices
> > > can only DMA to a "low region".
> > > 
> > > However, in some scenarios this "low region" may be bound even more
> > > heavily. For example, there are embedded system where only an SRAM region
> > > is shared between device and CPU. There are also heterogeneous computing
> > > scenarios where only a subset of RAM is cache coherent between the
> > > components of the system. There are partitioning hypervisors, where
> > > a "control VM" that implements device emulation has limited view into a
> > > partition's memory for DMA capabilities due to safety concerns.
> > > 
> > > This patch adds a command line driven mechanism to move all DMA memory 
> > > into
> > > a predefined shared memory region which may or may not be part of the
> > > physical address layout of the Operating System.
> > > 
> > > Ideally, the typical path to set this configuration would be through 
> > > Device
> > > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > > instead configure the system purely through kernel command line options.
> > > 
> > > I'm sure other people will find the functionality useful going forward
> > > though and extend it to be triggered by DT/ACPI in the future.
> > 
> > I'm totally against hacking in a kernel parameter for this.  We'll need
> > a proper documented DT or ACPI way.
> 
> I'm with you on that sentiment, but in the environment I'm currently looking
> at, we have neither DT nor ACPI: The kernel gets purely configured via
> kernel command line. For other unenumerable artifacts on the system, such as
> virtio-mmio platform devices, that works well enough and also basically
> "hacks a kernel parameter" to specify the system layout.

On the arm64 front, you'd *have* to pass a DT to the kernel (as that's
where we get the command line from), and we *only* discover memory
from the DT or EFI memory map, so the arguments above aren't generally
applicable. You can enumerate virtio-mmio devices from DT, also.

Device-specific constraints on memory should really be described in a
per-device fashion in the FW tables so that the OS can decide how to
handle them. Just because one device can only access memory in a
specific 1MiB window doesn't mean all others should be forced to share
the same constraint. I think that's what Christoph was alluding to.
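To make that concrete: a driver for the one constrained device could hang
a dedicated coherent pool off just that device, leaving the rest of the
system on its normal DMA path. A minimal sketch, assuming a hypothetical
platform device and a made-up SRAM window:

static int myplat_probe(struct platform_device *pdev)
{
	int ret;

	/* hypothetical SRAM window that only this device can reach */
	ret = dma_declare_coherent_memory(&pdev->dev,
					  0x40000000,	/* CPU physical */
					  0x40000000,	/* device view */
					  SZ_1M);
	if (ret)
		return ret;

	/* dma_alloc_coherent() for this device now draws from that window */
	return 0;
}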

Thanks,
Mark.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> are used to achieve flexible device sharing across domains (e.g. VMs).

are->is

> Also there are hardware assisted mediated pass-through solutions from
> platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-
> domain
> concept, which means mdevs are protected by an iommu domain which is
> aux-domain of its physical device. Details can be found in the KVM

"by an iommu domain which is auxiliary to the domain that the kernel
driver primarily uses for DMA API"

> presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> 
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> out the physical device of an IOMMU-backed mdev and then invoking
> IOMMU
> requests to IOMMU layer with the physical device and the mdev's aux
> domain info.

"and then calling into the IOMMU layer to complete the vSVA operations
on the aux domain associated with that mdev"

> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> on IOMMU-backed mdevs.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> CC: Jun Tian 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 937ec3f..d473665 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -132,6 +132,7 @@ struct vfio_regions {
> 
>  struct domain_capsule {
>   struct iommu_domain *domain;
> + struct vfio_group *group;
>   void *data;
>  };
> 
> @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct
> vfio_iommu *iommu,
>   list_for_each_entry(d, &iommu->domain_list, next) {
>   dc.domain = d->domain;
>   list_for_each_entry(g, &d->group_list, next) {
> + dc.group = g;
>   ret = iommu_group_for_each_dev(g->iommu_group,
>  &dc, fn);
>   if (ret)
> @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev,
> void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
> 
> - return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> + if (dc->group->mdev_group)
> + return iommu_sva_bind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev), gbind_data);
> + else
> + return iommu_sva_bind_gpasid(dc->domain,
> + dev, gbind_data);
>  }
> 
>  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device
> *dev, void *data)
>   struct iommu_gpasid_bind_data *gbind_data =
>   (struct iommu_gpasid_bind_data *) dc->data;
> 
> - return iommu_sva_unbind_gpasid(dc->domain, dev,
> + if (dc->group->mdev_group)
> + return iommu_sva_unbind_gpasid(dc->domain,
> + vfio_mdev_get_iommu_device(dev),
>   gbind_data->hpasid);
> + else
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);
>  }
> 
>  /**
> @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev,
> void *data)
>   struct iommu_cache_invalidate_info *cache_inv_info =
>   (struct iommu_cache_invalidate_info *) dc->data;
> 
> - return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> + if (dc->group->mdev_group)
> + return iommu_cache_invalidate(dc->domain,
> + vfio_mdev_get_iommu_device(dev), cache_inv_info);
> + else
> + return iommu_cache_invalidate(dc->domain,
> + dev, cache_inv_info);
>  }

possibly the above could be simplified, e.g.:

static struct device *vfio_get_iommu_device(struct vfio_group *group, 
struct device *dev)
{
if  (group->mdev_group)
return vfio_mdev_get_iommu_device(dev);
else
return dev;
}

Then use it to replace plain 'dev' in all three places.
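With that helper, each callback shrinks to a one-line adjustment; e.g. the
bind case would look something like this (just a sketch, reusing the code
quoted above):

static int vfio_bind_gpasid_fn(struct device *dev, void *data)
{
	struct domain_capsule *dc = (struct domain_capsule *)data;
	struct iommu_gpasid_bind_data *gbind_data =
		(struct iommu_gpasid_bind_data *) dc->data;

	/* resolve the parent device for IOMMU-backed mdevs */
	dev = vfio_get_iommu_device(dc->group, dev);

	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
}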

> 
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
> --
> 2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest
> "owns" the
> first-level/stage-1 translation structures, the host IOMMU driver has no
> knowledge of first-level/stage-1 structure cache updates unless the guest
> invalidation requests are trapped and propagated to the host.
> 
> This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to
> propagate guest
> first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU
> cache
> correctness.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> as the host IOMMU iotlb correctness are ensured.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Eric Auger 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49
> +
>  include/uapi/linux/vfio.h   | 22 ++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index a877747..937ec3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2423,6 +2423,15 @@ static long
> vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_cache_inv_fn(struct device *dev, void *data)

vfio_iommu_cache_inv_fn

> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_cache_invalidate_info *cache_inv_info =
> + (struct iommu_cache_invalidate_info *) dc->data;
> +
> + return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>   }
>   kfree(gbind_data);
>   return ret;
> + } else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> + struct vfio_iommu_type1_cache_invalidate cache_inv;
> + u32 version;
> + int info_size;
> + void *cache_info;
> + int ret;
> +
> + minsz = offsetofend(struct
> vfio_iommu_type1_cache_invalidate,
> + flags);
> +
> + if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (cache_inv.argsz < minsz || cache_inv.flags)
> + return -EINVAL;
> +
> + /* Get the version of struct iommu_cache_invalidate_info */
> + if (copy_from_user(&version,
> + (void __user *) (arg + minsz), sizeof(version)))
> + return -EFAULT;
> +
> + info_size = iommu_uapi_get_data_size(
> + IOMMU_UAPI_CACHE_INVAL,
> version);
> +
> + cache_info = kzalloc(info_size, GFP_KERNEL);
> + if (!cache_info)
> + return -ENOMEM;
> +
> + if (copy_from_user(cache_info,
> + (void __user *) (arg + minsz), info_size)) {
> + kfree(cache_info);
> + return -EFAULT;
> + }
> +
> + mutex_lock(&iommu->lock);
> + ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> + cache_info);
> + mutex_unlock(&iommu->lock);
> + kfree(cache_info);
> + return ret;
>   }
> 
>   return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2235bc6..62ca791 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
>   */
>  #define VFIO_IOMMU_BIND  _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> + *   struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host. The cache
> + * invalidation information is conveyed by @cache_info, the content
> + * format would be structures defined in uapi/linux/iommu.h. User
> + * should be aware of that the struct  iommu_cache_invalidate_info
> + * has a @version field, vfio needs to parse this field before getting
> + * data from userspace.
> + *
> + * Availability of this IOCTL is after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +struct vfio_iommu_type1_cache_invalidate {
> + __u32   argsz;
> + __u32   flags;
> + struct  iommu_cache_invalidate_info cache_info;
> +};
> +#define VFIO_IOMMU_CACHE_INVALIDATE  _IO(VFIO_TYPE, VFIO_BASE +
> 24)
> +
>  /*  Additional API for SPAPR TCE (Server POWERPC) IOMMU  */
> 
>  /*
> --
> 2.7.4

This 

RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> IOMMUs that have nesting DMA translation (a.k.a dual stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM to own the first-
> level/stage-1 translation structures. Example of such usage is vSVA (
> virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> translation structures and bind the structures to host, then hardware
> IOMMU would utilize nesting translation when doing DMA translation fo
> the devices behind such hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a stage 1)
> structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> bind
> guest page table is needed, it also requires to expose interface to guest
> for iommu cache invalidation when guest modified the first-level/stage-1
> translation structures since hardware needs to be notified to flush stale
> iotlbs. This would be introduced in next patch.
> 
> In this patch, guest page table bind and unbind are done by using flags
> VFIO_IOMMU_BIND_GUEST_PGTBL and
> VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> struct iommu_gpasid_bind_data. Before binding guest page table to host,
> VM should have got a PASID allocated by host via
> VFIO_IOMMU_PASID_REQUEST.
> 
> Bind guest translation structures (here is guest page table) to host

Bind -> Binding

> are the first step to setup vSVA (Virtual Shared Virtual Addressing).

are -> is. And you already explained vSVA earlier.

> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158
> 
>  include/uapi/linux/vfio.h   |  46 
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(&iommu->domain_list))
> 
> +struct domain_capsule {
> + struct iommu_domain *domain;
> + void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +   int (*fn)(struct device *dev, void *data),
> +   void *data)
> +{
> + struct domain_capsule dc = {.data = data};
> + struct vfio_domain *d;
> + struct vfio_group *g;
> + int ret = 0;
> +
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + dc.domain = d->domain;
> + list_for_each_entry(g, &d->group_list, next) {
> + ret = iommu_group_for_each_dev(g->iommu_group,
> +&dc, fn);
> + if (ret)
> + break;
> + }
> + }
> + return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
> 
>  /*
> @@ -2314,6 +2341,88 @@ static int
> vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>   return 0;
>  }
> 
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +

In Jacob's vSVA iommu series, [PATCH 06/11]:

+   /* REVISIT: upper layer/VFIO can track host process that bind 
the PASID.
+* ioasid_set = mm might be sufficient for vfio to check pasid 
VMM
+* ownership.
+*/

I asked him who exactly should be responsible for tracking the PASID
ownership. Although there is no response yet, I expect vfio/iommu to have
a clear policy, also documented here, to provide a consistent message.

> + return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> + struct domain_capsule *dc = (struct domain_capsule *)data;
> + struct iommu_gpasid_bind_data *gbind_data =
> + (struct iommu_gpasid_bind_data *) dc->data;
> +
> + return iommu_sva_unbind_gpasid(dc->domain, dev,
> + gbind_data->hpasid);

Curious why we have to share the same bind_data structure
between bind and unbind, especially when unbind requires
only one field? I don't see a clear reason; it's similar to the
earlier ALLOC/FREE pair, which don't share a structure either.
The current way simply wastes space for the unbind operation...
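For comparison, a dedicated unbind payload could be as small as this
(hypothetical struct name, carrying only the one field unbind actually
uses):

/* hypothetical: minimal unbind payload, mirroring the ALLOC/FREE split */
struct iommu_gpasid_unbind_data {
	__u32	argsz;
	__u32	flags;
	__u64	hpasid;		/* host PASID to unbind */
};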


RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> vSVA for pass-through devices, QEMU setup nesting translation for pass-
> through devices. The guest page table are configured to host as 1st-lvl/
> stage-1 page table. Therefore, guest format should be compatible with
> host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and other alike applications should
> use this format info when trying to setup IOMMU nesting translation on
> host IOMMU.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56
> +
>  include/uapi/linux/vfio.h   |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +  u32 *stage1_format)
> +{
> + struct vfio_domain *domain;
> + u32 format = 0, tmp_format = 0;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> + if (list_empty(&iommu->domain_list)) {
> + mutex_unlock(&iommu->lock);
> + return -EINVAL;
> + }
> +
> + list_for_each_entry(domain, &iommu->domain_list, next) {
> + if (iommu_domain_get_attr(domain->domain,
> + DOMAIN_ATTR_PASID_FORMAT, &format)) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> + /*
> +  * format is always non-zero (the first format is
> +  * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> +  * the reason of potential different backed IOMMU
> +  * formats, here we expect to have identical formats
> +  * in the domain list, no mixed formats support.
> +  * return -EINVAL to fail the attempt of setup
> +  * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +  * are detected.
> +  */
> + if (tmp_format && tmp_format != format) {
> + ret = -EINVAL;
> + format = 0;
> + goto out_unlock;
> + }
> +
> + tmp_format = format;
> + }

This path is invoked only from the VFIO_IOMMU_GET_INFO path. If we don't
want to assume the status quo that one container holds only one
device w/ vIOMMU (the prerequisite for vSVA), it looks like we also need
to check format compatibility when attaching a new group to this
container?
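Such an attach-time check could just walk the existing domains; a sketch
(helper name is made up, assuming iommu->lock is already held by the
caller):

static int vfio_iommu_check_stage1_format(struct vfio_iommu *iommu,
					  struct iommu_domain *new)
{
	struct vfio_domain *d;
	u32 format, new_format;

	if (iommu_domain_get_attr(new, DOMAIN_ATTR_PASID_FORMAT,
				  &new_format))
		return -EINVAL;

	list_for_each_entry(d, &iommu->domain_list, next) {
		if (iommu_domain_get_attr(d->domain,
					  DOMAIN_ATTR_PASID_FORMAT, &format))
			return -EINVAL;
		if (format != new_format)
			return -EINVAL;	/* no mixed formats per container */
	}
	return 0;
}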

> + ret = 0;
> +
> +out_unlock:
> + if (format)
> + *stage1_format = format;
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>struct vfio_info_cap *caps)
>  {
>   struct vfio_info_cap_header *header;
>   struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> + u32 formats = 0;
> + int ret;
> +
> + ret = vfio_iommu_get_stage1_format(iommu, &formats);
> + if (ret) {
> + pr_warn("Failed to get stage-1 format\n");
> + return ret;
> + }
> 
>   header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct
> vfio_iommu *iommu,
>   /* nesting iommu type supports PASID requests (alloc/free)
> */
>   nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;
>   }
> + nesting_cap->stage1_formats = formats;
> 
>   return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>   struct  vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS(1 << 0)
>   __u32   nesting_capabilities;
> + __u32   stage1_formats;

do you plan to support multiple formats?

RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 5:27 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 5:20 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L 
> > > Sent: Monday, March 30, 2020 4:53 PM
> > >
> > > > From: Tian, Kevin 
> > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > > parameter
> > > for quota
> > > > tuning
> > > >
> > > > > From: Liu, Yi L 
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L 
> > > > >
> > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > administrator.
> > > > >
> > > > > TODO: needs to think more on how to  make the tuning to be per-
> process.
> > > > >
> > > > > Previous discussions:
> > > > > https://patchwork.kernel.org/patch/11209429/
> > > > >
> > > > > Cc: Kevin Tian 
> > > > > CC: Jacob Pan 
> > > > > Cc: Alex Williamson 
> > > > > Cc: Eric Auger 
> > > > > Cc: Jean-Philippe Brucker 
> > > > > Signed-off-by: Liu Yi L 
> > > > > ---
> > > > >  drivers/vfio/vfio.c | 8 +++-
> > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > > > >  include/linux/vfio.h| 3 ++-
> > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > d13b483..020a792 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > > *vfio_mm_get_from_task(struct
> > > > > task_struct *task)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > >
> > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > > +int
> > > max)
> > > > >  {
> > > > >   ioasid_t pasid;
> > > > >   int ret = -ENOSPC;
> > > > >
> > > > >   mutex_lock(&vmm->pasid_lock);
> > > > >
> > > > > + /* update quota as it is tunable by admin */
> > > > > + if (vmm->pasid_quota != quota) {
> > > > > + vmm->pasid_quota = quota;
> > > > > + ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > + }
> > > > > +
> > > >
> > > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > > latter
> > > might
> > > > be initiated by non-privileged users. Why not doing the simple math
> > > > in
> > > vfio_
> > > > create_mm to set the quota when the ioasid set is created? even in
> > > > the
> > > future
> > > > you may allow per-process quota setting, that should come from
> > > > separate privileged path instead of thru alloc..
> > >
> > > The reason is the kernel parameter modification has no event which can
> > > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > > path. If it's not good, how about adding one more IOCTL to let user-
> > > space trigger a quota adjustment event? Then even non-privileged user
> > > could trigger quota adjustment, the quota is actually controlled by
> > > privileged user. How about your opinion?
> > >
> >
> > why do you need an event to adjust? As I said, you can set the quota when
> the set is
> > created in vfio_create_mm...
> 
> oh, it's to support runtime adjustments. I guess it may be helpful to let
> per-VM quota tunable even the VM is running. If just set the quota in
> vfio_create_mm(), it is not able to adjust at runtime.
> 

ok, I didn't notice the module parameter was granted write permission.
However, there is a further problem: we cannot support PASID reclaim now.
What if the admin sets a quota smaller than the previous value while some
IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
module parameter change in that situation. Possibly a normal sysfs
node better suits the runtime change requirement...
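One possible shape for that: custom param ops (or an equivalent sysfs
store) which can veto a shrink; a sketch, where
vfio_pasid_quota_shrink_ok() is a hypothetical helper checking current
usage across the IOASID sets:

static int pasid_quota_set(const char *val, const struct kernel_param *kp)
{
	unsigned int quota;
	int ret = kstrtouint(val, 0, &quota);

	if (ret)
		return ret;
	/* hypothetical: refuse to shrink below what is already allocated */
	if (!vfio_pasid_quota_shrink_ok(quota))
		return -EBUSY;

	return param_set_uint(val, kp);
}

static const struct kernel_param_ops pasid_quota_ops = {
	.set = pasid_quota_set,
	.get = param_get_uint,
};
module_param_cb(pasid_quota, &pasid_quota_ops, &pasid_quota, 0644);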

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Allow swiotlb to live at pre-defined address

2020-03-30 Thread Kairui Song
On Sat, Mar 28, 2020 at 7:57 PM Dave Young  wrote:
>
> On 03/26/20 at 05:29pm, Alexander Graf wrote:
> > The swiotlb is a very convenient fallback mechanism for bounce buffering of
> > DMAable data. It is usually used for the compatibility case where devices
> > can only DMA to a "low region".
> >
> > However, in some scenarios this "low region" may be bound even more
> > heavily. For example, there are embedded system where only an SRAM region
> > is shared between device and CPU. There are also heterogeneous computing
> > scenarios where only a subset of RAM is cache coherent between the
> > components of the system. There are partitioning hypervisors, where
> > a "control VM" that implements device emulation has limited view into a
> > partition's memory for DMA capabilities due to safety concerns.
> >
> > This patch adds a command line driven mechanism to move all DMA memory into
> > a predefined shared memory region which may or may not be part of the
> > physical address layout of the Operating System.
> >
> > Ideally, the typical path to set this configuration would be through Device
> > Tree or ACPI, but neither of the two mechanisms is standardized yet. Also,
> > in the x86 MicroVM use case, we have neither ACPI nor Device Tree, but
> > instead configure the system purely through kernel command line options.
> >
> > I'm sure other people will find the functionality useful going forward
> > though and extend it to be triggered by DT/ACPI in the future.
>
> Hmm, we have a use case for kdump, this maybe useful.  For example
> swiotlb is enabled by default if AMD SME/SEV is active, and in kdump
> kernel we have to increase the crashkernel reserved size for the extra
> swiotlb requirement.  I wonder if we can just reuse the old kernel's
> swiotlb region and pass the addr to kdump kernel.
>

Yes, definitely helpful for the kdump kernel. This can help reduce the
crashkernel value.

Previously I was thinking about something similar: playing around with the
e820 entries passed to kdump and letting it place the swiotlb in the wanted
region. Simply remapping it like in this patch looks much cleaner.

If this patch is acceptable, one more patch is needed to expose the
swiotlb region in iomem, so kexec-tools can pass the right kernel cmdline
to the second kernel.
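Exposing it could be as simple as inserting a resource; a sketch reusing
the io_tlb_start/io_tlb_end globals from swiotlb.c (the initcall name is
made up):

static struct resource swiotlb_res = {
	.name	= "Software IO TLB",
	.flags	= IORESOURCE_MEM | IORESOURCE_BUSY,
};

static int __init swiotlb_expose_iomem(void)
{
	if (io_tlb_start) {
		swiotlb_res.start = io_tlb_start;
		swiotlb_res.end   = io_tlb_end - 1;
		insert_resource(&iomem_resource, &swiotlb_res);
	}
	return 0;
}
late_initcall(swiotlb_expose_iomem);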

> >
> > Signed-off-by: Alexander Graf 
> > ---
> >  Documentation/admin-guide/kernel-parameters.txt |  3 +-
> >  Documentation/x86/x86_64/boot-options.rst   |  4 ++-
> >  kernel/dma/swiotlb.c| 46 
> > +++--
> >  3 files changed, 49 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index c07815d230bc..d085d55c3cbe 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4785,11 +4785,12 @@
> >   it if 0 is given (See 
> > Documentation/admin-guide/cgroup-v1/memory.rst)
> >
> >   swiotlb=[ARM,IA-64,PPC,MIPS,X86]
> > - Format: {  | force | noforce }
> > + Format: {  | force | noforce | addr= }
> >-- Number of I/O TLB slabs
> >   force -- force using of bounce buffers even if they
> >wouldn't be automatically used by the kernel
> >   noforce -- Never use bounce buffers (for debugging)
> > + addr= -- Try to allocate SWIOTLB at 
> > defined address
> >
> >   switches=   [HW,M68k]
> >
> > diff --git a/Documentation/x86/x86_64/boot-options.rst 
> > b/Documentation/x86/x86_64/boot-options.rst
> > index 2b98efb5ba7f..ca46c57b68c9 100644
> > --- a/Documentation/x86/x86_64/boot-options.rst
> > +++ b/Documentation/x86/x86_64/boot-options.rst
> > @@ -297,11 +297,13 @@ iommu options only relevant to the AMD GART hardware 
> > IOMMU:
> >  iommu options only relevant to the software bounce buffering (SWIOTLB) 
> > IOMMU
> >  implementation:
> >
> > -swiotlb=[,force]
> > +swiotlb=[,force][,addr=]
> >
> >  Prereserve that many 128K pages for the software IO bounce 
> > buffering.
> >force
> >  Force all IO through the software TLB.
> > +  addr=
> > +Try to allocate SWIOTLB at defined address
> >
> >  Settings for the IBM Calgary hardware IOMMU currently found in IBM
> >  pSeries and xSeries machines
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index c19379fabd20..83da0caa2f93 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -46,6 +46,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -102,6 +103,12 @@ unsigned int max_segment;
> >  #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
> >  static phys_addr_t *io_tlb_orig_addr;
> >
> > +/*
> > + * The TLB phys addr may be defined on the command line. Store it here if 
> > it is.

RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> thus userspace could do a pre-check before utilizing this feature.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 
>  include/uapi/linux/vfio.h   |  8 
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>   return ret;
>  }
> 
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +  struct vfio_info_cap *caps)
> +{
> + struct vfio_info_cap_header *header;
> + struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> + header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> + if (IS_ERR(header))
> + return PTR_ERR(header);
> +
> + nesting_cap = container_of(header,
> + struct vfio_iommu_type1_info_cap_nesting,
> + header);
> +
> + nesting_cap->nesting_capabilities = 0;
> + if (iommu->nesting) {

Is it good to report a nesting cap when iommu->nesting is disabled? I suppose
the check should move before vfio_info_cap_add...
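i.e. something like this sketch, which skips the capability entirely for
non-nesting containers:

static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
					   struct vfio_info_cap *caps)
{
	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
	struct vfio_info_cap_header *header;

	if (!iommu->nesting)
		return 0;	/* nothing to report for this container */

	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
	if (IS_ERR(header))
		return PTR_ERR(header);

	nesting_cap = container_of(header,
			struct vfio_iommu_type1_info_cap_nesting, header);
	/* nesting iommu type supports PASID requests (alloc/free) */
	nesting_cap->nesting_capabilities = VFIO_IOMMU_PASID_REQS;

	return 0;
}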

> + /* nesting iommu type supports PASID requests (alloc/free)
> */
> + nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;

VFIO_IOMMU_CAP_PASID_REQ? to avoid confusion with ioctl cmd
VFIO_IOMMU_PASID_REQUEST...

> + }
> +
> + return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>   if (ret)
>   return ret;
> 
> + ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> + if (ret)
> + return ret;
> +
>   if (caps.size) {
>   info.flags |= VFIO_IOMMU_INFO_CAPS;
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>   struct  vfio_iova_range iova_ranges[];
>  };
> 
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> + struct  vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS(1 << 0)
> + __u32   nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
>  /**
> --
> 2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Monday, March 30, 2020 5:20 PM
> To: Liu, Yi L ; alex.william...@redhat.com;
> Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for 
> quota
> tuning
> 
> > From: Liu, Yi L 
> > Sent: Monday, March 30, 2020 4:53 PM
> >
> > > From: Tian, Kevin 
> > > Sent: Monday, March 30, 2020 4:41 PM
> > > To: Liu, Yi L ; alex.william...@redhat.com;
> > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > parameter
> > for quota
> > > tuning
> > >
> > > > From: Liu, Yi L 
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L 
> > > >
> > > > This patch adds a module option to make the PASID quota tunable by
> > > > administrator.
> > > >
> > > > TODO: needs to think more on how to  make the tuning to be per-process.
> > > >
> > > > Previous discussions:
> > > > https://patchwork.kernel.org/patch/11209429/
> > > >
> > > > Cc: Kevin Tian 
> > > > CC: Jacob Pan 
> > > > Cc: Alex Williamson 
> > > > Cc: Eric Auger 
> > > > Cc: Jean-Philippe Brucker 
> > > > Signed-off-by: Liu Yi L 
> > > > ---
> > > >  drivers/vfio/vfio.c | 8 +++-
> > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > > >  include/linux/vfio.h| 3 ++-
> > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > d13b483..020a792 100644
> > > > --- a/drivers/vfio/vfio.c
> > > > +++ b/drivers/vfio/vfio.c
> > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > *vfio_mm_get_from_task(struct
> > > > task_struct *task)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > >
> > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > +int
> > max)
> > > >  {
> > > > ioasid_t pasid;
> > > > int ret = -ENOSPC;
> > > >
> > > > mutex_lock(&vmm->pasid_lock);
> > > >
> > > > +   /* update quota as it is tunable by admin */
> > > > +   if (vmm->pasid_quota != quota) {
> > > > +   vmm->pasid_quota = quota;
> > > > +   ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > +   }
> > > > +
> > >
> > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > latter
> > might
> > > be initiated by non-privileged users. Why not doing the simple math
> > > in
> > vfio_
> > > create_mm to set the quota when the ioasid set is created? even in
> > > the
> > future
> > > you may allow per-process quota setting, that should come from
> > > separate privileged path instead of thru alloc..
> >
> > The reason is the kernel parameter modification has no event which can
> > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > path. If it's not good, how about adding one more IOCTL to let user-
> > space trigger a quota adjustment event? Then even non-privileged user
> > could trigger quota adjustment, the quota is actually controlled by
> > privileged user. How about your opinion?
> >
> 
> why do you need an event to adjust? As I said, you can set the quota when the 
> set is
> created in vfio_create_mm...

oh, it's to support runtime adjustment. I guess it may be helpful to keep
the per-VM quota tunable even while the VM is running. If the quota is only
set in vfio_create_mm(), it cannot be adjusted at runtime.

Regards,
Yi Liu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Monday, March 30, 2020 4:53 PM
> 
> > From: Tian, Kevin 
> > Sent: Monday, March 30, 2020 4:41 PM
> > To: Liu, Yi L ; alex.william...@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L 
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L 
> > >
> > > This patch adds a module option to make the PASID quota tunable by
> > > administrator.
> > >
> > > TODO: needs to think more on how to  make the tuning to be per-process.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > >  drivers/vfio/vfio.c | 8 +++-
> > >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> > >  include/linux/vfio.h| 3 ++-
> > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index d13b483..020a792 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> *vfio_mm_get_from_task(struct
> > > task_struct *task)
> > >  }
> > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > >
> > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int
> max)
> > >  {
> > >   ioasid_t pasid;
> > >   int ret = -ENOSPC;
> > >
> > >   mutex_lock(&vmm->pasid_lock);
> > >
> > > + /* update quota as it is tunable by admin */
> > > + if (vmm->pasid_quota != quota) {
> > > + vmm->pasid_quota = quota;
> > > + ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > + }
> > > +
> >
> > It's a bit weird to have quota adjusted in the alloc path, since the latter
> might
> > be initiated by non-privileged users. Why not doing the simple math in
> vfio_
> > create_mm to set the quota when the ioasid set is created? even in the
> future
> > you may allow per-process quota setting, that should come from separate
> > privileged path instead of thru alloc..
> 
> The reason is the kernel parameter modification has no event which
> can be used to adjust the quota. So I chose to adjust it in pasid_alloc
> path. If it's not good, how about adding one more IOCTL to let user-
> space trigger a quota adjustment event? Then even non-privileged
> user could trigger quota adjustment, the quota is actually controlled
> by privileged user. How about your opinion?
> 

why do you need an event to adjust? As I said, you can set the quota
when the set is created in vfio_create_mm...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Liu, Yi L
> From: Tian, Kevin 
> Sent: Monday, March 30, 2020 4:41 PM
> To: Liu, Yi L ; alex.william...@redhat.com;
> Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for 
> quota
> tuning
> 
> > From: Liu, Yi L 
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L 
> >
> > This patch adds a module option to make the PASID quota tunable by
> > administrator.
> >
> > TODO: needs to think more on how to  make the tuning to be per-process.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Signed-off-by: Liu Yi L 
> > ---
> >  drivers/vfio/vfio.c | 8 +++-
> >  drivers/vfio/vfio_iommu_type1.c | 7 ++-
> >  include/linux/vfio.h| 3 ++-
> >  3 files changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index d13b483..020a792 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> > task_struct *task)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> >
> > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
> >  {
> > ioasid_t pasid;
> > int ret = -ENOSPC;
> >
> > mutex_lock(&vmm->pasid_lock);
> >
> > +   /* update quota as it is tunable by admin */
> > +   if (vmm->pasid_quota != quota) {
> > +   vmm->pasid_quota = quota;
> > +   ioasid_adjust_set(vmm->ioasid_sid, quota);
> > +   }
> > +
> 
> It's a bit weird to have quota adjusted in the alloc path, since the latter 
> might
> be initiated by non-privileged users. Why not doing the simple math in vfio_
> create_mm to set the quota when the ioasid set is created? even in the future
> you may allow per-process quota setting, that should come from separate
> privileged path instead of thru alloc..

The reason is that modifying the kernel parameter generates no event
which can be used to adjust the quota, so I chose to adjust it in the
pasid_alloc path. If that's not good, how about adding one more IOCTL
to let userspace trigger a quota adjustment event? Then even though a
non-privileged user could trigger the adjustment, the quota would still
be controlled by the privileged user. What's your opinion?

Regards,
Yi Liu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> This patch adds a module option to make the PASID quota tunable by
> administrator.
> 
> TODO: needs to think more on how to  make the tuning to be per-process.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> ---
>  drivers/vfio/vfio.c | 8 +++-
>  drivers/vfio/vfio_iommu_type1.c | 7 ++-
>  include/linux/vfio.h| 3 ++-
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d13b483..020a792 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> task_struct *task)
>  }
>  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> 
> -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
>  {
>   ioasid_t pasid;
>   int ret = -ENOSPC;
> 
>   mutex_lock(&vmm->pasid_lock);
> 
> + /* update quota as it is tunable by admin */
> + if (vmm->pasid_quota != quota) {
> + vmm->pasid_quota = quota;
> + ioasid_adjust_set(vmm->ioasid_sid, quota);
> + }
> +

It's a bit weird to have the quota adjusted in the alloc path, since the
latter might be initiated by non-privileged users. Why not do the simple
math in vfio_create_mm to set the quota when the ioasid set is created?
Even in the future, when you may allow per-process quota setting, that
should come from a separate privileged path instead of through alloc...
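i.e. capture the quota once when the ioasid set is created; a sketch (body
abbreviated, field and parameter names taken from this series):

static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
{
	struct vfio_mm *vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);

	if (!vmm)
		return ERR_PTR(-ENOMEM);

	/* the pasid_quota module parameter is read exactly once per VM */
	vmm->pasid_quota = pasid_quota;
	/* ... ioasid set creation and list insertion as in patch 1/8 ... */

	return vmm;
}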

>   pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
>   if (pasid == INVALID_IOASID) {
>   ret = -ENOSPC;
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 331ceee..e40afc0 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -60,6 +60,11 @@ module_param_named(dma_entry_limit,
> dma_entry_limit, uint, 0644);
>  MODULE_PARM_DESC(dma_entry_limit,
>"Maximum number of user DMA mappings per container
> (65535).");
> 
> +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +module_param_named(pasid_quota, pasid_quota, uint, 0644);
> +MODULE_PARM_DESC(pasid_quota,
> +  "Quota of user owned PASIDs per vfio-based application
> (1000).");
> +
>  struct vfio_iommu {
>   struct list_headdomain_list;
>   struct list_headiova_list;
> @@ -2200,7 +2205,7 @@ static int vfio_iommu_type1_pasid_alloc(struct
> vfio_iommu *iommu,
>   goto out_unlock;
>   }
>   if (vmm)
> - ret = vfio_mm_pasid_alloc(vmm, min, max);
> + ret = vfio_mm_pasid_alloc(vmm, pasid_quota, min, max);
>   else
>   ret = -EINVAL;
>  out_unlock:
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 75f9f7f1..af2ef78 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -106,7 +106,8 @@ struct vfio_mm {
> 
>  extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
>  extern void vfio_mm_put(struct vfio_mm *vmm);
> -extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm,
> + int quota, int min, int max);
>  extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> 
>  /*
> --
> 2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)

2020-03-30 Thread Tian, Kevin
> From: Liu, Yi L 
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L 
> 
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity

are->is

> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the

"address space" -> "address spaces"

"binding the" -> "binding each"

> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access. It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.
> 
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such

"Additionally, because such"

> requests are intended to be invoked by QEMU or other applications which

simplify to "intended to be invoked from userspace"

> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is

devices -> device

> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
> 
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/vfio/vfio.c | 130
> 
>  drivers/vfio/vfio_iommu_type1.c | 104
> 
>  include/linux/vfio.h|  20 +++
>  include/uapi/linux/vfio.h   |  41 +
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define DRIVER_VERSION   "0.3"
>  #define DRIVER_AUTHOR"Alex Williamson
> "
> @@ -46,6 +47,8 @@ static struct vfio {
>   struct mutexgroup_lock;
>   struct cdev group_cdev;
>   dev_t   group_devt;
> + struct list_headvfio_mm_list;
> + struct mutexvfio_mm_lock;
>   wait_queue_head_t   release_q;
>  } vfio;
> 
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
> 
>  /**
> + * VFIO_MM objects - create, release, get, put, search

why capitalizing vfio_mm?

> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)